Larry Wall, Tom Christiansen and Randal Schwartz, "Programming Perl", Second Edition, O'Reilly & Associates, 1996
"Perl is designed to make the easy jobs easy, without making the hard jobs impossible"
print "Hello world!\n";
The typical "corpus linguistics" Perl program:
implicit notation:
#!/Utils/bin/perl5
while (<STDIN>) {
tr/A-Z/a-z/; # 'translate' all upper case to lower case
s/hij/zij/; # make the text politically correct
print;
}
explicit notation:
#!/Utils/bin/perl5
while (defined($_ = <STDIN>)) {
$_ =~ tr/A-Z/a-z/;
$_ =~ s/hij/zij/;
print $_;
}
exit(0);
Typical use of perl-script on the Unix command line
% cat corpus.txt | perl-script.pl > annotated-corpus.sgml
or
% cat corpus.txt | perl-script.pl | more
The 'trigrams.pl' script:
#!/Utils/bin/perl5 -w
$first = ". ";
while (<STDIN>) {
tr/A-Z\n\!\?/a-z \.\./; # translate upper case to lower case;
# new line to space; '!' and '?' to '.'
s/[^a-z \.]/ /g; # replace every character except letters,
# space and period with a space
s/ \./ /g; # replace a space followed by a period with a
# space (spurious sentence boundaries)
s/ +/ /g; # collapse runs of spaces into a single space
s/^[ \.]+//; # remove leading spaces and periods
$rest = $_;
while($rest) {
$trigram = $first . substr($rest,0,1);
$first = substr($trigram,1,2);
$rest = substr($rest,1);
print "$trigram\n";
}
}
Result:
% cat utnieuws.txt | ./trigrams.pl | more
. c
 cv
cvb
vb 
b v
 vi
vin
ind
ndt
dt 
t v
 vo
voo
oor
ors
rst
-- More --
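To see what the normalisation and the sliding window do to a single line, the same steps can be wrapped in a subroutine and run on an inline string. This is only a sketch: the sub name trigrams_of and the sample sentence are made up for illustration, and it uses "use strict", which the script above does not.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the new two-character state followed by the
# character trigrams of one line. $first holds the two characters that
# precede the line, as in trigrams.pl.
sub trigrams_of {
    my ($first, $line) = @_;
    $line =~ tr/A-Z\n!?/a-z ../;   # lower-case; newline -> space; '!', '?' -> '.'
    $line =~ s/[^a-z .]/ /g;       # anything but letters, space, period -> space
    $line =~ s/ \./ /g;            # drop periods that follow a space
    $line =~ s/ +/ /g;             # collapse runs of spaces
    $line =~ s/^[ .]+//;           # strip leading spaces and periods
    my @out;
    for my $char (split //, $line) {
        my $trigram = $first . $char;
        push @out, $trigram;
        $first = substr($trigram, 1, 2);   # slide the window one character
    }
    return ($first, @out);
}

# Made-up sample line; '|' separates the trigrams so spaces stay visible.
my ($state, @tri) = trigrams_of(". ", "Het is zo!\n");
print join("|", @tri), "\n";
```

Passing the returned state back in as $first for the next line keeps trigrams that span a line break, just as the global $first does in the script.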
The 'count.pl' script:
#!/Utils/bin/perl5 -w
while (<STDIN>) {
chomp;
$trigram{$_}++;
}
while (($trigram, $freq) = each(%trigram)) {
print "$trigram\t$freq\n";
}
% cat utnieuws.txt | ./trigrams.pl | ./count.pl | more
l t 7
l u 1
ula 1
l v 12
un 7
l w 2
uld 3
brr 1
l z 10
ule 4
-- More --
Unix sort and grep
sort alphabetically: the first trigram is '_aa' ('_' stands for a space):
% cat utnieuws.txt | ./trigrams.pl | ./count.pl > dutch.tab
% cat dutch.tab | sort | more
 aa 46
 ac 14
 ad 8
 af 12
 ag 1
 al 45
 am 2
 an 11
 ap 1
 ar 7
 as 11
-- More --
how many times did the word "u" appear in the text?
% cat dutch.tab | sort | grep " u "
 u  1
sort on second column, reversed order:
% cat dutch.tab | sort -k 2 -r -n | more
en 563
de 334
de 294
et 169
n d 159
an 153
he 139
aar 131
er 128
va 121
en 118
ten 112
den 111
van 110
een 110
in 109
ijk 105
het 100
te 99
oor 99
sch 98
st 97
gen 96
-- More --
sort on second column, from low to high:
% cat dutch.tab | sort -k 2 -n | more
ag 1
ap 1
av 1
bb 1
cl 1
cr 1
ct 1
cv 1
dw 1
ec 1
ep 1
fl 1
hb 1
il 1
k. 1
kn 1
kp 1
lb 1
-- More --
Zipf's law: rank * frequency is approximately constant
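The rank * frequency product is easy to tabulate. The sketch below does so for the ten highest counts from the reversed sort above; the sub name rank_times_freq is made up for illustration, and ten values are far too few to judge how constant the product is on the full table.

```perl
use strict;
use warnings;

# Hypothetical helper: sort the counts from high to low and multiply each
# one by its rank (1 = most frequent).
sub rank_times_freq {
    my @sorted = sort { $b <=> $a } @_;
    return map { ($_ + 1) * $sorted[$_] } 0 .. $#sorted;
}

# The ten highest counts from "sort -k 2 -r -n" on dutch.tab.
my @top = (563, 334, 294, 169, 159, 153, 139, 131, 128, 121);
my @products = rank_times_freq(@top);

# Columns: rank, frequency, rank * frequency.
for my $rank (1 .. scalar @products) {
    printf "%2d\t%d\t%d\n", $rank, $top[$rank - 1], $products[$rank - 1];
}
```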
count the number of lines in the trigram table (= the number of distinct trigrams):
% cat utnieuws.txt | ./trigrams.pl | ./count.pl | wc -l
2532
Total number of possible trigrams: 28^3 = 21952, so 19420 trigrams
have frequency zero!
Lots of estimates are based on very little evidence: "Sparse Data Problem"
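The arithmetic behind these numbers, as a small sketch. The alphabet size of 28 follows from the symbols the trigram script keeps (26 letters, space and period); the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $alphabet = 28;                  # '.', ' ' and 'a'..'z'
my $possible = $alphabet ** 3;      # every three-character combination
my $observed = 2532;                # distinct trigrams, from wc -l above
my $unseen   = $possible - $observed;

printf "%d of %d possible trigrams never occur (%.1f%%)\n",
    $unseen, $possible, 100 * $unseen / $possible;
```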
The 'generate.pl' script:
#!/Utils/bin/perl5
$pos{'.'} = 0; # period will be 0
$pos{' '} = 1; # space will be 1
for ($i = 0; $i < 26; $i++) { # 'a' to 'z' will be 2 to 27
$pos{chr ($i + 97)} = $i + 2;
}
while (<STDIN>) { # read trigrams and
($one,$two,$three,$freq) = /^(.)(.)(.)\t?(\d+)?/; # put them in array
$freq = 1 unless defined $freq; # @table
$table[$pos{$one}*784 + $pos{$two}*28 + $pos{$three}] += $freq;
}
for ($zin = 0; $zin < 10; $zin++) { # generate 10 random sentences
$first = "."; # leading bigram is ". "
$second = " ";
do { # first calculate total number of
$total = 0; # possibilities given leading bigram
$posbigram = $pos{$first} * 784 + $pos{$second} * 28;
for ($i = 0; $i < 28; $i++) {
$total += $table[$posbigram + $i];
}
$i = 0;
$next = rand($total); # $next is random between 0 and total
while ($next >= $table[$posbigram + $i]) {
$next -= $table[$posbigram + $i++];
}
if ($i == 0) { print ".\n"; } # new line after period
elsif ($i == 1) { print " "; } # '1' is space
elsif ($first eq ".") { print chr(63 + $i); } # first in upper
else { print chr(95 + $i); } # rest in lower case
$first = $second; # new leading bigram
if ($i == 1) { $second = " "; }
else { $second = chr(95 + $i); }
} while ($i > 0); # until sentence boundary
}
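Two ideas carry the script: the flat table index (position of the first character * 784 + second * 28 + third) and the weighted random choice of the next character. The sketch below isolates both. The sub names trigram_index and pick_index and the four counts are made up for illustration.

```perl
use strict;
use warnings;

# Map '.', ' ' and 'a'..'z' to 0..27, as in generate.pl.
my %pos = ('.' => 0, ' ' => 1);
$pos{chr($_ + 97)} = $_ + 2 for 0 .. 25;

# Hypothetical helper: flat index of a trigram in a 28*28*28 table.
sub trigram_index {
    my ($one, $two, $three) = @_;
    return $pos{$one} * 784 + $pos{$two} * 28 + $pos{$three};
}

# Hypothetical helper: given a number $r with 0 <= $r < sum of counts,
# return the index whose cumulative range contains $r. An index with
# count 0 can never be returned.
sub pick_index {
    my ($r, @counts) = @_;
    my $i = 0;
    while ($r >= $counts[$i]) {
        $r -= $counts[$i++];
    }
    return $i;
}

my @counts = (2, 0, 5, 3);          # made-up counts for four successors
my $total = 0;
$total += $_ for @counts;
my $choice = pick_index(rand($total), @counts);
print "picked successor $choice\n";
```

Each successor is picked with probability count / total, so frequent trigrams dominate the generated text.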
The script at work:
% cat dutch.tab | ./generate.pl
Dri reert wet val gland maarhet tof twergeleke regewen.
Mactijk an eerzo en ven dingebberm vroc wegaatijkeep verde bie ze uit vanstandwer klij wijkt je uitig blin bijwelijn de ing een.
Nu maars zuur doord erjaarouweend op enen daafgebben bedreilinor zieurverscus dig ac welijk ten datertrar ocen en den zoglag opel bruitiges onnedie en.
Bewerdt ling maag met rachten gezet daat stannelig astuur injes zijk prop en.
Neders zij sproor kuntionneen op dij den eelleidt ik al par en compus der luys eemt en.
% cat english.tab | ./generate.pl
(% cat utnieuws.txt | ./trigrams.pl | ./count.pl | ./generate.pl)
A real application of trigrams: language identification! ('probability.pl')
"Read a document and guess its language using the trigrams of corpora with known languages"
Sparse data problem: What to do with zero frequencies?
Possible Solutions:
- add a small constant to each trigram.
- re-estimate using held-out data
- mix the trigram model with model(s) that suffer less from sparseness (for instance
a bigram model or a unigram model)
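The first solution, adding a small constant, can be sketched as follows. This is an illustration under assumptions, not the method used by 'probability.pl', which is not shown here: the sub name smoothed_prob, the constant 0.5 and the counts are all made up.

```perl
use strict;
use warnings;

my $ALPHABET = 28;   # '.', ' ' and 'a'..'z'

# Hypothetical helper: P(third character | first two) after adding $k to
# every trigram count. %$tri maps trigram => count; %$bi maps the leading
# bigram => sum of the counts of all trigrams that start with it.
sub smoothed_prob {
    my ($tri, $bi, $trigram, $k) = @_;
    my $context = substr($trigram, 0, 2);
    my $c3 = $tri->{$trigram} || 0;
    my $c2 = $bi->{$context}  || 0;
    return ($c3 + $k) / ($c2 + $k * $ALPHABET);
}

# Made-up counts for illustration only.
my %tri = ('een' => 3, 'eer' => 1);
my %bi  = ('ee' => 4);

printf "P(n|ee) = %.4f\n", smoothed_prob(\%tri, \%bi, 'een', 0.5);   # seen
printf "P(x|ee) = %.4f\n", smoothed_prob(\%tri, \%bi, 'eex', 0.5);   # unseen
```

An unseen trigram now gets a small non-zero probability, and the 28 probabilities for one leading bigram still sum to 1.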
Example:
% echo "Wat voor taal zou dit kunnen zijn?" | ./trigrams.pl | ./count.pl | ./probability.pl dutch.tab
Log likelihood :-67.535513099
Random :-116.627157856
% echo "Wat voor taal zou dit kunnen zijn?" | ./trigrams.pl | ./count.pl | ./probability.pl english.tab
Log likelihood :-116.793448617
Random :-116.627157856
The log likelihood (= the log of the probability) of the sentence under the Dutch trigram model is much higher than the random baseline. Under the English trigram model, however, it is close to random (slightly below it). The best guess is therefore that the example sentence is Dutch.
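The 'probability.pl' script itself is not shown, so the following is only a sketch of how such a score could be computed. It rests on one assumption: the "Random" figure is the score of a model in which each of the 28 symbols is equally likely as the next character. The example sentence yields 35 trigrams, and 35 * ln(1/28) reproduces the figure printed above. The sub names log_likelihood and random_baseline are made up for illustration.

```perl
use strict;
use warnings;

my $ALPHABET = 28;   # '.', ' ' and 'a'..'z'

# Hypothetical helper: log likelihood of a counted set of trigrams.
# %$test maps trigram => count in the text to classify; $prob is a code
# reference returning P(third character | first two) for a trigram and
# must never return 0 (hence the need for smoothing).
sub log_likelihood {
    my ($test, $prob) = @_;
    my $ll = 0;
    for my $trigram (keys %$test) {
        $ll += $test->{$trigram} * log($prob->($trigram));
    }
    return $ll;
}

# Baseline: every next character equally likely.
sub random_baseline {
    my ($n_trigrams) = @_;
    return $n_trigrams * log(1 / $ALPHABET);
}

printf "Random :%.9f\n", random_baseline(35);
```

The language whose model gives the highest log likelihood is the guess. A score near the baseline means the model knows little more about the text than chance.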