TRIGRAM EXERCISE - A PERL AND UNIX APPROACH

Larry Wall, Tom Christiansen and Randal Schwartz, "Programming Perl", Second Edition, O'Reilly & Associates, 1996

"Perl is designed to make the easy jobs easy, without making the hard jobs impossible"

print "Hello world!\n";

 

The typical "corpus linguistics" Perl program:

implicit notation:

#!/Utils/bin/perl5

while (<STDIN>) {
  tr/A-Z/a-z/;       # 'translate' all upper case to lower case
  s/hij/zij/;        # make the text politically correct
  print;
}

explicit notation:

#!/Utils/bin/perl5

while (defined($_ = <STDIN>)) {   # <STDIN> returns undef at end of file
  $_ =~ tr/A-Z/a-z/;
  $_ =~ s/hij/zij/;
  print $_;
}
exit(0);

 

Typical use of a Perl script on the Unix command line:

% cat corpus.txt | perl-script.pl > annotated-corpus.sgml

or

% cat corpus.txt | perl-script.pl | more

 

The 'trigrams.pl' script:

#!/Utils/bin/perl5 -w

$first = ". ";
while (<STDIN>) {
  tr/A-Z\n\!\?/a-z \.\./; # translate upper case to lower case,
                          # newline to space, '!' and '?' to '.'
  s/[^a-z \.]/ /g;        # replace every character except letters,
                          # space and period with a space
  s/ \./ /g;              # replace a space followed by a period with
                          # a space (spurious sentence boundaries)
  s/ +/ /g;               # collapse runs of spaces into a single space
  s/^[ \.]+//;            # remove leading spaces and periods

  $rest = $_; 
  while($rest) {
    $trigram = $first . substr($rest,0,1);
    $first = substr($trigram,1,2);
    $rest = substr($rest,1);
    print "$trigram\n";
  }
}

 

Result:

% cat utnieuws.txt | ./trigrams.pl | more
. c
 cv
cvb
vb 
b v
 vi
vin
ind
ndt
dt 
t v
 vo
voo
oor
ors
rst
-- More --

 

The 'count.pl' script:

#!/Utils/bin/perl5 -w

while (<STDIN>) {
  chomp;                  # remove the trailing newline only
  $trigram{$_}++;
}

while (($trigram, $freq) = each(%trigram)) {  
  print "$trigram\t$freq\n";
}

 

% cat utnieuws.txt | ./trigrams.pl | ./count.pl | more
l t     7
l u     1
ula     1
l v     12
un      7
l w     2
uld     3
brr     1
l z     10
ule     4
-- More --

 

Unix sort and grep

sort alphabetically: the first trigram is '_aa' ('_' stands for a space):

% cat utnieuws.txt | ./trigrams.pl | ./count.pl > dutch.tab

% cat dutch.tab | sort | more
 aa     46
 ac     14
 ad     8
 af     12
 ag     1
 al     45
 am     2
 an     11
 ap     1
 ar     7
 as     11
-- More --

 

how many times did the word "u" appear in the text?
% cat dutch.tab | sort | grep " u "
 u     1

 

sort on second column, reversed order:
% cat dutch.tab | sort -k 2 -r -n | more
en      563
de      334
 de     294
et      169
n d     159
an      153
 he     139
aar     131
er      128
 va     121
 en     118
ten     112
den     111
van     110
een     110
 in     109
ijk     105
het     100
te      99
oor     99
sch     98
 st     97
gen     96
-- More --

 

sort on second column, from low to high:
% cat dutch.tab | sort -k 2 -n | more
 ag     1
 ap     1
 av     1
 bb     1
 cl     1
 cr     1
 ct     1
 cv     1
 dw     1
 ec     1
 ep     1
 fl     1
 hb     1
 il     1
 k.     1
 kn     1
 kp     1
 lb     1
-- More --

 

Zipf's law: rank * frequency is approximately constant
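
One way to look at this on the trigram table is to rank the trigrams by frequency and print rank * frequency for each. The following is a minimal sketch, not one of the exercise scripts: the helper name 'rank_times_freq' is hypothetical, and the sample counts are the four most frequent trigrams from the sorted Dutch table above.

```perl
use strict;
use warnings;

# Hypothetical helper: given a hash ref of trigram => frequency, return
# rows of [rank, trigram, frequency, rank * frequency], most frequent first.
sub rank_times_freq {
    my ($freq) = @_;
    my @sorted = sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq;
    my @rows;
    my $rank = 0;
    for my $trigram (@sorted) {
        $rank++;
        push @rows, [ $rank, $trigram, $freq->{$trigram}, $rank * $freq->{$trigram} ];
    }
    return @rows;
}

# The four most frequent trigrams from the sorted Dutch table.
my %sample = ("en " => 563, "de " => 334, " de" => 294, "et " => 169);

for my $row (rank_times_freq(\%sample)) {
    printf "%d\t'%s'\t%d\t%d\n", @$row;   # rank, trigram, freq, rank * freq
}
```

To run this on a whole table, the hash could be filled from the lines of 'dutch.tab' (trigram, tab, frequency) instead of from the sample. The products for the first few ranks vary quite a bit, which is why the law only says "approximately".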

 

count the number of lines in the table (= the number of distinct trigrams):
% cat utnieuws.txt | ./trigrams.pl | ./count.pl | wc -l
  2532

Total number of possible trigrams: 28^3 = 21952 (26 letters, space and period), so 21952 - 2532 = 19420 trigrams have frequency zero!
Lots of estimates are based on very little evidence: "Sparse Data Problem"
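
The arithmetic behind these numbers can be checked with a few lines of Perl. This is only an illustrative sketch: the helper name 'unseen_trigrams' is hypothetical, and the count of 2532 is the 'wc -l' result shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: number of possible trigrams that never occur,
# given the alphabet size and the number of distinct trigrams seen.
sub unseen_trigrams {
    my ($alphabet, $seen) = @_;
    return $alphabet ** 3 - $seen;
}

my $alphabet = 28;      # 26 letters plus space and period
my $seen     = 2532;    # distinct trigrams reported by wc -l
my $possible = $alphabet ** 3;
my $unseen   = unseen_trigrams($alphabet, $seen);

printf "possible: %d  seen: %d  unseen: %d (%.1f%%)\n",
    $possible, $seen, $unseen, 100 * $unseen / $possible;
```

Roughly 88% of all possible trigrams never occur in the corpus.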

 

The 'generate.pl' script:

#!/Utils/bin/perl5

$pos{'.'} = 0;                         # period will be 0
$pos{' '} = 1;                         # space will be 1
for ($i = 0; $i < 26; $i++) {          # 'a' to 'z' will be 2 to 27
  $pos{chr ($i + 97)} = $i + 2; 
} 
while (<STDIN>) {                                   # read trigrams and
 ($one,$two,$three,$freq) = /^(.)(.)(.)\t?(\d+)?/;  # put them in array     
 $freq = 1 unless defined $freq;                    # @table
 $table[$pos{$one}*784 + $pos{$two}*28 + $pos{$three}] += $freq;  
}

for ($zin = 0; $zin < 10; $zin++) { # generate 10 random sentences
  $first =  ".";                    # leading bigram is ". "
  $second = " ";
  do {                              # first calculate total number of
    $total = 0;                     # possibilities given leading bigram
    $posbigram = $pos{$first} * 784 + $pos{$second} * 28;
    for ($i = 0; $i < 28; $i++) {
      $total += $table[$posbigram + $i];
    }
    $i = 0;
    $next = rand($total);     # $next is random between 0 and total
    while ($next >= $table[$posbigram + $i]) {
      $next -= $table[$posbigram + $i++]; 
    }
    if ($i == 0) { print ".\n"; }           # new line after period 
    elsif ($i == 1) { print " "; }          # '1' is space
    elsif ($first eq ".") { print chr(63 + $i); } # first in upper
    else { print chr(95 + $i); }                # rest in lower case
    $first = $second;                           # new leading bigram
    if ($i == 1) { $second = " "; }
    else { $second = chr(95 + $i); }
  } while ($i > 0);                      # until sentence boundary
}

 

The script at work:

% cat dutch.tab | ./generate.pl
Dri reert wet val gland maarhet tof twergeleke regewen.
Mactijk an eerzo en ven dingebberm vroc wegaatijkeep verde bie ze uit vanstandwer klij wijkt je uitig blin bijwelijn de ing een.
Nu maars zuur doord erjaarouweend op enen daafgebben bedreilinor zieurverscus dig ac welijk ten datertrar ocen en den zoglag opel bruitiges onnedie en.
Bewerdt ling maag met rachten gezet daat stannelig astuur injes zijk prop en.
Neders zij sproor kuntionneen op dij den eelleidt ik al par en compus der luys eemt en.

 

% cat english.tab | ./generate.pl
Promphorkedur ganyouncre re the la servilifich por i ding s go mer ford mus inneed le ing amerned i haterhy flophouregis bunse frodfallimpornmesers up ont injust was wrisho aned enters card will mino he weenew after of thant men catientrageas frobly of unneti arserts inciputo tred the in a fiew of po saimic re.
Youbly chatur whosby der calims in toked boarlitionestions oreasights a turn repursaxophor.
Long to dernic fould not i downhiser for reare daymon lims feran earid ton.
Toodsh se.
Rowe mongis nover thm thers maniatur oncrepting reary bour hang t the was nevor.

 

(% cat utnieuws.txt | ./trigrams.pl | ./count.pl | ./generate.pl
and
% cat utnieuws.txt | ./trigrams.pl | ./generate.pl
will also do the trick, because 'generate.pl' counts a line without a frequency as 1)

 

A real application of trigrams: language identification! ('probability.pl')

"Read a document and guess its language using the trigrams of corpora with known languages"

Sparse data problem: What to do with zero frequencies?

Possible Solutions:
- add a small constant to each trigram.
- re-estimate using held-out data
- mix the trigram model with model(s) that suffer less from sparseness (for instance a bigram model or a unigram model)
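
The first option can be sketched as follows. This is an illustration under assumptions, not the method used by 'probability.pl' (which is not listed here): the helper names 'context_totals' and 'add_constant_prob' and the constant 0.5 are hypothetical. The probability of the third character given the first two is estimated as (count + k) / (count of the leading bigram + 28 * k).

```perl
use strict;
use warnings;

my $ALPHABET = 28;    # 26 letters plus space and period

# Hypothetical helper: total count per leading bigram,
# computed from a hash ref of trigram => count.
sub context_totals {
    my ($tri) = @_;
    my %total;
    $total{ substr($_, 0, 2) } += $tri->{$_} for keys %$tri;
    return \%total;
}

# Hypothetical helper: add-constant estimate of
# P(third character | first two characters); $k is the constant.
sub add_constant_prob {
    my ($tri, $total, $trigram, $k) = @_;
    my $count   = $tri->{$trigram} || 0;
    my $context = $total->{ substr($trigram, 0, 2) } || 0;
    return ($count + $k) / ($context + $k * $ALPHABET);
}

# Toy counts, for illustration only (not taken from the corpus).
my %tri   = ("de " => 3, "den" => 1);
my $total = context_totals(\%tri);

printf "seen   'de ': %.4f\n", add_constant_prob(\%tri, $total, "de ", 0.5);
printf "unseen 'dez': %.4f\n", add_constant_prob(\%tri, $total, "dez", 0.5);
```

An unseen trigram now gets a small non-zero probability, and the estimates for one leading bigram still sum to 1 over the 28 possible third characters.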

Example:

% echo "Wat voor taal zou dit kunnen zijn?" | ./trigrams.pl | 
  ./count.pl | ./probability.pl dutch.tab 
Log likelihood :-67.535513099
Random         :-116.627157856

% echo "Wat voor taal zou dit kunnen zijn?" | ./trigrams.pl |
  ./count.pl | ./probability.pl english.tab 
Log likelihood :-116.793448617
Random         :-116.627157856

The log likelihood (the log of the probability) of the sentence under the Dutch trigram model is much higher than the random baseline. Under the English trigram model, however, it is close to random (in fact slightly below it). The best guess is therefore that the example sentence is Dutch.
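
'probability.pl' itself is not listed here, so the following is only a sketch of how such a score could be computed. The helper names 'log_likelihood' and 'random_baseline' and the add-constant smoothing with constant k are assumptions. The "Random" value in the example is consistent with a uniform baseline: the example sentence yields 35 trigrams, and 35 * log(1/28) is about -116.627.

```perl
use strict;
use warnings;

my $ALPHABET = 28;    # 26 letters plus space and period

# Hypothetical helper: log likelihood of a document's trigram counts
# under a model table. $model and $doc are hash refs of trigram => count;
# $k is an add-constant smoothing value (an assumption, see lead-in).
sub log_likelihood {
    my ($model, $doc, $k) = @_;
    my %context;
    $context{ substr($_, 0, 2) } += $model->{$_} for keys %$model;
    my $ll = 0;
    for my $t (keys %$doc) {
        my $p = (($model->{$t} || 0) + $k)
              / (($context{ substr($t, 0, 2) } || 0) + $k * $ALPHABET);
        $ll += $doc->{$t} * log($p);      # natural logarithm
    }
    return $ll;
}

# Hypothetical helper: baseline in which every character is equally
# likely (probability 1/28), whatever the two preceding characters are.
sub random_baseline {
    my ($doc) = @_;
    my $n = 0;
    $n += $_ for values %$doc;
    return $n * log(1 / $ALPHABET);
}

# Toy model and document, for illustration only.
my %model = ("wat" => 5, "at " => 4, "war" => 1);
my %doc   = ("wat" => 1, "at " => 1);

printf "Log likelihood :%.9f\n", log_likelihood(\%model, \%doc, 0.5);
printf "Random         :%.9f\n", random_baseline(\%doc);
```

With real tables, the model hash would be read from 'dutch.tab' or 'english.tab' and the document hash from the output of 'count.pl'. The language whose model gives the highest log likelihood is the guess.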