
Abstract Daciuk

We describe a library of functions that use finite-state automata for compact storage and efficient use of very large dictionaries and language models. The library can be used to test whether a word is in a dictionary, to perform morphological analysis, to construct perfect hash tables, and to construct and use very large language models (such as models that employ bigram and trigram frequencies derived from very large corpora).

The library is written in C++, but a C interface is also provided, which makes it easy to link the library with programs written in other programming languages, such as Prolog. We use the library in the Alpino system, a wide-coverage language understanding system for Dutch.

The library makes use of a standalone software package, fsa (available at http://www.pg.gda.pl/~jandac/fsa.html), which is briefly described here as well. The software provides a very compact representation of dictionaries, perfect hashing functions on words, and language models, using state-of-the-art automata compression techniques. For example, a German morphological dictionary of 4.5 million inflected forms (380MB source file) can be stored in 0.5MB. That representation is also very fast: for the German dictionary mentioned above, it is 7 times faster than the Mmorph programme, which uses hash tables from the Berkeley db package.
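
To illustrate how an automaton can serve both as a dictionary and as a perfect hash function on words, here is a minimal sketch. It is not the API of the library or of the fsa package: the names `Node`, `WordIndex`, `contains` and `index_of` are hypothetical, and the sketch uses a plain, unminimized trie rather than a compressed minimal automaton. The idea shown is that each state stores how many words can be accepted from it, so the rank of a word in sorted order can be computed while following its transitions.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical illustration: a plain trie with per-state word counts.
// The fsa package builds minimized, compressed automata instead.
struct Node {
    bool is_final = false;         // a word ends in this state
    std::size_t words_below = 0;   // number of words accepted from this state
    std::map<char, std::unique_ptr<Node>> next;  // transitions, ordered by char
};

class WordIndex {
public:
    explicit WordIndex(const std::vector<std::string>& words) {
        for (const auto& w : words) insert(w);
        count(root_);
    }

    // Dictionary lookup: one transition per character.
    bool contains(const std::string& word) const { return index_of(word) >= 0; }

    // Perfect hash: 0-based rank of the word in the trie's sorted order,
    // or -1 if the word is not in the dictionary.
    long index_of(const std::string& word) const {
        const Node* state = &root_;
        long rank = 0;
        for (char c : word) {
            if (state->is_final) ++rank;  // a proper prefix that is a word sorts first
            auto it = state->next.find(c);
            if (it == state->next.end()) return -1;
            // Skip every word that leaves this state via a smaller character.
            for (auto jt = state->next.begin(); jt != it; ++jt)
                rank += static_cast<long>(jt->second->words_below);
            state = it->second.get();
        }
        return state->is_final ? rank : -1;
    }

    std::size_t size() const { return root_.words_below; }

private:
    void insert(const std::string& word) {
        Node* state = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = state->next[c];
            if (!child) child.reset(new Node());
            state = child.get();
        }
        state->is_final = true;
    }

    // Fill in words_below bottom-up.
    static std::size_t count(Node& n) {
        n.words_below = n.is_final ? 1 : 0;
        for (auto& kv : n.next) n.words_below += count(*kv.second);
        return n.words_below;
    }

    Node root_;
};
```

Because every word in the dictionary maps to a distinct number between 0 and `size() - 1`, that number can index a plain array of associated data, such as frequencies in a language model. A real implementation would also merge equivalent states, which is where the compactness reported above comes from.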

Last modified $Date: 2001/10/04 13:39:43 $ by Parlevink Webmaster