
Abstract Bouma

Many NLP tasks require that multi-word lexical units whose semantics or syntax is idiosyncratic are recognized as such. For this reason, a computational lexicon must include multi-word items. However, existing dictionaries typically include such items only to a very limited extent. In this talk we consider how a particular class of multi-word lexical units can be acquired from a corpus.

Collocational prepositional phrases like `ten koste van' (at the expense of), `met het oog op' (with a view to), and `onder het mom van' (under the guise of) are patterns of the form [Prep NP Prep] that have non-compositional semantics and are syntactically rigid or idiosyncratic. We present a number of linguistic tests which set such items apart from regularly built prepositional phrases.

To find candidate strings which should be included in a computational lexicon as multi-word prepositional phrases, we extracted all instances of the pattern [Prep NP Prep] from a corpus annotated with POS tags. Next, we used a number of statistical tests, such as mutual information, log likelihood, and phrasal entropy, to find those instances which behave like strong collocations.
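
As a rough illustration of this pipeline (not the procedure used in the talk), the sketch below extracts [Prep NP Prep] candidates from a POS-tagged token list and scores them with one possible formulation of mutual information. Several things here are assumptions: the tag names `Prep`, `Det`, `Adj`, and `N` are hypothetical, the NP is approximated as an optional determiner, any number of adjectives, and a single noun, and the score is the pointwise mutual information of the whole string against the product of its word probabilities.

```python
import math
from collections import Counter

# Hypothetical tag names; the abstract does not say which tagset was used.
PREP, DET, ADJ, NOUN = "Prep", "Det", "Adj", "N"


def extract_candidates(tagged):
    """Return all [Prep NP Prep] matches as tuples of lowercased words.

    `tagged` is a list of (word, tag) pairs. The NP is approximated
    (an assumption) as: optional Det, zero or more Adj, exactly one N.
    """
    n = len(tagged)
    candidates = []
    for i, (_, tag) in enumerate(tagged):
        if tag != PREP:
            continue
        j = i + 1
        if j < n and tagged[j][1] == DET:
            j += 1
        while j < n and tagged[j][1] == ADJ:
            j += 1
        if not (j < n and tagged[j][1] == NOUN):
            continue
        j += 1
        # The NP must be followed directly by the closing preposition.
        if j < n and tagged[j][1] == PREP:
            candidates.append(tuple(w.lower() for w, _ in tagged[i:j + 1]))
    return candidates


def pmi(phrase, phrase_counts, word_counts, total):
    """Pointwise mutual information of the phrase versus its words.

    Computes log2( P(phrase) / product of P(word) ), with probabilities
    estimated as relative frequencies over `total` corpus tokens.
    """
    p_phrase = phrase_counts[phrase] / total
    p_independent = math.prod(word_counts[w] / total for w in phrase)
    return math.log2(p_phrase / p_independent)


def rank_candidates(tagged):
    """Rank extracted candidates by PMI, highest score first."""
    word_counts = Counter(w.lower() for w, _ in tagged)
    phrase_counts = Counter(extract_candidates(tagged))
    total = len(tagged)
    scored = [(pmi(p, phrase_counts, word_counts, total), p)
              for p in phrase_counts]
    return sorted(scored, reverse=True)
```

On a real corpus the pattern also matches accidental sequences in which the second preposition starts an unrelated phrase. Separating those from true collocations is the job of the association scores; log likelihood or phrasal entropy would be computed from the same counts.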

We will discuss the results of a human evaluation of the items ranked highest by the statistical tests.

Last modified $Date: 2001/10/04 13:39:43 $ by Parlevink Webmaster