
Abstract Gaustad

A new stemmer that combines dictionary lookup with a rule-based backup strategy is introduced. Dictionary lookup is implemented efficiently as a finite-state automaton. The stemmer is evaluated by comparing it to the Dutch Porter Stemmer (Kraaij and Pohlman, 1994). The stemmer with dictionary lookup clearly outperforms the Dutch Porter Stemmer in terms of accuracy, and it is not substantially slower.
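The general idea of dictionary lookup with a rule-based backup can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class `TrieLexicon`, the functions `rule_based_stem` and `stem`, the three-word lexicon, and the suffix list are hypothetical and are not taken from the stemmer described here or from the Dutch Porter Stemmer. A trie is used as a simple stand-in for the finite-state automaton, since a trie is itself a deterministic acyclic automaton over characters.

```python
class TrieLexicon:
    """Word-to-stem dictionary stored as a trie (a simple acyclic automaton)."""

    def __init__(self, entries):
        self.root = {}
        for word, stem_form in entries.items():
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            # The key None marks an accepting state and holds the stem.
            node[None] = stem_form

    def lookup(self, word):
        """Return the stored stem, or None if the word is not in the lexicon."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(None)


# Hypothetical suffix rules, tried in order; NOT the Dutch Porter rule set.
FALLBACK_SUFFIXES = ("heden", "en", "e", "s")


def rule_based_stem(word, min_stem_len=3):
    """Strip the first matching suffix that leaves a long enough stem."""
    for suffix in FALLBACK_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def stem(word, lexicon):
    """Prefer the dictionary; fall back to suffix rules for unknown words."""
    word = word.lower()
    found = lexicon.lookup(word)
    return found if found is not None else rule_based_stem(word)


# Toy lexicon, assumed for illustration only.
LEXICON = TrieLexicon({"liep": "loop", "huizen": "huis", "boeken": "boek"})
```

In this sketch, an irregular form such as "liep" is resolved through the lexicon, which suffix stripping alone could not do, while an unlisted word such as "fietsen" falls through to the suffix rules. Lookup cost grows with the length of the word rather than the size of the lexicon, which is one plausible reason a dictionary-based stemmer need not be much slower than a purely rule-based one.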

We also compared the two stemmers in a real-world application, investigating whether stemming improves classification accuracy in email classification for Dutch. Contrary to our expectations, accuracy does not differ significantly between unstemmed email, email stemmed with the Dutch Porter Stemmer, and email stemmed with the dictionary-based stemmer. We discuss potential explanations for this surprising result.

Last modified 2001/10/04 13:39:45 by Parlevink Webmaster