Abstract Vandeghinste

Lexical coverage for Dutch speech recognition can be maximized by combining real-time lexicon expansion with a limited lexicon of word parts. The real-time expansion module is an automated compounding module that combines a rule system with statistical information. The lexicon is designed to optimize compounding accuracy and contains no compound words that the compounding module can form itself. This greatly reduces the lexicon size while keeping the lexical coverage intact. The lexicon contains non-compounds and quasi-words. Quasi-words are word parts that cannot occur by themselves but serve as building blocks in compounds. Tests were performed with a lexicon of 36,000 entries. In the test texts, all compounds were split up into word parts found in the lexicon. Words that could be identified neither as lexicon entries nor as compounds were classified as out-of-vocabulary (OOV) words. The tests aimed at correctly identifying every two and every three consecutive word parts as either belonging to one compound or being separate words. The output was compared with the original text, which allows us to measure the accuracy of the compounding module. Because the lexical building blocks are compounded automatically, OOV rates are rather small (between 0.8% and 3.8%). Statistical information was included to improve the accuracy of the rule-based compounding system, as the rules alone tend to overgenerate. After parameter optimization, correct identification at the word part level ranged from 94.6% to 98.5%, depending on the text. Considering the respective OOV rates, these results show that the approach is successful.
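
To make the decision step concrete, the following is a minimal sketch of how a rule check and a statistical score might be combined for two consecutive word parts. It is not the module described in the abstract: the lexicon entries, the part-of-speech tags, the probabilities, the `ALLOWED` rule table, the `decide` function, and the threshold are all hypothetical values invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    pos: str        # coarse category; "Q" marks a quasi-word (assumed tag set)
    quasi: bool     # True if the part cannot occur on its own
    p_left: float   # invented probability of appearing as a left compound member
    p_right: float  # invented probability of appearing as a right compound member


# Toy lexicon of word parts. Entries and numbers are illustrative only.
LEXICON = {
    "spraak": Entry("N", False, 0.60, 0.10),
    "herkenning": Entry("N", False, 0.20, 0.70),
    "de": Entry("DET", False, 0.00, 0.00),
    "schei": Entry("Q", True, 1.00, 0.00),
    "kunde": Entry("N", False, 0.05, 0.80),
}

# Hypothetical rule system: which (left, right) category pairs may compound.
ALLOWED = {("N", "N"), ("Q", "N")}


def decide(left: str, right: str, threshold: float = 0.3) -> str:
    """Classify two consecutive word parts as 'compound', 'separate' or 'oov'."""
    a, b = LEXICON.get(left), LEXICON.get(right)
    if a is None or b is None:
        return "oov"        # at least one part is missing from the lexicon
    if (a.pos, b.pos) not in ALLOWED:
        return "separate"   # the rules do not license this combination
    if a.quasi:
        return "compound"   # a quasi-word cannot stand alone
    # The rules allow the pair; the score curbs overgeneration.
    score = a.p_left * b.p_right
    return "compound" if score >= threshold else "separate"


if __name__ == "__main__":
    for pair in [("spraak", "herkenning"), ("herkenning", "spraak"),
                 ("de", "spraak"), ("schei", "kunde"), ("xyz", "spraak")]:
        print(pair, decide(*pair))
```

In this sketch the rules alone would accept both orders of "spraak" and "herkenning"; the score is what rejects the unlikely order, which mirrors the role the abstract assigns to statistical information. The threshold stands in for the kind of parameter that would be tuned during optimization.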