
Abstract Laureys

Parsing Language Models (PLMs) are one of the most interesting recent developments in language modeling for large-vocabulary speech recognition (Bod, 2000; Chelba, 2000; Hogenhout, 2001; Roark, 2001; Van Uytsel, 2001). PLMs apply large-scale statistical parsing techniques to integrate syntactic constraints into a language model. Some of them have been shown to improve speech recognition accuracy when combined with more traditional language models such as the word trigram.

In this talk we report on a case study in which we tried to establish how much a PLM can benefit from a more fine-grained syntactic structure. More specifically, we focused on the syntactic structure of numbers in the Penn Treebank and the BLLIP Treebank. On the one hand, numbers are frequent in these treebanks; on the other hand, their syntactic representation is fairly poor. We enriched the treebanks with a more suitable syntactic structure for numbers, retrained our language model on the enriched treebanks, and compared it with a 'non-enriched' language model.

From a broader perspective, this case study is the starting point of more fundamental research on the integration of richer knowledge sources (e.g. morphology) into language models and the fruitful combination of these knowledge sources.

Last modified $Date: 2001/10/04 13:39:47 $ by Parlevink Webmaster