Abstract Poutsma

In this paper, we identify two major stages in language identification systems: the language modeling stage, in which the distinctive features of languages are learned and stored in models, and the classification stage, in which a model is formed of the (partial) input document and compared to the language models. The language model most similar to the input document represents the language of the document. We describe all major modeling and classification techniques known in the literature, and identify one disadvantage in them: the need to create a model of the entire document, even though the language can be identified from a small number of features. To avoid this, we introduce a new language identification technique based on Monte Carlo sampling. We show that, by determining the language of a sufficiently large number of random features, we can take the document language to be the language that results most often from these features. Whether the number of samples is sufficiently large can be determined by calculating the standard error of the samples. Finally, we discuss some pilot experiments in which we compare this new technique with others.

Last modified $Date: 2001/10/04 13:39:48 $ by Parlevink Webmaster
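
The sampling idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. Several things here are assumptions made only for illustration: the features are character trigrams, the language models are plain relative-frequency tables, each sampled feature is assigned to the language whose model rates it highest, and the stopping rule uses the standard error of a binomial proportion for the leading language. The names `trigram_model`, `classify_feature`, `identify`, `max_se`, `min_samples`, and `max_samples` are hypothetical.

```python
import math
import random
from collections import Counter


def trigram_model(text):
    """Relative frequencies of character trigrams in a training text (assumed feature type)."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def classify_feature(gram, models):
    """Return the language whose model gives this trigram the highest frequency, or None if unseen."""
    best_lang, best_p = None, 0.0
    for lang, model in models.items():
        p = model.get(gram, 0.0)
        if p > best_p:
            best_lang, best_p = lang, p
    return best_lang


def leader_and_error(votes, n):
    """Leading language and the standard error of its vote share (binomial proportion)."""
    lang, count = votes.most_common(1)[0]
    p = count / n
    return lang, math.sqrt(p * (1 - p) / n)


def identify(document, models, max_se=0.05, min_samples=30, max_samples=2000, seed=0):
    """Sample random trigrams until the leading language's standard error is small enough."""
    document = document.lower()
    if len(document) < 3:
        return None, 0, float("inf")
    rng = random.Random(seed)
    votes = Counter()
    n = 0
    for _ in range(max_samples):
        start = rng.randrange(len(document) - 2)
        lang = classify_feature(document[start:start + 3], models)
        if lang is None:
            continue  # trigram unknown to every model: no vote is cast
        votes[lang] += 1
        n += 1
        if n >= min_samples:
            leader, se = leader_and_error(votes, n)
            if se <= max_se:
                return leader, n, se  # enough samples: stop early
    if n == 0:
        return None, 0, float("inf")
    leader, se = leader_and_error(votes, n)
    return leader, n, se


# Toy models built from one sentence per language (illustrative data only).
models = {
    "en": trigram_model("the quick brown fox jumps over the lazy dog and the cat "
                        "sat on the mat with the other animals"),
    "nl": trigram_model("de snelle bruine vos springt over de luie hond en de kat "
                        "zat op de mat met de andere dieren"),
}
language, samples_used, std_error = identify(
    "the dog and the cat sat with the fox over the mat", models)
print(language, samples_used, std_error)
```

In this sketch the document is never modeled as a whole: only the sampled trigrams are looked up, and sampling stops as soon as the standard error falls below `max_se`. A real system would need far larger training texts and a more careful per-feature decision rule.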