Abstract
Hiemstra

Information Retrieval (IR) has a long-standing tradition of experimental work, firmly grounded in the "scientific method". Most of this work is done for English, but, as is well known, English has a relatively simple morphology, without, for example, the almost unrestricted possibilities for compounding nouns found in Dutch.
In this presentation, we report on the first large-scale IR evaluation using Dutch documents and queries. The test corpus has three main ingredients:
1. the documents themselves,
2. the natural language descriptions of the user's needs for information, called "topics",
3. the right answers, called "relevance judgements", which are made by human judges for many thousands of documents.

This year (2001) a first evaluation round for Dutch IR was organised within the European project CLEF (Cross-language Evaluation Forum), which includes similar evaluations for five other European languages. The Dutch test corpus was kindly provided by PCM Landelijke dagbladen / Het Parool. It represents a sample of almost 200,000 articles published by the NRC Handelsblad and the Algemeen Dagblad in 1994 and 1995. A total of 50 topics were created, and relevance judgements were made on a representative sample of the results of 14 retrieval experiments.

The CLEF workshop of 2 and 3 September in Darmstadt had participants from many of the major Dutch national information retrieval research groups and information retrieval software companies. Interestingly, the use of stemmers, lemmatisers, compound splitters and shallow parsers (which usually do not improve retrieval effectiveness significantly for English retrieval) turned out to have a positive effect on Dutch retrieval effectiveness.

The Dutch retrieval test corpus is available for research purposes to official CLEF participants. Two further rounds will be organised in 2002 and 2003. Groups interested in participating may contact the author or visit: http://www.clef-campaign.org

Last modified $Date: 2001/10/04 13:39:45 $ by Parlevink Webmaster