
Abstract Winchester

Proper names occur frequently in all types of natural language text. However, the treatment of proper names remains an under-researched area of Natural Language Processing (NLP). One particular problem is how to link information about the same entity when it is referred to by possibly different proper names in several documents. This problem is relevant to applications in both Information Extraction and Information Retrieval. Previous solutions have typically relied on deep natural language processing techniques, which have proven unscalable to real-world applications.

In our paper we will describe a prototype system which first pre-processes individual documents using a simple name-conflation algorithm and then uses an adaptation of Schütze's context-group discrimination algorithm to cluster documents that are judged to contain references to the same named entity. Vector representations are first constructed for every occurrence of a potentially co-referent name (e.g. all the "John Smiths", "Johnny Smiths", "J. Smiths", etc.). These representations are built from second-order name co-occurrence: the vectors that represent two occurrences of "John Smith" will be similar if the names that co-occur directly with each occurrence in turn co-occur with similar names across the corpus. A clustering algorithm then merges similar occurrences so that each cluster represents a different entity. We will also present some preliminary results showing that performance can be improved by restricting the dimensions of the vector space to particular categories of proper names.
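
To make the idea of second-order name co-occurrence concrete, the following is a minimal sketch and not the prototype system itself. It assumes that names have already been extracted and conflated, so each document is simply a list of name strings. The toy corpus, the function names, the choice to exclude the target name from the vector dimensions, and the greedy threshold clustering (a simple stand-in for the clustering used in context-group discrimination) are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import sqrt

def first_order_vectors(docs, target):
    """Map each name to counts of the other names it shares a document with.
    Assumption: the target name itself is excluded as a dimension."""
    vectors = defaultdict(Counter)
    for names in docs:
        unique = set(names) - {target}
        for a in unique:
            for b in unique:
                if a != b:
                    vectors[a][b] += 1
    return vectors

def second_order_vector(names, target, first_order):
    """Sum the first-order vectors of the names co-occurring with the target."""
    total = Counter()
    for name in set(names) - {target}:
        total.update(first_order[name])
    return total

def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(vectors, threshold=0.5):
    """Greedy threshold clustering (hypothetical stand-in): occurrences whose
    similarity reaches the threshold end up with the same label."""
    labels = list(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i):
            if cosine(vectors[i], vectors[j]) >= threshold:
                old, new = labels[i], labels[j]
                labels = [new if label == old else label for label in labels]
    return labels

# Invented toy corpus: one list of (already conflated) names per document.
target = "John Smith"
docs = [
    ["John Smith", "Alice Jones", "Acme Corp"],
    ["John Smith", "Bob Brown", "Acme Corp"],
    ["John Smith", "Carol White", "City Hospital"],
    ["John Smith", "Dan Green", "City Hospital"],
    ["Alice Jones", "Bob Brown", "Acme Corp"],
    ["Carol White", "Dan Green", "City Hospital"],
]

first_order = first_order_vectors(docs, target)
occurrences = [d for d in docs if target in d]
vectors = [second_order_vector(d, target, first_order) for d in occurrences]
labels = cluster(vectors)
print(labels)
```

In this toy corpus the first two occurrences of "John Smith" share no direct neighbour other than "Acme Corp", yet their second-order vectors are close because "Alice Jones" and "Bob Brown" themselves co-occur with similar names; the two hospital-related occurrences form a separate group. Restricting the dimensions to particular categories of proper names would, under this sketch, amount to filtering the keys kept in each vector.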

Last modified $Date: 2001/10/04 13:39:48 $ by Parlevink Webmaster