University of Twente

Outline of TWLT14 presentations


Monday, December 7

Cross-language information retrieval: from naive concepts to realistic applications

10.15 - 10.45
Hans Uszkoreit
Computational Linguistics Department, Saarland University, Germany

In this paper I combine an overview of the goals and major approaches in cross-language information retrieval with some observations of current trends and with a report on a CLIR project that differs in many respects from most research activities in this fast-growing area. In the overview, I will start from a generic model of an information retrieval system. Then the extensions needed to allow queries in a language different from the document language will be introduced. Several options for adding translation technology will be contrasted. I will then report on the research strategy followed in the EU-funded international project Mulinex. In this project a complete modular CLIR system was developed and integrated as the core software for a number of applications and as a platform for research and technology development.

Integrating Different Strategies for Cross-Language Retrieval

10.45 - 11.15
Paul Buitelaar, Klaus Netter and Feiyu Xu
DFKI GmbH Saarbrücken, Germany

MIETTA (Multilingual Information Extraction for Tourism and Travel Assistance) is a European Union funded project that integrates information retrieval with the areas of shallow natural language processing and information extraction (see e.g. [1]). The main objective of the project is to facilitate cross-language retrieval of information in several languages (English, Finnish, French, German, Italian) and on a number of different geographical regions (the German federal state of Saarland, the Finnish region of Turku and the Italian city of Rome).
Approaches to cross-language information retrieval normally include either translation of the user query, or of the document base (see e.g. [2]). In our approach, additionally, the system can generate short summaries in a number of languages from filled-in information extraction templates.

Three strategies
In general, the preferred strategy within MIETTA is (machine) translation of the document base, so that users can query foreign documents and receive results back in their own language. This option, however, is not possible for every language, as in our case for Finnish. Two further strategies are used to accommodate such circumstances.
First, query translation is used, giving full access to the Finnish document base through queries in other languages. However, with this strategy only the queries can be cross-lingual; the results will still be delivered only in Finnish.
Secondly, we therefore use the additional strategy of information extraction as a restricted but goal-directed search strategy that supplies the user with a fixed set of query options (templates) from which the system can generate natural language summaries in preferred languages. Such a strategy presupposes domain-specific natural language processing, term extraction and term translation for all languages involved.

To the user, these three different strategies should be completely transparent, except for the input and output languages, which can be set by options. Such transparency, however, requires an integrated approach to cross-language retrieval that combines document translation, query translation and multilingual generation, depending on availability and need. In MIETTA this is achieved by using query expansion to help determine the proper translation of the search term if document translation is not an option.

Query expansion adds semantic knowledge to the original query, including knowledge from the information extraction templates. The following example illustrates this. The search term "town hall" in English is ambiguous between the BUILDING interpretation and the GOVERNMENT interpretation. For query translation (apart from monolingual clarification) we need to disambiguate between these interpretations. The system gives the user a set of related terms (council, mayor for GOVERNMENT; church, monastery for BUILDING) and corresponding templates triggered by semantic class (services and opening hours for GOVERNMENT; period and architect for BUILDING). With these, the system can interactively disambiguate and translate the search term.
We do realize that the use of interactive query expansion implies involving the user and making the system again less transparent. However, we do believe that the user can be asked about the meaning of their search term, but not so easily about the translation. Additionally, query expansion has multiple purposes already within the system: in information extraction and in concept-based search.
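The "town hall" example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MIETTA implementation: the `Sense` record, the `SENSES` table and the `disambiguate` function are hypothetical names, and the translations are left as placeholders because the abstract does not give them. The related terms and template slots are the ones listed in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    label: str                 # semantic class, e.g. GOVERNMENT
    related_terms: frozenset   # terms offered to the user for clarification
    template_slots: tuple      # extraction template triggered by the class
    translation: str           # placeholder; real value would come from a lexicon


# Hypothetical sense inventory holding only the example from the text.
SENSES = {
    "town hall": [
        Sense("GOVERNMENT", frozenset({"council", "mayor"}),
              ("services", "opening hours"), "<target term, GOVERNMENT sense>"),
        Sense("BUILDING", frozenset({"church", "monastery"}),
              ("period", "architect"), "<target term, BUILDING sense>"),
    ]
}


def disambiguate(term, chosen_related_terms):
    """Return the sense whose related terms best match the user's choices.

    Returns None when nothing matches or when two senses tie, in which case
    the system would have to ask the user again.
    """
    chosen = {t.lower() for t in chosen_related_terms}
    scored = sorted(
        ((len(sense.related_terms & chosen), sense) for sense in SENSES.get(term, [])),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]
```

A user who selects "mayor" would be mapped to the GOVERNMENT sense, so that sense's translation and the services/opening hours template would be used.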

We presented a cross-language retrieval strategy that combines document translation, query translation, information extraction and query expansion. By integrating all of these into a coherent approach, the different strategies for cross-language retrieval that are needed can be left transparent to the user.

[1] J. Hobbs, D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel and M. Tyson. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. In: E. Roche and Y. Schabes, editors, Finite State Devices for Natural Language Processing, pages 383-406, MIT Press, 1997.
[2] G. Grefenstette (ed.). Proceedings of the SIGIR96 Workshop on Cross-Linguistic Information Retrieval, 1996.

Cross-language Retrieval: using one, some or all possible translations?

11.30 - 12.00
Franciska de Jong and Djoerd Hiemstra
Centre for Telematics and Information Technology, University of Twente, the Netherlands

Within the project Twenty-One a system is built that supports Cross-language Information Retrieval (CLIR). Two possible, but fundamentally different, approaches to CLIR are Query Translation (QT) and Document Translation (DT). Twenty-One's approach to CLIR is DT because it has two important advantages. Firstly, it can be done off-line using a classical machine translation system, which makes it possible to give the user a high quality preview of a document. Secondly, there is more context available for lexical disambiguation, which might lead to better retrieval performance in terms of precision and recall. For several types of applications, the first advantage may be a good reason to choose DT. The second advantage, however, is more of a hypothesis. Does the DT approach to CLIR using classical machine translation really lead to better retrieval performance than the QT approach using a machine readable dictionary?

Empirically comparing DT with QT
Answering this question by an empirical study is very difficult for a number of reasons. A first problem is that in the QT approach searching is done in the language of the documents and in the DT approach searching is done in the language of the query, but it is a well known fact that IR is not equally difficult for each language. A second problem is that, for a sound answer to the question, a machine translation system and a machine readable dictionary are required that have exactly the same lexical coverage. If the machine translation system misses vital translations that the machine readable dictionary does list, we end up comparing the coverage of the respective translation lexicons instead of the two approaches to CLIR. Within the Twenty-One project we have a third, more practical, problem that prevents us from evaluating the usefulness of our machine translation system (LOGOS) against the usefulness of our machine readable dictionaries (Van Dale). The Van Dale dictionaries are entirely based on Dutch head words, but translation from and to Dutch is not supported by LOGOS. For the reasons mentioned in this paragraph, we have to reduce the complexity of the DT vs. QT question to a more manageable question.

One, some or all translations?
A first, manageable, step in answering the DT vs. QT question might be the following. What is, given a translation lexicon, the best approach for QT: using one translation for each query term or using more than one translation? Picking one translation is a necessary condition of the DT approach. For QT we can either use one translation for searching or more than one. The question one or more translations also reflects the classical precision / recall dilemma in IR: Picking one specific translation of each query term is a good strategy to achieve high precision; using all possible translations of each query term is a good strategy to achieve high recall.

In this presentation we will use the Dutch topics 1-24 of the TREC CLIR task to search the English document collection for evaluation. We will present a number of techniques that can be used to weight possible translations of a query term. The weights can be used to pick one translation (i.e. the one with the highest weight) or to construct a query with more than one translation. A number of techniques will be discussed that use:

  • additional information in the dictionary,
  • the context in noun phrases,
  • parallel corpora,
  • the vector space model to construct queries that use more translations per query term,
  • the boolean model and a probabilistic interpretation of boolean queries to construct queries that use more translations per query term.
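The difference between "one translation" and "more than one translation" can be made concrete with a small Python sketch. It assumes that some weighting technique has already assigned a weight to each candidate translation; the function names and the example weights are hypothetical and are not taken from the Twenty-One system.

```python
def pick_one(candidates):
    """For each source term keep only the highest-weighted translation."""
    return {term: max(weights, key=weights.get) for term, weights in candidates.items()}


def weighted_query(candidates):
    """Keep all translations, with weights normalised per source term.

    The result maps each target-language term to a query weight, as one
    might feed into a vector space model.
    """
    query = {}
    for term, weights in candidates.items():
        total = sum(weights.values())
        for translation, weight in weights.items():
            query[translation] = query.get(translation, 0.0) + weight / total
    return query


# Illustrative weights only: Dutch "bank" can mean a financial bank or a couch.
example = {"bank": {"bank": 3.0, "couch": 1.0}}
```

With these made-up weights, `pick_one(example)` keeps only "bank" (favouring precision), while `weighted_query(example)` keeps both translations with weights 0.75 and 0.25 (favouring recall).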

Information Extraction from Bilingual Corpora and its application to Machine-aided Translation

12.00 - 12.30
David Hull
Xerox Research Centre Europe, France

This talk will describe how parallel text extraction algorithms can be used for machine aided translation, focusing on two particular applications: semi-automatic construction of bilingual terminology lexica and translation memory. Automatic word alignment and terminology extraction algorithms can be combined to substantially speed the lexicon construction process. Using a highly accurate partial alignment of term constituents, a terminologist need only deal with minor errors in the recognition of term boundaries.

A translation memory system is an example database of sentences and their translations. Such a system can help to increase the productivity of human translators by finding close or exact matches between new sentences and existing sentences in the source language side of the database and returning their target language translations as a starting point for manual translation. The sentence match must be close, otherwise it takes less time and effort to translate from scratch. This means that it is difficult to obtain high coverage without a large sentence database. The next generation of translation memory systems will use statistical alignment algorithms and shallow parsing technology to improve this coverage, by allowing for linguistic abstraction and partial sentence matching. Shallow parsing can identify sentence fragments on the source and target side and these fragments can be automatically aligned via statistical methods. Abstracting away from lexical units to part-of-speech, number, term, or noun phrase classes will allow these systems to mix and match components (e.g. the noun phrase from one sentence and the verb phrase from another sentence).
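A minimal sketch of the lookup step of a translation memory is given below in Python. It uses character-level similarity from the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matching a real system uses; the function name, the threshold value of 0.8 and the example sentence pair are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(memory, sentence, threshold=0.8):
    """Return (source, target, score) for the closest stored sentence.

    `memory` is a list of (source sentence, target translation) pairs.
    Returns None when no stored sentence is similar enough, which models
    the case where translating from scratch is cheaper than post-editing.
    """
    best_score, best_pair = 0.0, None
    for source, target in memory:
        score = SequenceMatcher(None, source.lower(), sentence.lower()).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    if best_pair is None or best_score < threshold:
        return None
    return best_pair[0], best_pair[1], best_score


# Illustrative memory with a single hand-written entry.
MEMORY = [("Press the red button.", "Appuyez sur le bouton rouge.")]
```

Because the match is computed over whole sentences, coverage depends directly on the size of the memory. The partial matching described above would instead compare aligned fragments rather than complete sentences.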

Mirror: Multimedia Query Processing in Extensible Databases

14.00 - 14.30
Arjen de Vries
Centre for Telematics and Information Technology, University of Twente, the Netherlands

The Mirror project investigates the implications of multimedia information retrieval on database design. We assume a modern extensible database system with extensions for feature based search techniques. The multimedia query processor has to bridge the gap between the user's high level information need and the search techniques available in the database. We therefore propose an iterative query process using relevance feedback. The query processor identifies which of the available representations are most promising for answering the query. In addition, it can combine evidence from different sources. Our multimedia retrieval model is a generalization of a well-known text retrieval model. We discuss our prototype implementation of this model, based on Bayesian reasoning over a concept space of automatically generated clusters. The experimentation platform uses structural object-orientation to model the data and its meta-data flexibly, without compromising efficiency and scalability. We illustrate our approach with some first experiments with text and music retrieval.

An Overview of Information Extraction and its Applications to Information Retrieval

14.30 - 15.00
Douglas E. Appelt
Artificial Intelligence Center (AIC), SRI International, United States of America

Information Extraction Technology refers to a collection of shallow natural-language processing techniques that are particularly well suited to extracting specific, targeted information from texts. Information Retrieval Technology is generally thought of as techniques for retrieving a set of documents maximally relevant to a query from a large set of documents. The two technologies, while related from a user's perspective, have generally been thought of as suitable for addressing completely different problems.

This presentation reviews information extraction technology, and considers how it can be applied to improving the precision of routing queries. An experiment in which the SRI FASTUS system is applied to a routing task in TREC-6 is discussed, as well as an Open Domain System that is being developed to make such undertakings easier and quicker in the future.

Combining Linguistic and Knowledge-based Engineering for Information Retrieval and Information Extraction

16.00 - 16.30
Paul van der Vet
Department of Computer Science, University of Twente, the Netherlands

In the pre-computer era, information retrieval relied on human indexers who manually prepared document representations using so-called controlled terms. Controlled terms are taken from pre-defined resources such as thesauri and classification systems. With the advent of computers, assigning controlled terms has been overshadowed by largely automated uncontrolled-term approaches. Still, controlled terms have some obvious advantages over uncontrolled terms: controlled terms are language- and media-independent: whether a text in English, Japanese or Swahili, or a picture or sound fragment, about a tiger, all will receive the same controlled term `tiger' (assuming that that is the term defined).

The salient disadvantage of controlled-term approaches is their high cost. Preparation and maintenance of resources and the indexing process itself have to be done manually. For lack of data the total cost estimate is unknown, however. It may well be that a controlled-term system makes for better recall/precision results, thus avoiding costs incurred by missing relevant information. Anyway, there is a market for controlled-term systems, and it is sufficiently large to earn several companies a living.

Our own IR work concentrates on a controlled-term approach and attempts to automate the indexing process by a combination of linguistic engineering and knowledge-based techniques. The system will be designed in a modular way, so that switching to a different language does not necessitate reworking the entire system. Ideally, only the language-dependent modules for English are exchanged for those for, say, Japanese.

The project is not just about designing an automated indexing system but also about the interplay of linguistic and knowledge-based techniques. The task of information extraction is a continuation of what essentially is the same track, although more ambitious than that of assigning controlled terms. It is also, in a sense, more pertinent to the goal of information rather than document retrieval: in science in particular, data and knowledge bases are often more convenient vehicles for transmitting the contents of what now are articles. Information extraction may help here in two ways: by enabling an author to construct a data or knowledge base while preparing an article; and by enabling retrospective conversion of the archive of scientific publications.

Information retrieval: how far will *really* simple methods take you?

16.30 - 17.00
Karen Sparck-Jones
The Computer Laboratory, Cambridge University, United Kingdom

Document, or text, retrieval is one very important information management task. This is particularly the case where the user is seeking documents about some topic, to meet an information need. Conventional bibliographic retrieval systems are frequently limited to Boolean search expressions, while modern WWW engines appear to offer a mishmash of search devices and deliver variable performance.

The normal assumption is that natural language processing is required for more effective retrieval systems, to determine request and document word senses and conceptual relations. But comparative experiments have shown that this is not the case, and that simple, statistical methods are as good and indeed work well.

The talk will present these methods, discuss the reasons why they are effective and what light this throws on the functions and use of language in information processing tasks, and consider how the methods used for retrieval may be extended to other tasks and combined with conventional NLP.

Tuesday, December 8

Cross-Language Information Retrieval: Some Methods and tools

9.00 - 9.30
Raymond Flournoy (1), Hiroshi Masuichi (2) and Stanley Peters (1)
(1) Center for the Study of Language and Information (CSLI), Stanford University, United States of America
(2) Fuji Xerox Co., Ltd., Corporate Research Labs, Kanagawa

We describe two related methods of cross-language information retrieval which share a new approach that does not involve machine translation of queries. We compare their strengths with methods that require translating queries.

Talking Pictures: Indexing and Representing Video with Collateral Texts

9.30 - 10.00
Andrew Salway and Khurshid Ahmad
Department of Computing, University of Surrey, United Kingdom

Visual information sometimes occurs with descriptive textual information: this relationship can be exploited for indexing and representing the contents of images [6]. Language technology may be used to extract indices and representations of multimedia data from the words of experts, spoken as they attend to multimedia artefacts. Such collateral texts provide descriptions which are grounded in domain expertise. An information system for processing collateral texts and multimedia data is presented in this paper. The so-called semantic contents of some multimedia artefacts may be best understood, and hence explicated, by the experts of a particular domain: consider a surgeon examining an X-ray image; a detective with a scene-of-crime photograph; and the work of art critics, dance scholars and music scholars. The expert's knowledge is couched in a special language with an idiosyncratic lexicogrammar [4]. This lexicogrammar comprises a specialist lexicon or terminology and a restricted syntax, and is used by the expert to categorise and discourse on the (sometimes multimedia) phenomena of their domain. However, special languages and expert knowledge are not exploited for indexing and representing multimedia information. The artificial intelligence and cognitive psychology literatures discuss knowledge acquisition techniques which can be used to elicit, analyse and represent aspects of human knowledge. Protocol Analysis is such a technique, in which an expert is asked to think aloud as they perform a task: the resultant verbalization is taken to reflect their cognitive processes - and hence their expertise [3]. Can this approach be adapted to access the cognitive and linguistic means that an expert utilises when attending to a multimedia artefact? The resultant verbalization - considered as a collateral text - could then provide a source for indexing and representing the multimedia artefact.
Dance studies is a discipline which has emerged over the last 100 years and whose experts discourse on a richly multimedia subject [1]: as such it is a suitable test case for an evaluation of collateral texts. Five dance experts were recorded thinking out loud as they watched excerpts from four different dances - having first been instructed to describe what they saw. Analysis of the resulting transcripts suggests that (i) these collateral texts may be used to index sections of the video by adapting text-based information retrieval methods; and, (ii) representations of video content may be produced by applying information extraction methods. A multimedia information system, KAB (Knowledge-rich Annotation and Browsing), has been developed to process collateral texts alongside digital video. It comprises a video database (20 minutes of dance video) which is populated with indices and representations by the semi-automated analysis of collateral texts (12,870 words of transcribed expert commentary). Key terms, compound terms, and statistically significant collocation patterns have been identified in these texts using corpus linguistics methods [2,5]. These results support information retrieval through, e.g. query expansion (via a domain thesaurus) and information extraction through, e.g. template identification. The KAB system for indexing and representing multimedia information may be of relevance for, among others, the keepers of electronic art galleries, music collections and scene-of-crime databases.

[1] Janet Adshead, Dance Analysis: Theory and Practice. London: Dance Books, 1998
[2] Douglas Biber, Susan Conrad and Randi Reppen, Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press. 1998
[3] K. A. Ericsson and H. A. Simon, Protocol Analysis: Verbal Reports as Data. Cambridge, MA and London: The MIT Press. 1993
[4] Michael Halliday and John Martin, Writing Science: Literacy and Discursive Power. London and Washington: The Falmer Press. 1993
[5] John M. Sinclair (ed.), Looking Up. An Account of the COBUILD Project in Lexical Computing. London and Glasgow: Collins ELT. 1987
[6] Rohini Srihari, Use of Captions and Other Collateral Text in Understanding Photographs, Artificial Intelligence Review 8 (5-6), 409-430, 1995

Pop-Eye: language technology for video retrieval

10.00 - 10.30
Wim van Bruxvoort
VDA informatiebeheersing, Hilversum, the Netherlands


Going digital at TV-archives: new dimensions of information management for professional and public demands

11.00 - 11.30
Istar Buscher
Südwestrundfunk (SWR), Baden Baden, Germany

This document describes the involvement of SWR (a German public broadcaster and member of the ARD network) in European multimedia projects for digital video archive solutions. The main focus is on the results of the Esprit project Euromedia and its first implementation for productive usage at one production department. The paper also describes the technical and administrative difficulties that TV archives face while shifting to digital archiving.

Vision and language, the impossible connection

11.30 - 12.00
Arnold Smeulders, Theo Gevers and Martin Kersten
Faculty of Mathematics, Computer Science, Physics, and Astronomy, University of Amsterdam, the Netherlands

Image search engines call upon the combined effort of computer vision and database technology to advance beyond exemplary systems. In this paper we chart several areas for research and provide an architectural design to accommodate image search engines.

Retrieving Pictures for Document Generation

12.00 - 12.30
Kees van Deemter
Information Technology Research Institute (ITRI), University of Brighton, United Kingdom

Research on Document Generation has started to involve more than language generation alone, focusing not only on putting information into words, but also on putting information into pictures and laying out the words and pictures on a page or a computer screen (e.g. [1]). Most of the research, however, has focused on pictures that are themselves generated from smaller components, making use of a simple compositional semantics. Many practical uses of pictures, however, involve `photographic' pictures, for which it is extremely difficult to provide a compositional semantics. For this reason, `photographic' pictures defy being generated, forcing one to look for an alternative approach.

The approach we are exploring in connection with the What You See Is What You Meant (WYSIWYM, [2]) approach to knowledge editing relies on the inclusion of pictures that are retrieved in their entirety from an annotated library. In the library, each picture is associated with a logical representation which is intended to capture the information that the picture conveys. Each representation is a conjunction of positive literals [3]. Given a library of this kind, different algorithms can be used for retrieving the picture that best suits a given item of information. In this talk, two retrieval algorithms will be discussed, in combination with different types of pictorial representations. Let us call the item of information for which the system wants to retrieve an illustration I. According to algorithm A, the system retrieves the logically weakest picture that (based on its logical representation) logically implies I. According to algorithm B, the system retrieves the logically strongest picture that (based on its logical representation) is logically implied by I.
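Since each representation is a conjunction of positive literals, implication between two representations reduces to set inclusion: one conjunction implies another exactly when it contains all of the other's literals. The Python sketch below uses this to illustrate algorithms A and B. It is not the ITRI implementation: the function names, the literal names and the tiny library are hypothetical, and the number of literals is used as a simple proxy for "weakest" and "strongest", which is an assumption because implication only gives a partial order.

```python
def implies(stronger, weaker):
    """A conjunction of positive literals implies another iff it contains it."""
    return weaker <= stronger


def algorithm_a(library, info):
    """Weakest picture whose representation implies the information item."""
    candidates = {name: lits for name, lits in library.items() if implies(lits, info)}
    if not candidates:
        return None
    # Fewest literals = least extra information beyond the item (proxy).
    return min(sorted(candidates), key=lambda name: len(candidates[name]))


def algorithm_b(library, info):
    """Strongest picture whose representation is implied by the information item."""
    candidates = {name: lits for name, lits in library.items() if implies(info, lits)}
    if not candidates:
        return None
    # Most literals = covers as much of the item as possible (proxy).
    return max(sorted(candidates), key=lambda name: len(candidates[name]))


# Hypothetical annotated library; literal names are invented for illustration.
LIBRARY = {
    "nose_plain": frozenset({"nose"}),
    "apply_left": frozenset({"nose", "apply_ointment", "left_nostril"}),
}
ITEM = frozenset({"nose", "apply_ointment"})
```

For this item, algorithm A returns "apply_left", a picture that says more than the item does, while algorithm B returns "nose_plain", a picture that says less, because B may not add the literal about the left nostril.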

The library-based approach replaces the generation of pictures by (a particular kind of) retrieval, from an annotated library of pictures. We at ITRI are presently exploring the feasibility of this approach for the generation of pharmaceutical Patient Information Leaflets (PILs). About 60% of the leaflets in the ABPI Compendium of PILs contain pictures, many of which appear in several leaflets [4]. Let us illustrate the use of pictures by looking at a part of one leaflet in the corpus.

  1. Unscrew the cap and squeeze a small amount of ointment, about the size of a match-head, on to your little finger.

  2. Apply ointment to the inside of one nostril.

  3. Repeat for the other nostril.

  4. Close your nostrils by pressing the sides of the nose together for a moment. This will spread the ointment inside each nostril.

Note that the text says `the inside of one nostril' (2) and then `the other nostril' (3), without specifying a particular order. By contrast, the first picture of the nose clearly depicts an action involving the left nostril and the second depicts one involving the right nostril: the pictures have to `overspecify' the content of I by focusing on one possible order in which the instructions may be carried out. (It would have been difficult to depict two arbitrary nostrils, while at the same time making clear that one of them is the left and the other the right nostril.) The problem that this raises for the retrieval algorithm B is that B prohibits retrieving a picture whose representation contains information (i.e., the fact that the action involves the left nostril) that is not implied by I. Algorithm A does not run into this problem, but it would face various other problems. For example, when I describes a nose, the system is not supposed to retrieve a picture of a nose with a bleeding sore.

In this talk, the library-based approach to the inclusion of pictures in document generation and its application to PILs will be outlined. More specifically, a solution to the problem of overspecification will be presented, based on algorithm B.

[1] M. Maybury and W. Wahlster (Eds.) "Readings in Intelligent User Interfaces". 1998
[2] R. Power and D. Scott, "Multilingual Authoring using Feedback Texts", Proc. of COLING/ACL, Montreal 98. 1998
[3] K. van Deemter, "Representations for Multimedia Coreference". To appear in Proc. of ECAI workshop on Combining AI and Graphics for the Interface of the Future. Brighton, 1998.
[4] "1996-1997 ABPI Compendium of Patient Information Leaflets", The Association of the British Pharmaceutical Industry. 1997

The THISL Spoken Document Retrieval System

14.00 - 14.30
Steve Renals and Dave Abberly
Department of Computer Science, University of Sheffield, United Kingdom

THISL is an ESPRIT Long Term Research Project focused on the development and construction of a system to retrieve items from an archive of television and radio news broadcasts. In this paper we outline our spoken document retrieval system based on the ABBOT speech recognizer and a text retrieval system based on Okapi term-weighting. The system has been evaluated as part of the TREC-6 and TREC-7 spoken document retrieval evaluations and we report on the results of the TREC-7 evaluation based on a document collection of 100 hours of North American broadcast news.

Phoneme Based Spoken Document Retrieval

14.30 - 15.00
Wessel Kraaij, Joop van Gent and Rudie Ekkelenkamp
Institute of Applied Physics
David van Leeuwen
Human Factors Research Institute, TNO, the Netherlands

Since speech recognition technology has become mature, retrieval of spoken documents has become a feasible task. We report on two cases which aim at scalable and effective retrieval of broadcast recordings. The approach is based on a hybrid architecture that combines the speed of off-line phoneme indexing with the precision of wordspotting. The architecture remains scalable and allows for frequent updates of a database in which out-of-vocabulary (OOV) words are abundant. A pilot experiment has been done on a small database of recordings of a Dutch talkshow. A more extensive evaluation took place in the framework of the SDR track of TREC-7 on English broadcast news.

Information novelty and the MMR Metric in Retrieval and Summarization

15.00 - 15.30
Jaime Carbonell and Jade Goldstein
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213 USA

According to Herb Simon, "Human attention is the most precious resource," especially that of overworked individuals in key positions. However, we are producing and disseminating information at an ever-increasing rate, without addressing the ability of humans to assimilate that information any faster. The web is just the latest manifestation of this phenomenon, sometimes labeled the "information glut." Since we cannot change human cognitive capabilities, technological solutions must be sought to better select, present, and summarize the most relevant information in a customizable manner.

Traditional information retrieval (IR) focuses on selecting documents (or web pages) most relevant to a user query. Maximizing precision and recall is the rallying cry of IR. But there is more to intelligent information management. Retrieving four versions of the same document, however relevant, is less useful than retrieving the latest version plus three different and also relevant documents. The Maximal Marginal Relevance principle evaluates documents not only for query relevance but for novelty of information content with respect to already retrieved documents. Moreover, presenting entire documents is wasteful when passages or summaries will suffice.

The presentation describes the MMR metric and a method of automatically producing document summaries based on application of MMR at the sub-document level. In essence MMR-based summarization combines query-relevance of document content selected for inclusion in the summary with anti-redundancy preferences for maximizing the novelty of each item of information included in the summary. Preliminary user studies show a clear preference for MMR-generated summaries over best human practice (hand-generated published abstracts), as the latter, while highly fluent, are a one-size-fits-all summary that ignores the immediate information needs of the user. In this way, we take a small step towards helping tame the proverbial information glut.
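The abstract does not spell out the metric, so the Python sketch below follows the commonly cited formulation of MMR: at each step, select the candidate that maximizes lambda times its similarity to the query, minus (1 - lambda) times its highest similarity to any item already selected. The function name, the parameter `lam`, and the similarity values are assumptions made for illustration. The same loop applies whether the items are documents or passages within one document.

```python
def mmr_select(query_sim, pairwise_sim, k, lam=0.7):
    """Greedily pick up to k items, trading relevance against redundancy.

    query_sim: dict mapping item id -> similarity to the query.
    pairwise_sim: function (item, item) -> similarity between two items.
    lam: 1.0 ranks by relevance only; lower values reward novelty more.
    """
    selected = []
    remaining = set(query_sim)
    while remaining and len(selected) < k:
        def score(item):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((pairwise_sim(item, s) for s in selected), default=0.0)
            return lam * query_sim[item] - (1 - lam) * redundancy
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up similarities: v1 and v2 are near-duplicate versions of one document.
QUERY_SIM = {"v1": 0.9, "v2": 0.88, "other": 0.6}
PAIR_SIM = {frozenset({"v1", "v2"}): 0.95}


def pair_sim(a, b):
    """Look up a symmetric similarity, defaulting to a low value."""
    return PAIR_SIM.get(frozenset({a, b}), 0.1)
```

With `lam=0.5`, `mmr_select(QUERY_SIM, pair_sim, 2, lam=0.5)` picks "v1" and then "other" rather than the near-duplicate "v2". With `lam=1.0` the selection falls back to plain relevance ranking and returns "v1" followed by "v2".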
