Outline of TWLT14 presentations
Cross-language information retrieval: from naive concepts to realistic applications, Germany
In this paper I combine an overview of the goals and major approaches in cross-language information retrieval with some observations on current trends and a report on a CLIR project that differs in many respects from most research activities in this fast-growing area. In the overview, I will start from a generic model of an information retrieval system. Then the extensions will be introduced that are needed to allow queries in a language different from the document language. Several options for adding translation technology will be contrasted. I will then report on the research strategy followed in the EU-funded international project Mulinex. In this project a complete modular CLIR system was developed and integrated as the core software for a number of applications and as a platform for research and technology development.
Integrating Different Strategies for Cross-Language Retrieval
Query expansion adds semantic knowledge to the original query, including knowledge from the information extraction templates. The following example illustrates this. The English search term "town hall" is ambiguous between the BUILDING interpretation and the GOVERNMENT interpretation. For query translation (apart from monolingual clarification) we need to disambiguate between these interpretations. By giving the user a set of related terms (council, mayor for GOVERNMENT; church, monastery for BUILDING) and corresponding templates, triggered by semantic class (services and opening hours for GOVERNMENT; period and architect for BUILDING), the system can interactively disambiguate and translate the search term.
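The interactive disambiguation step described above can be sketched roughly as follows. The sense inventory, related terms, template slots, and German glosses here are invented for illustration and are not taken from any actual system.

```python
# Hypothetical sketch: map a user's choice of related term to a sense,
# then use that sense to select a translation and template slots.

SENSES = {
    "town hall": {
        "GOVERNMENT": {
            "related_terms": ["council", "mayor"],
            "template_slots": ["services", "opening hours"],
            "translation_de": "Stadtverwaltung",  # illustrative gloss
        },
        "BUILDING": {
            "related_terms": ["church", "monastery"],
            "template_slots": ["period", "architect"],
            "translation_de": "Rathaus",  # illustrative gloss
        },
    }
}

def disambiguate(term, chosen_related_term):
    """Pick the sense whose related-term set contains the user's choice."""
    for sense, info in SENSES.get(term, {}).items():
        if chosen_related_term in info["related_terms"]:
            return sense, info["translation_de"], info["template_slots"]
    return None

print(disambiguate("town hall", "mayor"))
```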
Cross-language Retrieval: using one, some or all possible translations?
11.30 - 12.00
Empirically comparing DT with QT
One, some or all translations?
In this presentation we will use the Dutch topics 1-24 of the TREC CLIR task to search the English document collection for evaluation. We will present a number of techniques that can be used to weight the possible translations of a query term. The weights can be used either to pick one translation (the one with the highest weight) or to construct a query containing more than one translation. A number of techniques will be discussed that use:
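The two uses of translation weights mentioned above can be sketched as follows. The candidate translations and their weights are invented for illustration; they are not the techniques evaluated in the talk.

```python
# Illustrative sketch: given hypothetical weights over candidate English
# translations of a query term, either keep only the best translation or
# build a weighted query from several of them.

candidates = {
    # invented weights for translations of the Dutch term "bank"
    "bank": 0.55,   # financial institution
    "bench": 0.35,
    "shore": 0.10,
}

def pick_one(weighted):
    """Use only the highest-weighted translation."""
    return max(weighted, key=weighted.get)

def weighted_query(weighted, threshold=0.2):
    """Keep all translations above a threshold, with their weights."""
    return {t: w for t, w in weighted.items() if w >= threshold}

print(pick_one(candidates))
print(weighted_query(candidates))
```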
Information Extraction from Bilingual Corpora and its application to Machine-aided Translation
This talk will describe how parallel text extraction algorithms can be used for machine aided translation, focusing on two particular applications: semi-automatic construction of bilingual terminology lexica and translation memory. Automatic word alignment and terminology extraction algorithms can be combined to substantially speed the lexicon construction process. Using a highly accurate partial alignment of term constituents, a terminologist need only deal with minor errors in the recognition of term boundaries.
A translation memory system is an example database of sentences and their translations. Such a system can help to increase the productivity of human translators by finding close or exact matches between new sentences and existing sentences in the source-language side of the database and returning their target-language translations as a starting point for manual translation. The match must be close; otherwise translating from scratch takes less time and effort than editing the retrieved translation. This means that it is difficult to obtain high coverage without a large sentence database. The next generation of translation memory systems will use statistical alignment algorithms and shallow parsing technology to improve this coverage by allowing for linguistic abstraction and partial sentence matching. Shallow parsing can identify sentence fragments on the source and target side, and these fragments can be automatically aligned via statistical methods. Abstracting away from lexical units to part-of-speech, number, term, or noun phrase classes will allow these systems to mix and match components (e.g. the noun phrase from one sentence and the verb phrase from another sentence).
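The core lookup of a translation memory can be sketched with a simple string-similarity match. Real systems use more sophisticated matching than the standard-library `difflib` ratio used here, and the example sentences and threshold are invented.

```python
# Minimal translation-memory lookup sketch: find the stored source
# sentence closest to the new sentence, and return its translation
# only if the match is close enough to be worth post-editing.
import difflib

memory = {
    "The printer is out of paper.": "Der Drucker hat kein Papier mehr.",
    "Press the power button.": "Drücken Sie den Einschaltknopf.",
}

def lookup(sentence, threshold=0.8):
    """Return (source, translation) of the best match above threshold."""
    best, best_score = None, 0.0
    for source, target in memory.items():
        score = difflib.SequenceMatcher(None, sentence, source).ratio()
        if score > best_score:
            best, best_score = (source, target), score
    return best if best_score >= threshold else None

print(lookup("The printer is out of paper!"))
```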
Mirror: Multimedia Query Processing in Extensible Databases
The Mirror project investigates the implications of multimedia information retrieval on database design. We assume a modern extensible database system with extensions for feature based search techniques. The multimedia query processor has to bridge the gap between the user's high level information need and the search techniques available in the database. We therefore propose an iterative query process using relevance feedback. The query processor identifies which of the available representations are most promising for answering the query. In addition, it can combine evidence from different sources. Our multimedia retrieval model is a generalization of a well-known text retrieval model. We discuss our prototype implementation of this model, based on Bayesian reasoning over a concept space of automatically generated clusters. The experimentation platform uses structural object-orientation to model the data and its meta-data flexibly, without compromising efficiency and scalability. We illustrate our approach with some first experiments with text and music retrieval.
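Combining evidence from different representations, as described above, can be illustrated with a generic probabilistic combination rule. This is a textbook naive-Bayes-style combination under a conditional-independence assumption, not the Mirror model itself, and the probabilities are invented.

```python
# Toy sketch: combine per-representation probabilities that an object
# is relevant (e.g. one score from text terms, one from audio features)
# by summing log-odds, assuming conditional independence.
import math

def combine(evidence):
    """Combine independent relevance probabilities via log-odds."""
    log_odds = sum(math.log(p / (1.0 - p)) for p in evidence)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two representations that agree weakly yield stronger combined evidence.
print(round(combine([0.7, 0.6]), 3))
```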
An Overview of Information Extraction and its Applications to Information Retrieval
Information Extraction Technology refers to a collection of shallow natural-language processing techniques that are particularly well suited to extracting specific, targeted information from texts. Information Retrieval Technology is generally thought of as a set of techniques for retrieving, from a large document set, the documents maximally relevant to a query. The two technologies, while related from a user's perspective, have generally been thought of as suitable for addressing completely different problems.
This presentation reviews information extraction technology, and considers how it can be applied to improving the precision of routing queries. An experiment in which the SRI FASTUS system is applied to a routing task in TREC-6 is discussed, as well as an Open Domain System that is being developed to make such undertakings easier and quicker in the future.
Combining Linguistic and Knowledge-based Engineering for Information Retrieval and Information Extraction
In the pre-computer era, information retrieval relied on human indexers who manually prepared document representations using so-called controlled terms. Controlled terms are taken from pre-defined resources such as thesauri and classification systems. With the advent of computers, assigning controlled terms has been overshadowed by largely automated uncontrolled-term approaches. Still, controlled terms have some obvious advantages over uncontrolled terms: they are language- and media-independent. Whether it is a text in English, Japanese or Swahili, or a picture or sound fragment, any document about a tiger will receive the same controlled term `tiger' (assuming that that is the term defined).
The salient disadvantage of controlled-term approaches is their high cost: preparation and maintenance of resources, and the indexing process itself, have to be done manually. For lack of data, however, no reliable estimate of the total cost is available. It may well be that a controlled-term system yields better recall/precision results, thus avoiding costs incurred by missing relevant information. In any case, there is a market for controlled-term systems, and it is sufficiently large to earn several companies a living.
Our own IR work concentrates on a controlled-term approach and attempts to automate the indexing process by a combination of linguistic engineering and knowledge-based techniques. The system will be designed in a modular way, so that switching to a different language does not necessitate reworking the entire system. Ideally, only the language-dependent modules for English are exchanged for those for, say, Japanese.
The project is not just about designing an automated indexing system but also about the interplay of linguistic and knowledge-based techniques. The task of information extraction is a continuation of essentially the same track, although more ambitious than that of assigning controlled terms. It is also, in a sense, more pertinent to the goal of information rather than document retrieval: in science in particular, data and knowledge bases are often more convenient vehicles for transmitting the contents of what are now articles. Information extraction may help here in two ways: by enabling an author to construct a data or knowledge base while preparing an article, and by enabling retrospective conversion of the archive of scientific publications.
Information retrieval: how far will *really* simple methods take you?
16.30 - 17.00
Document, or text, retrieval is one very important information management task. This is particularly the case where the user is seeking documents about some topic, to meet an information need. Conventional bibliographic retrieval systems are frequently limited to Boolean search expressions, while modern WWW engines appear to offer a mishmash of search devices and deliver variable performance.
The normal assumption is that natural language processing is required for more effective retrieval systems, to determine the word senses and conceptual relations in requests and documents. But comparative experiments have shown that this is not the case, and that simple statistical methods are just as good and indeed work well.
The talk will present these methods, discuss the reasons why they are effective and what light this throws on the functions and use of language in information processing tasks, and consider how the methods used for retrieval may be extended to other tasks and combined with conventional NLP.
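A minimal example of the kind of simple statistical method at issue is tf-idf term weighting with overlap scoring. The documents and query below are invented, and this is only one common instantiation of "simple, statistical" retrieval, not necessarily the exact methods the talk presents.

```python
# Toy tf-idf retrieval: score each document by summing, over query
# terms it contains, term frequency times inverse document frequency.
import math
from collections import Counter

docs = {
    "d1": "cats and dogs and cats",
    "d2": "dogs chase cars",
    "d3": "quantum information retrieval",
}

def tfidf_rank(query, documents):
    n = len(documents)
    tokenized = {d: text.lower().split() for d, text in documents.items()}
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))            # document frequency per term
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)              # term frequency in this document
        scores[d] = sum(
            tf[t] * math.log(n / df[t])
            for t in query.lower().split() if t in tf
        )
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(tfidf_rank("cats dogs", docs))
```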
Tuesday, December 8
Cross-Language Information Retrieval: Some Methods and Tools
9.00 - 9.30
We describe two related methods of cross-language information retrieval which share a new approach that does not involve machine translation of queries. We compare their strengths with methods that require translating queries.
Talking Pictures: Indexing and Representing Video with Collateral Texts
Visual information sometimes occurs with descriptive textual information: this relationship can be exploited for indexing and representing the contents of images. Language technology may be used to extract indices and representations of multimedia data from the words of experts, spoken as they attend to multimedia artefacts. Such collateral texts provide descriptions which are grounded in domain expertise. An information system for processing collateral texts and multimedia data is presented in this paper. The so-called semantic contents of some multimedia artefacts may be best understood, and hence explicated, by the experts of a particular domain: consider a surgeon examining an X-ray image, a detective with a scene-of-crime photograph, or the work of art critics, dance scholars and music scholars. The expert's knowledge is couched in a special language with an idiosyncratic lexicogrammar. This lexicogrammar comprises a specialist lexicon or terminology and a restricted syntax, and is used by the expert to categorise and discourse on the (sometimes multimedia) phenomena of their domain. However, special languages and expert knowledge are not exploited for indexing and representing multimedia information. The artificial intelligence and cognitive psychology literatures discuss knowledge acquisition techniques which can be used to elicit, analyse and represent aspects of human knowledge. Protocol analysis is one such technique, in which an expert is asked to think aloud while performing a task: the resultant verbalization is taken to reflect their cognitive processes, and hence their expertise. Can this approach be adapted to access the cognitive and linguistic means that an expert utilises when attending to a multimedia artefact? The resultant verbalization, considered as a collateral text, could then provide a source for indexing and representing the multimedia artefact.
Dance studies is a discipline which has emerged over the last 100 years and whose experts discourse on a richly multimedia subject: as such it is a suitable test case for an evaluation of collateral texts. Five dance experts were recorded thinking aloud as they watched excerpts from four different dances, having first been instructed to describe what they saw. Analysis of the resulting transcripts suggests that (i) these collateral texts may be used to index sections of the video by adapting text-based information retrieval methods; and (ii) representations of video content may be produced by applying information extraction methods. A multimedia information system, KAB (Knowledge-rich Annotation and Browsing), has been developed to process collateral texts alongside digital video. It comprises a video database (20 minutes of dance video) which is populated with indices and representations by the semi-automated analysis of collateral texts (12,870 words of transcribed expert commentary). Key terms, compound terms, and statistically significant collocation patterns have been identified in these texts using corpus linguistics methods [2,5]. These results support information retrieval through, e.g., query expansion (via a domain thesaurus) and information extraction through, e.g., template identification. The KAB system for indexing and representing multimedia information may be of relevance for, among others, the keepers of electronic art galleries, music collections and scene-of-crime databases.
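One standard corpus-linguistics technique for finding collocation patterns of the kind mentioned above is pointwise mutual information (PMI) over adjacent word pairs. The miniature "commentary" corpus below is invented; the KAB system's actual methods may differ.

```python
# Score candidate collocations in a toy commentary corpus by PMI:
# log p(w1,w2) / (p(w1) * p(w2)), estimated from adjacent-pair counts.
import math
from collections import Counter

tokens = ("the dancer lifts her arm the dancer turns slowly "
          "her arm extends the music swells the dancer leaps").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    """PMI of an adjacent word pair; -inf if the pair never occurs."""
    p_joint = bigrams[(w1, w2)] / (n - 1)
    if p_joint == 0:
        return float("-inf")
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log(p_joint / (p1 * p2))

# "her arm" always co-occurs, so it scores higher than the looser
# pairing "the dancer".
print(round(pmi("her", "arm"), 2), round(pmi("the", "dancer"), 2))
```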
Pop-Eye: language technology for video retrieval
10.00 - 10.30
Going digital at TV-archives: new dimensions of information management for professional and public demands
11.00 - 11.30
This document describes SWR's (a German public broadcaster and member of the ARD network) involvement in European multimedia projects for digital video archive solutions. The main focus is on the results of the Esprit project Euromedia and its first implementation for productive use in one production department. The paper also describes the technical and administrative difficulties TV archives face when shifting to digital archiving.
Vision and language, the impossible connection
11.30 - 12.00
Image search engines call upon the combined effort of computer vision and database technology to advance beyond exemplary systems. In this paper we chart several areas for research and provide an architectural design to accommodate image search engines.
Retrieving Pictures for Document Generation, United Kingdom
Research on Document Generation has started to involve more than language generation alone, focusing not only on putting information into words, but also on putting information into pictures and laying out the words and pictures on a page or a computer screen. Most of the research, however, has focused on pictures that are themselves generated from smaller components, making use of a simple compositional semantics. Many practical uses of pictures, by contrast, involve `photographic' pictures, for which it is extremely difficult to provide a compositional semantics. For this reason, `photographic' pictures defy generation, forcing one to look for an alternative approach.
The approach we are exploring in connection with the What You See Is What You Meant (WYSIWYM) approach to knowledge editing relies on the inclusion of pictures that are retrieved in their entirety from an annotated library. In the library, each picture is associated with a logical representation which is intended to capture the information that the picture conveys. Each representation is a conjunction of positive literals. Given a library of this kind, different algorithms can be used for retrieving the picture that best suits a given item of information. In this talk, two retrieval algorithms will be discussed, in combination with different types of pictorial representations. Let us call the item of information for which the system wants to retrieve an illustration I. According to algorithm A, the system retrieves the logically weakest picture that (based on its logical representation) logically implies I. According to algorithm B, the system retrieves the logically strongest picture that (based on its logical representation) is logically implied by I.
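Treating each representation (a conjunction of positive literals) as a set of literals, "P implies I" becomes "P's literal set contains I's", and the two algorithms can be sketched directly. The library entries and literals below are invented for illustration.

```python
# Sketch of algorithms A and B over a toy annotated picture library.
# A conjunction of positive literals is modelled as a Python set;
# picture P implies information item I iff P's literals ⊇ I's literals.

library = {
    "left_nostril_spray": {"person", "nose", "spray", "left_nostril"},
    "nose_only":          {"person", "nose"},
    "face":               {"person"},
}

def algorithm_a(i_literals):
    """Weakest picture implying I: smallest superset of I's literals."""
    candidates = [(name, rep) for name, rep in library.items()
                  if rep >= i_literals]
    return min(candidates, key=lambda nr: len(nr[1]), default=(None,))[0]

def algorithm_b(i_literals):
    """Strongest picture implied by I: largest subset of I's literals."""
    candidates = [(name, rep) for name, rep in library.items()
                  if rep <= i_literals]
    return max(candidates, key=lambda nr: len(nr[1]), default=(None,))[0]

i = {"person", "nose", "spray"}
print(algorithm_a(i))  # accepts the overspecified left-nostril picture
print(algorithm_b(i))  # rejects it, settling for the plain nose picture
```

This makes the overspecification problem discussed below concrete: the picture annotated with `left_nostril` is retrievable under A but not under B, because its representation contains a literal not implied by I.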
The library-based approach replaces the generation of pictures by (a particular kind of) retrieval, from an annotated library of pictures. We at ITRI are presently exploring the feasibility of this approach for the generation of pharmaceutical Patient Information Leaflets (PILs). About 60% of the leaflets in the ABPI Compendium of PILs contain pictures, many of which appear in several leaflets . Let us illustrate the use of pictures by looking at a part of one leaflet in the corpus.
Note that the text says `the inside of one nostril' (2) and then `the other nostril' (3), without specifying a particular order. By contrast, the first picture of the nose clearly depicts an action involving the left nostril and the second depicts one involving the right nostril: the pictures have to `overspecify' the content of I by focusing on one possible order in which the instructions may be carried out. (It would have been difficult to depict two arbitrary nostrils while at the same time making clear that one of them is the left and the other the right nostril.) The problem that this raises for retrieval algorithm B is that B prohibits retrieving a picture whose representation contains information (i.e., the fact that the action involves the left nostril) that is not implied by I. Algorithm A does not run into this problem, but it faces various other problems. For example, when I describes a nose, the system is not supposed to retrieve a picture of a nose with a bleeding sore.
In this talk, the library-based approach to the inclusion of pictures in document generation and its application to PILs will be outlined. More specifically, a solution to the problem of overspecification will be presented, based on algorithm B.
The THISL Spoken Document Retrieval System, United Kingdom
THISL is an ESPRIT Long Term Research Project focused on the development and construction of a system to retrieve items from an archive of television and radio news broadcasts. In this paper we outline our spoken document retrieval system, which is based on the ABBOT speech recognizer and a text retrieval system using Okapi term weighting. The system has been evaluated as part of the TREC-6 and TREC-7 spoken document retrieval evaluations, and we report on the results of the TREC-7 evaluation, based on a document collection of 100 hours of North American broadcast news.
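The Okapi term weighting mentioned above is usually instantiated as BM25. The following sketch uses the common default parameters (k1 = 1.2, b = 0.75) and invented document statistics; the exact variant used in THISL may differ.

```python
# BM25 (Okapi) weight of a single term in a single document:
# an idf factor times a saturating, length-normalized tf factor.
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight for one term-document pair."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

# A term occurring 3 times in an average-length document,
# appearing in 50 of 10,000 documents:
print(round(bm25_weight(tf=3, df=50, n_docs=10_000,
                        doc_len=100, avg_len=100), 2))
```

The saturation in the tf factor (diminishing returns for repeated terms) and the document-length normalization are what distinguish BM25 from plain tf-idf.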
Phoneme Based Spoken Document Retrieval
Now that speech recognition technology has matured, retrieval of spoken documents has become a feasible task. We report on two cases which aim at scalable and effective retrieval of broadcast recordings. The approach is based on a hybrid architecture which combines the speed of off-line phoneme indexing with the precision of word spotting, while remaining scalable and allowing for frequent updates of a database in which out-of-vocabulary (OOV) words are abundant. A pilot experiment has been done on a small database of recordings of a Dutch talk show. A more extensive evaluation took place in the framework of the SDR track of TREC-7 on English broadcast news.
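One plausible form of the off-line phoneme indexing described above is a phoneme n-gram index, which lets out-of-vocabulary query words match at the subword level. The phoneme transcriptions, trigram choice, and matching rule below are all invented for illustration.

```python
# Toy phoneme-trigram index: documents are stored as phoneme strings,
# indexed by phoneme trigrams; a query's phonemes are matched by
# counting shared trigrams, with no word-level lexicon involved.
from collections import defaultdict

docs = {
    "news_001": "dh ax p r eh z ih d ax n t s p ow k",  # "the president spoke"
    "news_002": "m y uw z ih k f eh s t ih v ax l",      # "music festival"
}

def trigrams(phonemes):
    p = phonemes.split()
    return {tuple(p[i:i + 3]) for i in range(len(p) - 2)}

index = defaultdict(set)
for doc_id, phonemes in docs.items():
    for tg in trigrams(phonemes):
        index[tg].add(doc_id)

def search(query_phonemes, min_overlap=2):
    """Return documents sharing at least min_overlap phoneme trigrams."""
    hits = defaultdict(int)
    for tg in trigrams(query_phonemes):
        for doc_id in index[tg]:
            hits[doc_id] += 1
    return [d for d, c in hits.items() if c >= min_overlap]

print(search("p r eh z ih d ax n t"))  # phonemes for "president"
```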
Information novelty and the MMR Metric in Retrieval and Summarization
According to Herb Simon, "human attention is the most precious resource," especially that of overworked individuals in key positions. However, we are producing and disseminating information at an ever-increasing rate, while the human ability to assimilate that information grows no faster. The web is just the latest manifestation of this phenomenon, sometimes labeled the "information glut." Since we cannot change human cognitive capabilities, technological solutions must be sought to better select, present, and summarize the most relevant information in a customizable manner.
Traditional information retrieval (IR) focuses on selecting documents (or web pages) most relevant to a user query. Maximizing precision and recall is the rallying cry of IR. But there is more to intelligent information management. Retrieving four versions of the same document, however relevant, is less useful than retrieving the latest version plus three different and also relevant documents. The Maximal Marginal Relevance principle evaluates documents not only for query relevance but for novelty of information content with respect to already retrieved documents. Moreover, presenting entire documents is wasteful when passages or summaries will suffice.
The presentation describes the MMR metric and a method of automatically producing document summaries based on application of MMR at the sub-document level. In essence, MMR-based summarization combines the query-relevance of document content selected for inclusion in the summary with anti-redundancy preferences that maximize the novelty of each item of information included in the summary. Preliminary user studies show a clear preference for MMR-generated summaries over best human practice (hand-generated published abstracts), as the latter, while highly fluent, are one-size-fits-all summaries that ignore the immediate information needs of the user. In this way, we take a small step towards taming the proverbial information glut.
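The MMR selection rule itself is a greedy argmax over a relevance term minus a redundancy term. The sketch below uses invented similarity values and a common form of the rule, lambda * Sim(d, q) - (1 - lambda) * max over selected d' of Sim(d, d').

```python
# Greedy Maximal Marginal Relevance: at each step pick the item that
# balances query relevance against redundancy with items already chosen.

def mmr_select(query_sim, pairwise_sim, k, lam=0.7):
    """Select k items maximizing marginal relevance greedily."""
    selected = []
    remaining = set(query_sim)
    while remaining and len(selected) < k:
        def score(d):
            redundancy = max((pairwise_sim[frozenset((d, s))]
                              for s in selected), default=0.0)
            return lam * query_sim[d] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_sim = {"doc_a": 0.9, "doc_a2": 0.88, "doc_b": 0.7}  # doc_a2 ~ doc_a
pairwise_sim = {
    frozenset(("doc_a", "doc_a2")): 0.95,  # near-duplicates
    frozenset(("doc_a", "doc_b")): 0.1,
    frozenset(("doc_a2", "doc_b")): 0.1,
}

# The near-duplicate doc_a2 is passed over in favour of the novel doc_b.
print(mmr_select(query_sim, pairwise_sim, k=2))
```

With lambda = 1 the rule reduces to plain relevance ranking; lowering lambda trades relevance for novelty.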