Searching Spontaneous Conversational Speech

ACM SIGIR 2007 Workshop - 27 July 2007

Improved Measures for Predicting the Usefulness of Recognition Lattices in Ranked Utterance Retrieval

We consider the problem of evaluating automatic speech recognition lattices to predict their usefulness in speech retrieval applications.

In particular, we focus on ranking utterances by our confidence that they contain a query term. Our purpose is to close the gap between recognition efforts, which have traditionally focused on producing one-best transcripts, and recent retrieval systems, which may utilize multiple transcript hypotheses in indexing and search. We present a simple framework for comparing the ability of two measures to predict how well a system can retrieve a matching lattice.

In a comparison with the traditional measure, simple accuracy (or word error rate), we show with statistical significance that two new measures are superior at predicting a vocabulary-independent utterance retrieval system's rank ordering of speech utterances.

Evaluating ASR Output for Information Retrieval

Within the context of international benchmarks and collection-specific projects, much work on spoken document retrieval has been done in recent years. In 2000 the issue of automatic speech recognition for spoken document retrieval was declared 'solved' for the broadcast news domain. Many collections, however, are not in this domain, and automatic speech recognition for these collections may pose specific new challenges. This requires a method to evaluate automatic speech recognition optimization schemes for these application areas. Traditional measures such as word error rate and story word error rate are not ideal for this. In this paper, three new evaluation metrics are proposed. Their behaviour is investigated on one cultural heritage collection and performance is compared to traditional measurements on TREC broadcast news data.
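For readers unfamiliar with the traditional measure mentioned above, the following is a minimal sketch of word error rate, assuming the common definition (word-level edit distance between reference and hypothesis, divided by the reference length). The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the abstract does not describe an implementation, and the three new metrics it proposes are not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because every word counts equally in this measure, a misrecognized function word costs as much as a misrecognized query term, which is one way to see why it may be a poor proxy for retrieval quality.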

Supporting Radio Archive Workflows with Vocabulary Independent Spoken Keyword Search

Archive departments of large radio broadcasters stand to benefit greatly from speech recognition technology and other audio processing techniques. In order to move towards a practical understanding of how these technologies can support archive staff, two large German radio broadcasters, Deutsche Welle and Westdeutscher Rundfunk, commissioned Fraunhofer IAIS to build a German-language radio archive prototype. This paper discusses the development and assessment of the spoken keyword search module of this prototype. The search module was designed and tested in a project group consisting of both multimedia researchers and archive professionals. As a result, the prototype is unique in that its design and evaluation are tuned explicitly to the requirements of archivists. The paper discusses the special needs of radio archive staff and how they were accommodated in the design of the keyword search functionality. In particular, the archive staff required a vocabulary-independent search facility capable of searching for keywords in an archive containing a high proportion of spontaneous speech. Keyword search is implemented using a fuzzy-matching algorithm, which performs a similarity search on syllable transcripts generated by the speech recognizer. An evaluation is carried out to assess whether or not the radio archive prototype fulfilled the needs of archivists.
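As an illustration of similarity search over syllable transcripts, the sketch below uses approximate substring matching with edit distance (Sellers' variant of the Levenshtein recurrence). This is a swapped-in generic technique: the abstract does not specify the prototype's fuzzy-matching algorithm, and the function name `best_fuzzy_match`, the unit edit costs, and the example syllables are all hypothetical.

```python
def best_fuzzy_match(query, transcript):
    """Find the best approximate occurrence of `query` inside `transcript`.

    Both arguments are lists of syllable strings. Returns a tuple
    (distance, end_index): the smallest edit distance between the query and
    any contiguous stretch of the transcript, and the transcript position
    (exclusive) where that stretch ends.
    """
    if not query:
        raise ValueError("query must contain at least one syllable")

    # prev[i] = cost of matching query[:i] against a stretch ending at the
    # previous transcript position. Before any transcript syllable is read,
    # matching query[:i] costs i deletions.
    prev = list(range(len(query) + 1))
    best = (prev[-1], 0)
    for j, syllable in enumerate(transcript, start=1):
        curr = [0]  # an empty query prefix matches anywhere at zero cost
        for i, query_syllable in enumerate(query, start=1):
            mismatch = 0 if query_syllable == syllable else 1
            curr.append(min(prev[i - 1] + mismatch,  # match or substitute
                            prev[i] + 1,             # extra transcript syllable
                            curr[i - 1] + 1))        # missing query syllable
        if curr[-1] < best[0]:
            best = (curr[-1], j)
        prev = curr
    return best


def similarity(query, transcript):
    """Map the best distance to a score in [0, 1]; 1.0 means an exact hit."""
    distance, _ = best_fuzzy_match(query, transcript)
    return 1.0 - distance / len(query)
```

Because the search operates on syllables rather than a word lexicon, a keyword absent from the recognizer's vocabulary can still be found, and a small distance threshold tolerates recognition errors in spontaneous speech.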

Advances in SpeechFind: CRSS-UTD Spoken Document Retrieval System

This paper presents recent advances in our spoken document retrieval system, SpeechFind, in collaboration with the Collaborative Digitization Program (CDP). SpeechFind for the CDP currently serves as the search engine for 1,300 hours of CDP audio content. Analysis of the CDP corpus shows that the audio content includes a wide range of acoustic conditions, vocabulary selection, and topics. In an effort to determine the amount of user-corrected transcripts needed to impact automatic speech recognition (ASR), a web-based online interface for verification of the ASR-generated transcript was developed. In this study, we also present two advanced fusion approaches to merge subword and word-based retrieval methods within a multilingual spoken document retrieval (SDR) system. We focus on creating robust multilingual SDR systems employing both word-based and subword-based retrieval methods. In the Dynamic Fusion approach, hybrid transcripts/lattices are used to assign dynamic fusion weights to each subsystem. In the Hybrid Fusion approach, queries are searched through hybrid lattices. Experimental results on the CDP corpus demonstrate that acoustic model adaptation using the verified transcripts is effective in improving recognition accuracy. The fusion algorithms are evaluated on a proper name retrieval task within the Spanish Broadcast News domain, where the presented algorithms yield improvements over traditional fusion methods.

Examining the Contributions of Automatic Speech Transcriptions and Metadata Sources for Searching Spontaneous Conversational Speech

The effectiveness of searching spontaneous speech can be enhanced by combining automatic speech transcriptions with semantically related metadata. An important question is what can be expected from transcriptions and metadata in terms of retrieval effectiveness. The CLEF 2006 Cross-Language Speech Retrieval (CL-SR) task provides a spontaneous speech test collection with manual and automatically derived metadata fields. Using this collection, we investigate the underlying reasons for the differing search effectiveness of the individual fields. A further important question is how transcriptions and metadata should be combined for the greatest benefit to search accuracy. We compare standard data fusion methods for combining search results for individual fields with the extended BM25 model for weighted field combination (BM25F). Results indicate that BM25F can produce improved search accuracy, but that it is currently important to set its parameters using a suitable training set.
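To make the weighted field combination concrete, here is a minimal sketch of BM25F scoring in its commonly published form: per-field term frequencies are length-normalized, weighted, and summed into one pseudo-frequency before a single saturation step. The function name `bm25f_score`, the field names, the parameter values, and the smoothed IDF variant are assumptions for illustration and are not taken from the paper.

```python
import math


def bm25f_score(query_terms, doc, field_weights, field_b, avg_field_len,
                doc_freq, num_docs, k1=1.2):
    """Score one document against a query with BM25F.

    doc:            maps a field name to its list of tokens
    field_weights:  per-field boost applied to term frequencies
    field_b:        per-field length-normalization strength in [0, 1]
    avg_field_len:  average token count of each field over the collection
    doc_freq:       maps a term to the number of documents containing it
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, weight in field_weights.items():
            tokens = doc.get(field, [])
            tf = tokens.count(term)
            if tf == 0:
                continue
            b = field_b[field]
            length_norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += weight * tf / length_norm
        if pseudo_tf == 0.0:
            continue
        n = doc_freq.get(term, 0)
        # Assumption: a smoothed IDF that is never negative.
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        # Saturation is applied once, after the fields have been combined.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

The field weights, the `b` values, and `k1` are the parameters that would be tuned on a training set. Standard data fusion, by contrast, scores each field separately and merges the resulting ranked lists or scores.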

An Analysis of Sentence Segmentation Features for Broadcast News, Broadcast Conversations, and Meetings

Information retrieval techniques for speech are based on those developed for text, and thus expect structured data as input. An essential task is to add sentence boundary information to the otherwise unannotated stream of words output by automatic speech recognition systems. We analyze sentence segmentation performance as a function of feature types and transcription (manual versus automatic) for news speech, meetings, and a new corpus of broadcast conversations. Results show that: (1) overall, features for broadcast news transfer well to meetings and broadcast conversations; (2) pitch and energy features perform similarly across corpora, whereas other features (duration, pause, turn-based, and lexical) show differences; (3) the effect of speech recognition errors is remarkably stable over feature types and corpora, with the exception of lexical features for meetings; and (4) broadcast conversations, a new type of data for speech technology, behave more like news speech than like meetings for this task. Implications for modeling of different speaking styles in speech segmentation are discussed.

Results of the 2006 Spoken Term Detection Evaluation

This paper presents the pilot evaluation of Spoken Term Detection technologies, held during the latter part of 2006. Spoken Term Detection systems rapidly detect the presence of a term, which is a sequence of words consecutively spoken, in a large audio corpus of heterogeneous speech material. The paper describes the evaluation task posed to Spoken Term Detection systems, the evaluation methodologies, the Arabic, English and Mandarin evaluation corpora, and the results of the evaluation. Ten participants submitted systems for the evaluation.