Project Name: STEVIN: SoNaR/D-Coi, SPRAAK & NBEST

Abbreviation: STEVIN

Start date: February 1, 2006

End date: April 1, 2011

Project Description:
All projects below are carried out within the STEVIN programme (funded by, among others, the Taalunie and NWO).

SoNaR (STEVIN Nederlandstalig Referentiecorpus)

SoNaR (2008-2011) aims at the construction of a 500-million-word corpus of contemporary written Dutch, balanced between Dutch from the Netherlands and Dutch originating from sources in Flanders (Belgium), and between the various genres of language (formal, informal, domain-specific, age-specific, etc.) and channels (publications on paper, web-based texts, etc.). All materials will be annotated according to the design of D-Coi. The corpus is made available through the TST-Centrale, the portal through which the STEVIN results are distributed.


D-Coi (completed in 2007) was a pilot project for SoNaR that aimed to produce a blueprint for the full corpus. D-Coi entailed the design of the corpus and the development (or adaptation) of the protocols, procedures and tools needed for sampling data, cleaning up, converting file formats, marking up, annotating, post-editing, and validating the data. To support these developments, a 50-million-word pilot corpus was compiled and partly enriched with linguistic annotations. The resulting pilot corpus demonstrated the feasibility of SoNaR and provided the testing ground needed to obtain feedback on the adequacy and practicability of the various annotation schemes and procedures, and on the level of success with which tools can be applied. The Center for Sprogteknologi (CST) undertook the evaluation of the protocols and procedures designed in D-Coi.

SoNaR partners: Language and Speech/CLST (Radboud University Nijmegen); Induction of Linguistic Knowledge (ILK), Tilburg University; Human Media Interaction (HMI), University of Twente; Hogeschool Gent (dept. Vertaalkunde)

HMI contact: Thijs Verschoor

SPRAAK (Speech Processing, Recognition & Automatic Annotation Kit)

The availability of a speech recognition system for Dutch is one of the essential requirements for the language and speech technology (LST) community. Researchers currently face the problem that either no good speech recognition tool is available for their purposes, or the existing tools lack functionality or flexibility.

This project has two primary goals that will be accomplished within a single software framework. The first goal is to develop a highly modular toolkit for research into speech recognition algorithms. It allows researchers to focus on one particular aspect of speech recognition technology without needing to worry about the details of the other components. The second goal is to provide a state-of-the-art recogniser for Dutch with a simple interface, so that it can be used by non-specialists with a minimum of programming requirements. Besides speech recognition, the resulting software will enable applications in related fields as well. Examples are linguistic and phonetic research, where the software can be used to segment large speech databases or to provide high-quality automatic transcriptions.

The existing ESAT recogniser, augmented with knowledge and code from the other partners in this project, was chosen as a starting point. This code base will be transformed to meet the specified requirements. The transformation is accomplished by improving the software interfaces, so that the package becomes more user-friendly and suited to use by a large user community, and by providing adequate user and developer documentation written in English, so that it is easily accessible to the international LST community as well.

Project Partners: Katholieke Universiteit Leuven (ESAT/PSI), Radboud Universiteit Nijmegen (Language and Speech), TNO Human Factors (Soesterberg), Universiteit Twente (Human Media Interaction)

HMI Project Contact: Roeland Ordelman

NBEST (Nederlandse Benchmark Evaluatie van SpraakherkenningsTechnologie, or Northern and Southern Dutch Benchmark Evaluation of Speech recognition Technology)

Over the years, standardised benchmark evaluation tests have proved indispensable for the development of several techniques in speech technology. In NBEST we will organise and execute an evaluation of large vocabulary speech recognition systems trained for Dutch (both Northern and Southern Dutch) in two evaluation conditions (Broadcast News and Conversational Telephony Speech). The goals of the project are the definition of a proper evaluation setup and a corresponding set of benchmark results. The evaluation framework can serve both as a basis for future evaluations, which can probe the progress in large vocabulary speech recognition for Dutch, and as an aid for the development of new speech recognition technologies for the Dutch language. Participants will use a common speech database, the Corpus Gesproken Nederlands (CGN), for acoustic training of their systems, as well as other common resources for language modeling and pronunciation modeling. They will co-operate by exchanging intermediate experiences, results and models of sub-technologies. The evaluation will be open to researchers outside the project, who will benefit from the common training and evaluation resources and the development experiences of the project partners. Intermediate and final experimental results and findings will be exchanged and consolidated in workshops. The evaluation will be based on new speech material that will be collected and annotated for the purpose of this evaluation. All evaluation resources, materials and results will be made available via the TST-Centrale.
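The description above does not name a scoring metric. As an illustration only, large vocabulary speech recognition benchmarks are commonly scored with the word error rate (WER): the word-level edit distance between a reference transcript and the recogniser output, divided by the number of reference words. The sketch below assumes plain whitespace tokenisation and no text normalisation; the function name and the example sentences are hypothetical and are not taken from the NBEST evaluation plan.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER of `hypothesis` against `reference`.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or normalisation), which a real evaluation would
    define more carefully.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in recogniser output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one deleted word out of six reference words.
wer = word_error_rate("de kat zit op de mat", "de kat zit op mat")
print(f"WER: {wer:.3f}")
```

Because insertions count as errors, the value can exceed 1.0 when the recogniser output is much longer than the reference.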

Project Partners: TNO Soesterberg, SPEX Nijmegen, CLST Nijmegen, HMI Twente, ESAT Leuven, ELIS Gent, EWI Delft

HMI contact: Roeland Ordelman


The following HMI member is the coordinator of this project:

Franciska de Jong


