Project Name: STEVIN: SoNaR/D-Coi, SPRAAK & NBEST
Abbreviation: STEVIN
Start date: February 1, 2006
End date: April 1, 2011
Project Description: All projects below are carried out in the STEVIN
programme (funded by, among others, the Taalunie and NWO).
SoNaR (STEVIN Nederlandstalig Referentiecorpus)
SoNaR (2008-2011) aims at the construction of a 500-million-word
corpus of contemporary written Dutch, with a balance between Dutch
from the Netherlands and Dutch originating from sources in Flanders
(Belgium), and between the various genres of language (formal,
informal, domain-specific, age-specific, etc.) and channels
(publications on paper, web-based texts, etc.). All materials will
be annotated according to the design of D-Coi. The corpus is made
available through the TST-Centrale, the portal through which the
STEVIN results are distributed.
D-Coi
D-Coi (completed in 2007) was a pilot project for SoNaR. It aimed to
produce a blueprint for SoNaR. D-Coi entailed the design of the
corpus and the development (or adaptation) of protocols, procedures
and tools that are needed for sampling data, cleaning up, converting
file formats, marking up, annotating, postediting, and validating the
data. In order to support these developments, a 50-million-word pilot
corpus was compiled, partly enriched with linguistic annotations. The
resulting pilot corpus demonstrated the feasibility of SoNaR and
provided the necessary testing ground for obtaining feedback about the
adequacy and practicability of various annotation schemes and
procedures, and the level of success with which tools can be applied.
The Center for Sprogteknologi (CST) has
undertaken the evaluation of the protocols and procedures designed
in D-Coi.
SoNaR partners: Language and Speech/CLST (Radboud University
Nijmegen); Induction of Linguistic Knowledge (ILK), Tilburg
University; Human Media Interaction (HMI), University of Twente;
Hogeschool Gent (dept. Vertaalkunde)
HMI contact: Thijs Verschoor
SPRAAK (Speech Processing, Recognition & Automatic
Annotation Kit)
The availability of a speech recognition system for Dutch is one
of the essential requirements for the language and speech technology
(LST) community. At present, researchers face the problem that either
no good speech recognition tool is available for their purposes, or
existing tools lack functionality or flexibility.
This project has two primary goals that will be accomplished within
a single software framework. The first goal is to develop a highly
modular toolkit for research into speech recognition algorithms. It
allows researchers to focus on one particular aspect of speech
recognition technology without needing to worry about the details of
the other components. The second goal is to provide a state-of-the-art
recogniser for Dutch with a simple interface, so that it can be used
by non-specialists with a minimum of programming requirements. Besides
speech recognition, the resulting software will enable applications in
related fields as well. Examples are linguistic and phonetic research,
where the software can be used to segment large speech databases or to
provide high-quality automatic transcriptions.
The existing ESAT recogniser, augmented with knowledge and code
from the other partners in this project, is chosen as a starting
point. This code base will be transformed to meet the specified
requirements. The transformation is accomplished by improving the
software interfaces to make the package more user-friendly and better
suited to use by a large user community, and by providing adequate user
and developer documentation written in English, so as to make it
easily accessible to the international LST community as well.
Project Partners: Katholieke Universiteit Leuven (ESAT/PSI), Radboud
Universiteit Nijmegen (Language and Speech), TNO Human Factors
(Soesterberg), Universiteit Twente (Human Media Interaction)
HMI Project Contact: Roeland Ordelman
NBEST (Nederlandse Benchmark Evaluatie van
SpraakherkenningsTechnologie, or Northern and Southern Dutch Benchmark
Evaluation of Speech recognition Technology)
Over the years, standardised benchmark evaluation tests have proved
indispensable for the development of several techniques in speech
technology. In N-Best we will organise and execute an evaluation of
large vocabulary speech recognition systems trained for Dutch (both
Northern and Southern Dutch) in two evaluation conditions (Broadcast
News and Conversational Telephony Speech). The goals of the project
are the definition of a proper evaluation setup and a corresponding
set of benchmark results. The evaluation framework can serve both as a
basis for future evaluations, which can probe the progress in large
vocabulary speech recognition for Dutch, and as an aid for the
development of new speech recognition technologies for the Dutch
language. Participants will use a common speech database, the Corpus
Gesproken Nederlands (CGN), for acoustic training of their systems, as
well as other common resources for language modeling and pronunciation
modeling. They will co-operate through exchange of intermediate
experiences, results and models of sub-technologies. The evaluation
will be open to researchers outside the project, who will benefit from
the common training and evaluation resources and the development
experiences of the project partners. Intermediate and final
experimental results and findings will be exchanged and consolidated in
workshops. The evaluation will be based on new speech material that
will be collected and annotated for the purpose of this evaluation.
All evaluation resources, materials and results will be made available
via the TST-Centrale.
Project Partners: TNO Soesterberg, SPEX Nijmegen, CLST
Nijmegen, HMI Twente, ESAT Leuven, ELIS Gent, EWI Delft
HMI contact: Roeland Ordelman
The following HMI member is coordinator of this project:
Franciska de Jong