Twente News Corpus (TwNC)

a Multifaceted Dutch News Corpus


The Twente News Corpus (TwNC) is a multifaceted corpus for Dutch that is being deployed in a number of NLP research projects, including:
  • tracks within the Dutch national research programme MultimediaN;
  • the NWO programme CATCH;
  • the Dutch-Flemish programme STEVIN.

Development of the corpus started in 1998 within the predecessor project DRUID, and the corpus has a size of more than 500M words. The text part has been built from four different sources:

  • Dutch national newspapers;
  • television subtitles;
  • teleprompter (autocue) files;
  • both manually and automatically generated broadcast news transcripts, along with the broadcast news audio.

TwNC plays a crucial role in the development and evaluation of a wide range of tools and applications for the domain of multimedia indexing, such as large vocabulary speech recognition, cross-media indexing and cross-language information retrieval. Part of the corpus was contributed to the Dutch written text corpus in the context of the Dutch-Belgian STEVIN project D-COI, which was completed in 2007.

The original goal of the Twente News Corpus (TwNC) was to collect data for training the language models and acoustic models of a large vocabulary speech recognition system for Dutch. That system was to be deployed in the broadcast news domain and also to serve as a baseline system in other domains that lack large amounts of example data (e.g., cultural heritage data, as encountered in the Dutch CHoral project). The focus on news was motivated by the size of the datasets available for this domain, and by the fact that many other research groups also target news. News is a target domain for corpus development, for search applications and for speech technology.

Several requirements for a text corpus follow from this type of deployment. They pertain to, for example, formatting, encoding, size and balancing. TwNC text data has been formatted as XML, and the chosen encoding is UTF-8. Balancing is achieved by selecting four different source types: newspaper text, autocue files (teleprompter text), subtitling files and manually generated transcripts.
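
To illustrate how UTF-8 encoded XML text from several source types might be inspected for balance, the sketch below counts whitespace-separated tokens per source type. The element name `doc`, the attribute `source` and the sample sentences are hypothetical and serve only as an illustration; they are not the actual TwNC schema or data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document. The element and attribute names are
# assumptions made for this sketch, not the real TwNC markup.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <doc source="newspaper"><p>De regering presenteert vandaag de begroting.</p></doc>
  <doc source="autocue"><p>Goedenavond, dit is het nieuws.</p></doc>
  <doc source="subtitles"><p>Het weer blijft morgen zacht.</p></doc>
  <doc source="transcript"><p>eh goedenavond dames en heren</p></doc>
</corpus>
""".encode("utf-8")


def words_per_source(xml_bytes):
    """Count whitespace-separated tokens per source type."""
    # Parsing from bytes lets the parser honour the declared encoding.
    root = ET.fromstring(xml_bytes)
    counts = Counter()
    for doc in root.iter("doc"):
        text = " ".join(doc.itertext())
        counts[doc.get("source", "unknown")] += len(text.split())
    return counts


if __name__ == "__main__":
    for source, n in sorted(words_per_source(SAMPLE).items()):
        print(source, n)
```

Comparing such per-source counts is one simple way to see how evenly the source types are represented; a real corpus would call for proper tokenisation rather than splitting on whitespace.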