We are witnessing an incredible
explosion in the production of information relevant to molecular biology
and biomedicine. This information is stored in most cases as free text.
The pace at which information is produced currently far exceeds our capacity
to make use of it. Therefore, there is growing interest in automated
techniques for harvesting information from the available texts. A number
of groups are active in applying information extraction techniques to free-text
sources. The field is very young and in need of an exchange of methods
and results.
The techniques used vary widely,
ranging from statistical techniques to techniques rooted in computational
linguistics. A comparative analysis of the pros and cons of these
techniques is lacking. Indeed, the tasks themselves, as seen by different
practitioners, seem ill-defined.
The applications in biology today concentrate
on a few questions, particularly protein-protein interactions, but
other areas are likely to be equally promising.
This workshop proposes to address the
question of information extraction at two levels:
1) At the object level, some would advocate
shallow (mostly statistical) techniques, e.g. as used in text mining, while
others would advocate deeper but more expensive techniques. There is a
trade-off involved, about which we want to learn more.
2) At the meta-level, the task, or,
more precisely, the range of tasks, must be better defined.
There are diverse models one can derive from work in computer science,
and natural-language engineering in particular: text mining, indexing for
purposes of information retrieval, DARPA's Message Understanding Conferences (MUC),
and more.