Multilingual Text Mining and Translation for Reaxys (at Elsevier)

Title: Multilingual Text Mining and Translation for Reaxys (at Elsevier)
Institute: University of Twente (HMI)
Place: Enschede, The Netherlands
Type: final project
Start date: 1 February 2017
End date: not present
HMI Contact: Mariët Theune

 

Company

Elsevier is the world's biggest scientific publisher, established in 1880. Elsevier publishes over 2,500 impactful journals including Tetrahedron, Cell and The Lancet. Flagship products include ScienceDirect, Scopus and Reaxys. Increasingly, Elsevier is becoming a major scientific information provider. For specific domains, structured scientific knowledge is extracted for querying and searching from millions of Elsevier and third-party scientific publications (journals, patents and books). In this way, Elsevier is positioning itself as the leading information provider for the scientific and corporate research community.

Task Description

Reaxys is a chemistry knowledge base whose content is extracted and compiled from articles and patents. AskReaxys is a search interface for Reaxys that allows users to enter queries in natural language. Query parsing translates these user queries into internal structured queries, which can then be executed to find related information in the Reaxys knowledge base.
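To make the idea of query parsing concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the property vocabulary, the field names and the output format are hypothetical and do not reflect the actual AskReaxys query representation, which is not described here. A real parser would likely rely on trained models rather than hand-written patterns.

```python
import re

# Hypothetical property vocabulary; the real Reaxys schema is not described in this posting.
PROPERTY_FIELDS = {
    "melting point": "MELTING_POINT",
    "boiling point": "BOILING_POINT",
    "density": "DENSITY",
}


def parse_query(text):
    """Map a query such as 'melting point of benzene' to a structured dict."""
    normalized = text.strip().lower().rstrip("?")
    for phrase, field in PROPERTY_FIELDS.items():
        # Look for '<property> of <substance>' anywhere in the query.
        match = re.search(rf"{re.escape(phrase)}\s+of\s+(.+)$", normalized)
        if match:
            return {"field": field, "substance": match.group(1).strip()}
    # Fall back to a plain keyword search when no pattern applies.
    return {"field": None, "keywords": normalized.split()}
```

For example, under these assumptions `parse_query("What is the melting point of benzene?")` would yield a structured query with the field `MELTING_POINT` and the substance `benzene`, while an unrecognized query falls back to a list of keywords.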

One important subtask is multilingual text mining and translation. The number of patents and articles in languages other than English is increasing significantly. To increase the coverage of Reaxys, we need to mine knowledge using multilingual text mining techniques and align knowledge across languages using translation techniques.

Depending on the student's interests, many other work items can be identified, such as chemical name recognition, user purpose classification, patent number recognition, and relevance ranking of the returned results. Students are encouraged to explore new techniques and publish their work as papers.

Students have the opportunity to work with Elsevier's powerful Spark cluster on the Databricks platform.

Location: Amsterdam or Frankfurt.

Host Group: NLP group, Content & Innovation, Operations, Elsevier.