New open access resource will support text mining and natural language processing
Elsevier is providing a corpus that covers the breadth of STM content – and a treebank for the research community
By Ron Daniel, PhD Posted on 12 March 2015
Text mining uses the power of computers to scan text and extract key information that can be used to fuel new breakthroughs. To do this, text mining researchers and developers need tools for natural language processing (NLP).
In the old days, most text mining was done by writing explicit patterns of text to look for. These patterns would become very long and complex as they were extended to deal with text complications due to plurals, case, synonyms, homographs, contractions, slang and other variations. More modern tools typically use machine learning methods to examine some body of content (a corpus) that has been annotated with the correct answers to some question of interest. One example of such a question is "What is the part of speech for each word in this sentence?" Learning methods can make use of more information than patterns can, and can more easily adapt to text variations, allowing them to achieve better results.
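To illustrate how quickly hand-written patterns grow, here is a minimal sketch of the pattern-based approach. The target terms ("BRCA1" and "tumor suppressor") and the function name `find_mentions` are hypothetical choices made for this example, not taken from any Elsevier tool.

```python
import re

# Hypothetical target terms chosen for illustration only.
# Each variation we want to tolerate (case, plural, British spelling,
# optional hyphen or space) makes the hand-written pattern longer.
PATTERN = re.compile(
    r"\b(?:BRCA[- ]?1|tumou?r[- ]suppressors?)\b",
    re.IGNORECASE,
)


def find_mentions(text: str) -> list[str]:
    """Return every substring of `text` matched by the hand-written pattern."""
    return [match.group(0) for match in PATTERN.finditer(text)]


sample = "Brca-1 is one of several tumour suppressors; BRCA1 loss is common."
mentions = find_mentions(sample)
print(mentions)
```

Even this small pattern already encodes four kinds of variation, and it still knows nothing about synonyms or homographs. That is the maintenance burden that learning from an annotated corpus is meant to avoid.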
Supporting text and data mining at Elsevier
As a publisher, we believe it is our job to help meet the needs of researchers, and we are committed to reducing the barriers to mining content. To achieve this, we provide a flexible way researchers can get access to our API, which delivers an optimized XML format, through our self-service portal.
Access to an optimized XML format, while necessary, is not sufficient on its own to derive analytic results. Text and data mining is a complicated and time-consuming process that not every researcher or lab is in a position to undertake. For example, analyzing and understanding pathways for drug discovery work involves huge numbers of individual transactions, and time and expertise are needed to make sense of the output. There are software packages, both open source and commercial, that can perform this job on behalf of researchers. Read a case study.
Presentation at Text Analytics World
Dr. Ron Daniel will talk about Elsevier’s internal text mining toolkit at the Text Analytics World conference on March 31 in San Francisco. His presentation is titled "Large scale text analysis using Apache Spark."
However, there is a serious issue with NLP tools built by machine learning: people do not all use language in the same way. Think about the difference between tweets and news stories, or fiction vs. corporate reports; there are big differences in the vocabulary, in how content is structured, and in the tone and formality of the language used. NLP tools work best if they are developed and tested on the same type of content they are expected to encounter in production.
Unfortunately, for people looking to text mine in the science, technology and medicine (STM) domains, our content has some big differences from the newswire content that is most commonly used in the development of NLP tools. STM content is replete with citations, complex technical terms including symbols and position information, and other specific features that are natural to our users but never appear in text meant for general audiences. At the same time, newswire text is full of things that STM content rarely has: many mentions of entertainers, politicians, organizations, products, descriptions of wars, elections, sporting events, product announcements, etc. That is the type of information that most text mining tools are trained to understand by default. This means that most leading-edge NLP tools are not well-suited for the kinds of STM content we care about.
What is a corpus – and why is it so important?
To solve this issue, NLP researchers need an appropriate corpus in order to develop and test their NLP algorithms. As mentioned, a corpus is a body of content – STM articles and chapters in this case – along with additional information or instructions that can help text mining tools figure out what they are supposed to do. Part-of-speech tagging was already mentioned, but there are many other kinds of annotation. Determining the boundaries of phrases in sentences – and whether those phrases are the names of genes, proteins, minerals, etc. – is another good example of what NLP can do for content within the corpus.
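One common way to record phrase boundaries and their types is a per-token "BIO" scheme (Begin, Inside, Outside). The sketch below assumes that scheme for illustration; the article does not say which format the corpus annotations use, and the labels `GENE` and `PROTEIN`, the example sentence, and the function name `bio_to_spans` are all hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Convert per-token BIO tags into a list of (label, phrase) pairs.

    "B-X" starts a phrase of type X, "I-X" continues it, and "O" marks a
    token outside any phrase. An "I-X" tag that does not continue an open
    phrase of the same type is treated as outside and dropped.
    """
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # Close any open phrase before starting a new one.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["BRCA1", "encodes", "a", "tumor", "suppressor", "protein", "."]
tags = ["B-GENE", "O", "O", "B-PROTEIN", "I-PROTEIN", "I-PROTEIN", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

A tool trained on a corpus annotated this way learns both where a phrase starts and ends and which category it belongs to.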
Until now, there hasn’t been such a corpus for the breadth of STM content. There are a few corpora of STM content, but they have all concentrated on single domains such as subfields of biomedicine. Several of those corpora include only abstracts rather than the full text of journal articles, and while the terminology is similar, the structure and use of citations are very different in articles as opposed to abstracts.
To improve this situation and help facilitate the development of NLP tools in an open environment, Elsevier has provided a selection of 110 journal articles from 10 different STM domains as a freely available and redistributable corpus. The articles were selected from open access articles published by Elsevier and licensed with a Creative Commons CC-BY license. The domains are agriculture, astronomy, biology, chemistry, computer science, earth science, engineering, materials science, math and medicine. Currently we provide 11 full-text articles in each of the 10 domains.
Each article is also annotated with the output of a number of NLP methods. Some of those annotations were made automatically, but others have been manually checked and corrected so they can be used to test the accuracy of automatic methods.
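As a rough illustration of how manually corrected annotations serve as a test set, the sketch below computes per-token accuracy of an automatic part-of-speech tagger against hand-checked tags. The tag names follow the widely used Penn Treebank tagset, which is an assumption here; the example sentence, the tags, and the function name `tagging_accuracy` are made up for this example and are not drawn from the corpus.

```python
def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of (token, tag) pairs covering the same tokens
    in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(
        gold_tag == predicted_tag
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted)
    )
    return correct / len(gold)


# Hypothetical hand-corrected tags for one sentence.
gold = [("The", "DT"), ("samples", "NNS"), ("were", "VBD"),
        ("heated", "VBN"), (".", ".")]
# Hypothetical tagger output, wrong on "heated".
predicted = [("The", "DT"), ("samples", "NNS"), ("were", "VBD"),
             ("heated", "VBD"), (".", ".")]

accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy:.2f}")
```

Because the gold tags have been checked by a person, a drop in this number on STM text compared with newswire text is evidence about the tagger rather than about the annotations.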
For each article in the corpus we provide:
- XML source
- A plaintext version for easier text mining
- Several versions with different annotations. These currently include part-of-speech tags, sentence breaks, simple noun and verb phrases, root forms of words, syntactic constituent parse trees (see figure), Wikipedia concept identification, and discourse analysis
Community-provided annotations and a treebank
Providing the content is only half the story; we are encouraging all researchers to help annotate the content with many types of annotations beyond those mentioned above. This will allow researchers to compare different algorithms and enable more efficient NLP tools to be created that can handle content from many different STM disciplines.
To help kick-start the process of manually creating test sets, Elsevier has commissioned a treebank over a part of the corpus (10 full-text articles) to be used as a default test set. (A treebank is a corpus that has been annotated with the parse trees for the sentences. Correctly parsing a sentence is a difficult computational task that has many uses. It is also a difficult task for people to do, thus the rarity of treebanked corpora.) As new annotation types are added by the community, we encourage them to use those 10 articles as their first choice for manually reviewed and corrected test data.
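To make the idea of a parse tree concrete, here is a minimal sketch that reads one tree written in the bracketed notation popularized by the Penn Treebank and recovers the sentence from its leaves. The notation is an assumption, since this article does not specify the file format of the treebank, and the example sentence and the names `parse_bracketed` and `leaves` are hypothetical.

```python
import re


def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Children are either further (label, children) tuples or plain strings
    for the words at the leaves. Unbalanced input raises an exception.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1  # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the end of the tree")
    return tree


def leaves(node):
    """Return the words of the sentence in left-to-right order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical example sentence with Penn Treebank style labels.
example = "(S (NP (DT The) (NNS samples)) (VP (VBD were) (VP (VBN heated))) (. .))"
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

The nesting is what makes treebanking hard for both people and programs: every phrase must be attached at the right level, not just labeled.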
How to download the corpus and treebank
You can download the corpus and the first part of the treebank now from the Elsevier Labs’ GitHub site.
The remaining articles for the treebank will be added in the next few months.
Our goal is for this corpus and treebank to become a valuable resource for natural language processing, linguistics, and text mining researchers, developers, and users. We see our role as being part of the wider research community working towards doing a better job of processing STM content for the greater good of science.
Elsevier Connect Contributor
Dr. Ron Daniel (@rdanielmeta) is the Director of Elsevier Labs. He has done extensive work on metadata standards such as the Dublin Core, RDF and PRISM. He was lured away from the Los Alamos National Laboratory to work at a startup that was later acquired for its automatic classification technology. He then consulted on taxonomy, metadata, and information management issues for nine years. Ron is bemused by the way technology reincarnates itself — specifically that parallel implementations of neural networks for machine vision are currently in vogue, just as they were in the late 1980s when he was working on them in grad school.
About Elsevier Labs
Elsevier Labs is the advanced research and development (R&D) group for Elsevier, concentrating on issues around creating and using "smart content" and on the future of research communications. Labs has a number of text mining products to help researchers find the information they need. They have been adding sophisticated natural language processing (NLP) tools to an internal text mining toolkit they use to analyze content and prototype new product features. Many of these tools come from the leading edge of research into NLP and computational linguistics.