Elsevier updates text-mining policy to improve access for researchers
Efforts support researchers’ need to mine text and data, paving the way for future innovation
By Chris Shillum Posted on 31 January 2014
Here is a more recent article on Elsevier's text and data mining policy: How does Elsevier's text mining policy work with new UK TDM law?
With text and data mining (TDM) becoming an increasingly important tool, Elsevier has updated its TDM policy to meet the changing needs of researchers. The new policy allows academic researchers at subscribing institutions to text mine subscribed content for research purposes. This article explores how TDM is being supported by Elsevier and the impact of the new policy on researchers.
Researchers are increasingly applying text- and data-mining techniques to systematically analyze the growing volume of scholarly output in order to extract latent knowledge and generate new insights. Text and data mining encompasses many different methods used to extract key information and discover patterns and trends across large volumes of content.
For the past several years, Elsevier has enabled researchers to text mine content published on ScienceDirect upon request. Over this period, we have seen the demand for text mining grow gradually as the volume of scholarly output increases and computing power and text-mining tools improve. By working with the research community, and through various pilots, we have been able to understand the specific text-mining needs of researchers and find ways to improve our technology to support those efforts.
From our pilots, we have found that text-mining projects typically fall into one or both of two categories: they seek to answer a specific research question, or they build a new data resource for the community.
In the first case – where the answer to a question is being sought – there is a defined hypothesis to be tested, so the challenge lies in how to extract information from research articles that either supports or disproves the hypothesis. Researchers apply a variety of statistical, machine-learning and natural language processing (NLP) techniques to extract entities and relationships of interest from the unstructured textual information, which can then be further analyzed using conventional techniques.
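To make the idea of extracting entities and relationships more concrete, here is a minimal sketch in Python. It is an illustration only, not a method used in our pilots: the lexicon, the function names and the rule that two entities in the same sentence are "related" are all assumptions chosen for brevity. Real projects would typically use curated vocabularies and trained NLP models.

```python
import re
from itertools import combinations

# Hypothetical lexicon mapping a surface form to an entity type.
# A real project would use a curated vocabulary or a trained model.
LEXICON = {
    "BRCA1": "gene",
    "TP53": "gene",
    "aspirin": "chemical",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence):
    """Return (term, type) pairs for lexicon terms found in the sentence."""
    found = []
    for term, kind in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append((term, kind))
    return found


def cooccurrences(text):
    """Pair entities that share a sentence (a crude relationship signal)."""
    pairs = []
    for sentence in split_sentences(text):
        names = sorted(term for term, _ in find_entities(sentence))
        pairs.extend(combinations(names, 2))
    return pairs
```

The pairs returned by `cooccurrences` could then be counted across many articles and examined with conventional statistics to see whether they support the hypothesis under test.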
In the second case, researchers are looking to build resources such as databases of entities, properties and relationships that may be reused by the community for further research. For example, the NeuroElectro database extracted information about the electrophysiological properties (the resting membrane potentials and membrane time constants) of diverse neuron types from scholarly articles and made it available to everyone in a searchable database.
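A resource of this kind can be pictured as a small searchable table. The sketch below is hypothetical and is not how NeuroElectro was built: the regular expression, the table layout and the placeholder DOI are assumptions made for illustration, and the neuron type is supplied by hand rather than extracted.

```python
import re
import sqlite3

# Hypothetical pattern for phrases such as
# "resting membrane potential was -65.2 mV".
PATTERN = re.compile(
    r"(?P<prop>resting membrane potential|membrane time constant)"
    r"\s+(?:of|was|is)\s+(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>mV|ms)",
    flags=re.IGNORECASE,
)


def extract_measurements(text):
    """Return (property, value, unit) tuples found in the text."""
    return [
        (m["prop"].lower(), float(m["value"]), m["unit"])
        for m in PATTERN.finditer(text)
    ]


def build_database(articles):
    """Load measurements into an in-memory SQLite table.

    `articles` is an iterable of (doi, neuron_type, text) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement "
        "(doi TEXT, neuron_type TEXT, property TEXT, value REAL, unit TEXT)"
    )
    for doi, neuron_type, text in articles:
        for prop, value, unit in extract_measurements(text):
            conn.execute(
                "INSERT INTO measurement VALUES (?, ?, ?, ?, ?)",
                (doi, neuron_type, prop, value, unit),
            )
    conn.commit()
    return conn
```

Keeping the DOI in every row means each extracted value can be traced back to the article it came from.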
Elsevier's updated policy on how researchers can text mine
Recognizing these needs and building on experiences from our pilots, Elsevier has now formalized an updated text and data mining policy. Our new policy enshrines text- and data-mining rights in our standard ScienceDirect subscription agreement for academic customers. Further, our self-service developers' portal makes it easier for researchers to gain access to content for TDM without lengthy delays while permission is sought.
Text Mining Primer
API (application programming interface) – An interface for a software program that enables interaction with other software, similar to the way a user interface facilitates interaction between humans and computers.
Entity – In text mining, an entity may refer to a group of words, code, statistics or anything else in the document that can provide information. For Elsevier's customers, entities of interest often include such things as chemical names, genes, proteins or sequences.
Text mining – The process of deriving information from articles by extracting word patterns and other relationships that could lead to new discoveries.
For academic customers, text- and data-mining rights for non-commercial purposes will be included in all new ScienceDirect subscription agreements and upon renewal for existing customers. Librarians interested in adding the TDM clause to their existing agreement prior to renewal are able to request a simple contract amendment via their Elsevier Account Manager.
Once the institutional agreement is updated, researchers at subscribing institutions can use our developers' portal to register. They will then receive a key to the Application Programming Interface (API) of ScienceDirect, which provides full-text content in XML and plain-text formats optimal for TDM.
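As a rough sketch of what a keyed full-text request might look like, the Python below only assembles the URL and headers and does not contact any server. The base URL, the `X-ELS-APIKey` header name and the `fmt` parameter are assumptions for illustration; the developers' portal documentation is the authoritative source for the actual endpoint and authentication details.

```python
from urllib.parse import quote

# Assumption: illustrative endpoint and header name. Confirm both in the
# developers' portal documentation before relying on them.
ASSUMED_BASE_URL = "https://api.elsevier.com/content/article/doi/"
ASSUMED_KEY_HEADER = "X-ELS-APIKey"

# The two formats named in the article, mapped to standard media types.
ACCEPT_TYPES = {"xml": "text/xml", "text": "text/plain"}


def build_fulltext_request(doi, api_key, fmt="xml", base_url=ASSUMED_BASE_URL):
    """Return (url, headers) for a full-text request; nothing is sent."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError("fmt must be 'xml' or 'text'")
    url = base_url + quote(doi, safe="/")
    headers = {ASSUMED_KEY_HEADER: api_key, "Accept": ACCEPT_TYPES[fmt]}
    return url, headers
```

The returned pair can be handed to any HTTP client, for example `urllib.request.Request(url, headers=headers)`.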
When researchers have completed their text-mining project through the API, the output can be used for non-commercial purposes under a CC BY-NC license. The output can contain "snippets" of up to 200 characters of the original text, which enables both the researchers who are answering a specific question and those looking to build resources to define the context of the new information they've extracted from the literature. Elsevier also requests that text-mining researchers include a DOI link back to the original content to ensure that authors receive credit and that future researchers have a reliable reference to the authoritative source of the underlying articles.
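The snippet limit and the DOI link are straightforward to apply in code. The following Python sketch is one possible way to do it, not a prescribed implementation: the function names, the record layout and the choice to centre the snippet on the extracted term are assumptions.

```python
MAX_SNIPPET_CHARS = 200  # limit stated in the policy


def make_snippet(text, start, end, limit=MAX_SNIPPET_CHARS):
    """Return at most `limit` characters of text containing text[start:end]."""
    if end - start >= limit:
        return text[start:start + limit]
    pad = (limit - (end - start)) // 2
    left = max(0, start - pad)
    right = min(len(text), left + limit)
    # Reclaim unused room on the left when the match sits near the end.
    left = max(0, right - limit)
    return text[left:right]


def make_record(doi, text, term):
    """Build an output record for the first occurrence of `term`, or None."""
    start = text.find(term)
    if start == -1:
        return None
    return {
        "entity": term,
        "snippet": make_snippet(text, start, start + len(term)),
        # DOI link back to the original article, as the policy requests.
        "source": "https://doi.org/" + doi,
    }
```

Centring the window keeps some context on both sides of the extracted term, which is what lets a reader judge how the term was used in the original article.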
Working with the industry
In most cases, in order to achieve their objectives, researchers would like to be able to mine content from multiple sources across multiple publishers without having to go through the time-consuming process of establishing and gaining access to content from each publisher separately. This problem was formally addressed in 2013, when Elsevier was one of 16 publishers that signed the International Association of Scientific, Technical, and Medical Publishers' (STM) Text and Data Mining for Non-Commercial Scientific Research commitment. This agreement outlined the need for a common understanding among publishers to ensure that content was mineable, and it has paved the way for new initiatives involving text mining across the industry.
One initiative Elsevier is collaborating on is CrossRef's Prospect, currently in beta. Prospect aims to solve the most frequent pain points for researchers seeking to mine content from multiple publishers by providing two services:
- A common API that can be used by researchers to access the full text of content across publisher sites using a single, consistent mechanism.
- A common license framework that enables researchers to read and agree to terms and conditions from multiple publishers in a single portal.
Elsevier is the first publisher to have fully integrated with the Prospect service, but many other publishers are expected to follow in coming months.
What is the future of TDM?
One of the final conclusions from our pilots is that, while everyone recognizes the opportunity that text mining brings, it is a specialized process. Many researchers are looking for services that make this process easier so they can concentrate on the part they do best – research. Here at Elsevier, we are continually working on ways to make it easier to text mine by both improving our technology support and optimizing the publication process to make content mineable.
Because the real value for text mining lies in the underlying, highly curated and enriched XML content, Elsevier is working on a number of projects to enhance that content and make it richer for text miners. Examples include our data linking projects as well as pilots to help manage data. In addition, we have been scaling up our open access publishing program and are offering user licenses with built-in text-mining permissions that authors can choose.
We also continue to work on a number of pilots to improve our technology support. For example, in our pilot with The National Centre for Text Mining (NaCTeM), we are integrating their text-mining infrastructure with Elsevier content utilizing cloud technology, thus avoiding the need for researchers to build and maintain TDM infrastructure.
The future is exciting for text-mining possibilities, with both publisher and industry-wide solutions making it easier than ever for academics to text mine. Elsevier's updated policy is a reflection of our ongoing collaboration within the research community. We encourage anyone with feedback and ideas to contact us via firstname.lastname@example.org.
Presentation: Facilitating Text and Data Mining
Chris Shillum, VP of Product Management, Platform and Content for Elsevier, formally launched Elsevier's updated text- and data-mining policy on Sunday at the 2014 American Library Association Midwinter Meeting.
Text mining example — information extraction
Elsevier Connect Contributor
As VP of Product Management, Platform and Content for Elsevier, Chris Shillum is responsible for Elsevier's shared product platform. His team looks after the content management systems, access management systems and APIs that power our flagship products including ScienceDirect and Scopus. His current work includes looking into the application of text analytics and big-data processing capabilities to our products, and he is helping to define our text-mining strategy. He also represents Elsevier on the boards of key industry nonprofit organizations, including CrossRef and the International DOI Foundation. Most recently, he has been leading Elsevier's participation in the ORCID initiative and helping to ensure its success.
Shillum has worked in various capacities on Elsevier's online products since joining the original ScienceDirect team in 1997, when he designed and implemented the original end-to-end content workflow. He has deep and broad knowledge of platform architectures, search technology, content management technology and federated access management systems. He holds a master's degree in electronic systems engineering from the University of York in the UK.