Elsevier updates text-mining policy to improve access for researchers

Efforts support researchers’ need to mine text and data, paving the way for future innovation

Editor's note

Here is a more recent article on Elsevier's text and data mining policy: How does Elsevier's text mining policy work with new UK TDM law?

With text and data mining (TDM) becoming an increasingly important tool, Elsevier has updated its TDM policy to meet the changing needs of researchers. The new policy allows academic researchers at subscribing institutions to text mine subscribed content for research purposes. This article explores how TDM is being supported by Elsevier and the impact of the new policy on researchers.

Researchers are increasingly applying text- and data-mining techniques to systematically analyze the growing volume of scholarly output in order to extract latent knowledge and generate new insights. Text and data mining encapsulates many different methods used to extract key information and discover patterns and trends across large volumes of content.

For the past several years, Elsevier has enabled researchers to text mine content published on ScienceDirect upon request. Over this period, we have seen the demand for text mining grow gradually as the volume of scholarly output increases and computing power and text-mining tools improve. By working with the research community, and through various pilots, we have been able to understand the specific text-mining needs of researchers and find ways to improve our technology to support those efforts.

From our pilots, we have found that text-mining projects typically fall into one or both of two categories: they seek to answer a specific research question, or they build a new data resource for the community.

In the first case – where the answer to a question is being sought – there is a defined hypothesis to be tested, so the challenge lies in how to extract information from research articles that either supports or disproves the hypothesis. Researchers apply a variety of statistical, machine-learning and natural language processing (NLP) techniques to extract entities and relationships of interest from the unstructured textual information, which can then be further analyzed using conventional techniques.

In the second case, researchers are looking to build resources such as databases of entities, properties and relationships that may be reused by the community for further research. For example, the NeuroElectro database extracted information about the electrophysiological properties (the resting membrane potentials and membrane time constants) of diverse neuron types from scholarly articles and made it available to everyone in a searchable database.
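As a rough illustration of this kind of resource building, the sketch below pulls (neuron type, property, value) records out of free text using a small lexicon and a regular expression. It is a toy under stated assumptions, not the method used by NeuroElectro or any Elsevier pilot: the lexicon `NEURON_TYPES`, the `Measurement` record and the sample sentences are all invented for the example, and real projects rely on far more robust NLP.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    neuron_type: str
    property_name: str
    value: float
    unit: str


# Hypothetical lexicon of entity names to look for (an assumption for this sketch).
NEURON_TYPES = ["CA1 pyramidal cell", "cerebellar Purkinje cell"]

# Property name, then the first number that follows it, then a unit.
PROPERTY_PATTERN = re.compile(
    r"(resting membrane potential|membrane time constant)"
    r"\D*?([-\u2212]?\d+(?:\.\d+)?)\s*(mV|ms)",
    re.IGNORECASE,
)


def extract_measurements(text):
    """Return one Measurement per sentence that names a known neuron type
    and reports a recognised property with a numeric value."""
    records = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        neuron = next((n for n in NEURON_TYPES if n.lower() in lowered), None)
        match = PROPERTY_PATTERN.search(sentence)
        if neuron and match:
            prop, raw_value, unit = match.groups()
            # Normalise a typographic minus sign before converting to float.
            value = float(raw_value.replace("\u2212", "-"))
            records.append(Measurement(neuron, prop.lower(), value, unit))
    return records


SAMPLE_TEXT = (
    "CA1 pyramidal cells had a resting membrane potential of -65.2 mV. "
    "The membrane time constant of cerebellar Purkinje cells was 12 ms. "
    "This sentence has no measurement."
)

for record in extract_measurements(SAMPLE_TEXT):
    print(record)
```

The resulting records could then be loaded into a searchable database, which is the step that turns one-off extraction into a reusable community resource.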

Elsevier's updated policy on how researchers can text mine

Recognizing these needs and building on experiences from our pilots, Elsevier has now formalized an updated text and data mining policy. Our new policy enshrines text- and data-mining rights in our standard ScienceDirect subscription agreement for academic customers. Further, our self-service developers' portal makes it easier for researchers to gain access to content for TDM without lengthy delays while permission is sought.

Text Mining Primer

API (application programming interface) – An interface for a software program that enables interaction with other software, similar to the way a user interface facilitates interaction between humans and computers.

Entity – In text mining, an entity may refer to a group of words, code, statistics or anything else in the document that can provide information. For Elsevier's customers, entities of interest often include such things as chemical names, genes, proteins or sequences.

Text mining – The process of deriving information from articles by extracting word patterns and other relationships that could lead to new discoveries.

For academic customers, text- and data-mining rights for non-commercial purposes will be included in all new ScienceDirect subscription agreements and upon renewal for existing customers. Librarians interested in adding the TDM clause to their existing agreement prior to renewal are able to request a simple contract amendment via their Elsevier Account Manager.

Once the institutional agreement is updated, researchers at subscribing institutions can use our developers' portal to register. They will then receive a key to the Application Programming Interface (API) of ScienceDirect, which provides full-text content in XML and plain-text formats optimal for TDM.
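To show why structured XML is convenient for TDM, here is a minimal sketch that turns an XML document into a list of clean paragraph strings using Python's standard library. The element names (`article`, `section`, `para`, `italic`) are invented for the example and are not the ScienceDirect schema, which is richer and uses XML namespaces; no API call is made here.

```python
import xml.etree.ElementTree as ET

# Made-up XML fragment standing in for a full-text article (assumed structure).
SAMPLE_XML = """<article>
  <title>Example title</title>
  <body>
    <section>
      <heading>Results</heading>
      <para>Neurons were recorded <italic>in vitro</italic> at rest.</para>
      <para>   </para>
    </section>
  </body>
</article>"""


def xml_to_paragraphs(xml_string, para_tag="para"):
    """Return the text of every <para_tag> element, with inline markup
    flattened and runs of whitespace collapsed to single spaces."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for para in root.iter(para_tag):
        # itertext() walks the element and its children in document order.
        text = " ".join("".join(para.itertext()).split())
        if text:  # skip paragraphs that contain only whitespace
            paragraphs.append(text)
    return paragraphs


print(xml_to_paragraphs(SAMPLE_XML))
```

Working from tagged paragraphs rather than a rendered PDF avoids having to guess where headings, captions and body text begin and end.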

When researchers have completed their text-mining project through the API, the output can be used for non-commercial purposes under a CC BY-NC license. The output can contain "snippets" of up to 200 characters of the original text, which enables both the researchers who are answering a specific question and those looking to build resources to define the context of the new information they've extracted from the literature. Elsevier also requests that text-mining researchers include a DOI link back to the original content to ensure that authors receive credit and that future researchers have a reliable reference to the authoritative source of the underlying articles.
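The output conditions above can be sketched in a few lines: keep any quoted context to at most 200 characters and attach a DOI link to each extracted item. This is an illustrative helper only, not an Elsevier tool. The function names, the record layout and the DOI in the usage example are assumptions; the `https://doi.org/` prefix is the standard DOI resolver.

```python
MAX_SNIPPET_CHARS = 200  # snippet limit stated in the policy


def make_snippet(text, start, end, max_chars=MAX_SNIPPET_CHARS):
    """Return at most max_chars of text, roughly centred on text[start:end]."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("start/end must describe a span inside text")
    span = end - start
    if span >= max_chars:
        # The match alone fills the budget, so keep only its beginning.
        return text[start:start + max_chars]
    pad = max_chars - span
    left = max(0, start - pad // 2)
    right = min(len(text), left + max_chars)
    left = max(0, right - max_chars)  # reuse unused budget near the end of text
    return text[left:right]


def doi_link(doi):
    """Build a resolvable link from a bare DOI string."""
    return f"https://doi.org/{doi}"


def build_record(entity, text, start, end, doi):
    """Bundle an extracted entity with its context snippet and source link."""
    return {
        "entity": entity,
        "snippet": make_snippet(text, start, end),
        "source": doi_link(doi),
        "license": "CC BY-NC",
    }


# Usage with made-up article text and a placeholder DOI.
article_text = "x" * 300 + "BRCA1" + "y" * 300
record = build_record("BRCA1", article_text, 300, 305, "10.1234/example")
print(len(record["snippet"]), record["source"])
```

Centring the window on the match keeps the extracted term visible in its context while staying inside the character limit.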

Working with the industry

In most cases, researchers would like to mine content from multiple sources across multiple publishers without going through the time-consuming process of establishing and gaining access to content from each publisher separately. This problem was formally addressed in 2013, when Elsevier was one of 16 publishers that signed the International Association of Scientific, Technical, and Medical Publishers' (STM) Text and Data Mining for Non-Commercial Scientific Research commitment. This agreement outlined the need for a common understanding among publishers to ensure that content was mineable, and it has paved the way for new initiatives involving text mining across the industry.

CrossRef's Prospect

One initiative Elsevier is collaborating on is CrossRef's Prospect, currently in beta. Prospect aims to solve the most frequent pain points for researchers seeking to mine content from multiple publishers by providing two services:

  1. A common API that can be used by researchers to access the full text of content across publisher sites using a single, consistent mechanism.
  2. A common license framework that enables researchers to read and agree to terms and conditions from multiple publishers in a single portal.

Elsevier is the first publisher to have fully integrated with the Prospect service, but many other publishers are expected to follow in the coming months.

What is the future of TDM?

One of the final conclusions from our pilots is that, while everyone recognizes the opportunity that text mining brings, it is a specialized process. Many researchers are looking for services that make this process easier so they can concentrate on the part they do best – research. Here at Elsevier, we are continually working on ways to make it easier to text mine by both improving our technology support and optimizing the publication process to make content mineable.

Because the real value for text mining lies in the underlying highly curated and enriched XML format, Elsevier is working on a number of projects to enhance the content and make it richer for text miners. Examples include our data linking projects as well as pilots to help manage data. In addition, we have been scaling up our open access publishing program and are offering authors a choice of user licenses with built-in text-mining permissions.

We also continue to work on a number of pilots to improve our technology support. For example, in our pilot with The National Centre for Text Mining (NaCTeM), we are integrating their text-mining infrastructure with Elsevier content utilizing cloud technology, thus avoiding the need for researchers to build and maintain TDM infrastructure.

The future is exciting for text-mining possibilities, with both publisher and industry-wide solutions making it easier than ever for academics to text mine. Elsevier's updated policy is a reflection of our ongoing collaboration within the research community. We encourage anyone with feedback and ideas to contact us via universalaccess@elsevier.com.

Presentation: Facilitating Text and Data Mining

Chris Shillum, VP of Product Management, Platform and Content for Elsevier, formally launched Elsevier's updated text- and data-mining policy on Sunday at the 2014 American Library Association Midwinter Meeting.


Text mining example — information extraction

An example of text mining is the UCSC Genome Browser project, created by Dr. Max Haeussler as a post-doctoral scholar of biomolecular engineering at the University of California, Santa Cruz (UCSC). Used by thousands of researchers around the world, this tool enables researchers to discover papers that mention specific genes and overlapping sequences. As part of this ongoing project, Dr. Haeussler has text-mined more than 6 million Elsevier articles and a subset of open access content from PubMed Central looking for gene sequences mentioned in papers. He mapped the results onto the human genome representation in the UCSC Genome Browser.

Elsevier Connect Contributor

As VP of Product Management, Platform and Content for Elsevier, Chris Shillum is responsible for Elsevier's shared product platform. His team looks after the content management systems, access management systems and APIs that power our flagship products including ScienceDirect and Scopus. His current work includes looking into the application of text analytics and big-data processing capabilities to our products, and he is helping to define our text-mining strategy. He also represents Elsevier on the boards of key industry nonprofit organizations, including CrossRef and the International DOI Foundation. Most recently, he has been leading Elsevier's participation in the ORCID initiative and helping to ensure its success.

Shillum has worked in various capacities on Elsevier's online products since joining the original ScienceDirect team in 1997, when he designed and implemented the original end-to-end content workflow. He has deep and broad knowledge of platform architectures, search technology, content management technology and federated access management systems. He holds a master's degree in electronic systems engineering from the University of York in the UK.

3 Archived Comments

Alberto Duran Meza February 10, 2014 at 1:02 pm

I´m aware all benefits and Academic purposes of your great Organization Trust,

yes, you can consider that, I´m author directly connected with Mathematical


Thanks you.

Beste regards,

sincerely, Alberto Duran Meza.

University JMVargas, Faculty Education/Engineering


Vic Patrangenaru February 16, 2014 at 3:45 pm

Keep me posted on this.

Xinyou Meng February 19, 2014 at 1:09 am

Very good!
