Demystifying text and data mining

What it is, how it works and how Elsevier can help you get started

There are millions of articles and book chapters out there, packed with information that might help you in your work. But how do you find what you need?

Text and data mining (TDM) could be the answer. TDM uses software to filter and analyze large collections of content, helping you make sense of vast data resources. TDM tools rely on natural language processing (NLP), which often draws on machine learning. TDM is more than a search tool like Google or Bing: it can also analyze the output, detecting connections and patterns at a volume and speed that would be impossible to achieve manually. This not only makes research more efficient, saving time and money; it can also transform each step of the process, making research more effective.

This two-minute video introduces you to the basics.

How does text and data mining work?

There are generally five steps involved in TDM:

  • Define exactly what you are looking for. In our experience, most researchers want either to answer a specific question or to test a hypothesis across a large number of articles or a long period of time. Identifying the fundamental ‘problem’ that you need to solve will determine the type of content you need and the TDM approach you should use to analyze it.
  • Select your approach. Typically, TDM tools are designed for general internet content such as news items or social media posts. So if you are researching scientific, technical and medical (STM) content – with its own jargon, abbreviations, uniquely formatted references and so on – you will probably need customized tools.

    There are three main options, each requiring different levels of knowledge of programming, statistics and linguistics:
    • Off-the-shelf. If you have basic technological skills, a ready-made TDM workbench can provide you with the building blocks to put together a customized tool.
    • Build your own pipeline. With more advanced programming skills, you can create your own tools.
    • Outsource to a specialist provider. Accurate text mining requires specialist expertise in NLP.
  • Access, download and extract. TDM tools are run against a working set of data and/or content known as a ‘corpus’. To assemble your corpus, you will need to bulk download the material you wish to mine. In the case of scientific content, you can download it from publisher platforms using an Application Programming Interface (API). Elsevier has an API that lets you access and download articles and book chapters for text mining. You can also download your corpus from multiple publisher platforms using CrossRef’s Text and Data Mining services.
  • Analyze your results. After applying your TDM tools, you will be left with a set of extracted values. Your TDM tool can then analyze this output to detect, and possibly visualize, trends and relationships.
  • Answer the problem. The output from text mining can yield new insights that help answer your research questions. After text mining, you can write up your results in a new research paper, or run further experiments having eliminated some of your initial theories.

This short video explains how the process works.
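To make the five steps more concrete, here is a minimal sketch in Python. It is an illustration only, not the method used by any particular TDM tool: the three-sentence corpus, the list of terms and the function names are all made up for the example, and simple keyword matching stands in for real NLP. It counts how often pairs of terms appear in the same document, which is one basic way of surfacing possible relationships.

```python
import re
from collections import Counter
from itertools import combinations

# Step 1 (define): hypothetical terms we care about for this toy question.
TERMS = {"aspirin", "ibuprofen", "inflammation", "fever", "bleeding"}

# Step 3 (access): a tiny in-memory corpus standing in for bulk-downloaded articles.
CORPUS = {
    "doc1": "Aspirin reduces inflammation. Aspirin may cause bleeding.",
    "doc2": "Ibuprofen reduces inflammation and fever.",
    "doc3": "Aspirin and ibuprofen both reduce fever.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_terms(text):
    """Step 3 (extract): return the set of known terms found in one document."""
    return {token for token in tokenize(text) if token in TERMS}


def cooccurrence(corpus):
    """Step 4 (analyze): count documents in which each pair of terms co-occurs."""
    counts = Counter()
    for text in corpus.values():
        found = sorted(extract_terms(text))  # sorted so each pair has one ordering
        counts.update(combinations(found, 2))
    return counts


pairs = cooccurrence(CORPUS)

if __name__ == "__main__":
    # Step 5 (answer): inspect the most frequent pairs as candidate relationships.
    for pair, n in pairs.most_common(3):
        print(pair, n)
```

A real pipeline would replace the keyword list with an NLP component able to handle STM jargon, abbreviations and synonyms, but the overall shape (corpus in, extracted values out, then analysis) stays the same.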

Defining TDM

Text mining is the analysis of the written word (articles, books, etc.), treating text as a form of data.

Data mining is the numerical analysis of data-based works (such as filings and reports).

At Elsevier, we support researchers who want to mine text and data. All our journals and book chapters are converted into XML, which is a text mining-friendly format, and they are available to mine through our API. To help you use it, we have set up a developer portal where you can register for an account and automatically create an API key to get started.
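As a rough sketch of what this can look like in practice, the Python below prepares a request for one article in XML and pulls plain text out of an XML document. Treat the details as assumptions to check against the developer portal documentation: the endpoint path, the `X-ELS-APIKey` header and the `text/xml` media type are what we would expect rather than a guaranteed contract, and the DOI, the API key and the sample XML are placeholders. The request is only built, not sent.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed article retrieval endpoint; confirm it in the developer portal docs.
API_BASE = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key):
    """Prepare (but do not send) a request for one article's XML."""
    url = API_BASE + urllib.parse.quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the API key
        "Accept": "text/xml",     # ask for the XML representation
    }
    return urllib.request.Request(url, headers=headers)


def xml_to_text(xml_string):
    """Flatten an XML document into whitespace-normalized plain text.

    Joining text nodes with spaces is a simplification: it can split words
    that contain inline markup, such as subscripts in chemical formulas.
    """
    root = ET.fromstring(xml_string)
    return " ".join(" ".join(root.itertext()).split())


# Placeholder XML, not a real article or the real article schema.
SAMPLE_XML = (
    "<article><title>Mining text</title>"
    "<body><para>XML keeps <b>structure</b> explicit.</para></body></article>"
)

if __name__ == "__main__":
    request = build_article_request("10.1016/j.example.2020.01.001", "YOUR_API_KEY")
    print(request.full_url)
    print(xml_to_text(SAMPLE_XML))
```

Sending the request with `urllib.request.urlopen(request)` would return the article XML if your key and entitlements allow it; the returned text could then be passed to `xml_to_text` before mining.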

For more information on how to get started, see our text and data mining page.


Written by

Rachel Martin

Rachel Martin is the Access and Policy Communications Manager at Elsevier, based in Amsterdam. She is responsible for helping to communicate Elsevier's progress in areas such as open access, open science, research data, philanthropic access programs and access technologies.

