Q&A with a text mining pro: How difficult is it to text mine?
NeuroElectro founder Dr. Shreejoy Tripathy talks about the value — and challenges — of text mining, and how it could improve in the future
By Rachel Martin and Gemma Hersh Posted on 10 September 2014
Researchers are increasingly applying text- and data-mining techniques to systematically extract key information from the published literature in order to generate new insights. An example of the power of text mining is NeuroElectro, a database that allows users to compare and analyze the electrical properties of neurons.
NeuroElectro was established by Dr. Shreejoy Tripathy (@neuronJoy), a neuroscientist and postdoctoral researcher at the Centre for High-Throughput Biology at the University of British Columbia, while obtaining his PhD in neural computation from Carnegie Mellon University in 2013. His research led him to apply his programming knowledge and text mining tools to extract and compare this data, establishing a rich resource for the research community and beyond.
Dr. Tripathy recently chatted with us over the phone about his experiences setting up NeuroElectro and the challenges and future of text mining. This is a condensed version of the interview.
What type of problem were you looking to solve with text mining?
As you may know, the brain is composed of many different cells called neurons, and as a neuroscientist, I am interested in the electrical properties of neurons. Often this type of information exists in the published literature, so I initially wondered if it would be possible to extract this information and put it into a database so I could analyze it. I was looking to see if I could say interesting things about the relationships between different kinds of neurons, or maybe integrate information about the electrical properties of neurons with information on the same neurons from other data sources, like the genetics of different neurons, or their shapes, or their functions or their connections.
The reason text mining is such a nice solution to this problem is that it would be very difficult and incredibly costly for any one lab to re-collect all this data themselves. Because all this data already exists in the literature, text mining seemed like a good way to get this information relatively easily.
How did you know how to text mine?
I see myself as a neuroscientist who knows how to write some programming code, not the other way around. At the start of the project, I definitely was not an expert in natural language processing, ontologies or bioinformatics. So I approached this from the angle of "I have a research question," and I pretty much learned everything on the go. Honestly, it hasn't been very difficult, because there are lots of open-source tools that make simple text mining very easy.
For example, there are open-source tools that take individual sentences within a paper, like "Jack jumps over the brown fox," and process them to identify "Jack" as the noun, "jumps" as the verb and so on. This means I can take these off-the-shelf open-source tools, modify them slightly and then use them for my purpose. The barrier to text mining is actually quite low for non-experts, as long as you have some amount of programming expertise.
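The task described here is part-of-speech tagging, which off-the-shelf libraries such as NLTK provide ready-made. As a minimal illustration of the idea, here is a toy stand-in in Python; the lexicon and the fallback tag are invented for this sketch and are not from any real tagger:

```python
# Toy part-of-speech tagger: a lookup table standing in for an
# off-the-shelf tool such as NLTK's tagger (lexicon is illustrative only).
TOY_LEXICON = {
    "jack": "NOUN", "jumps": "VERB", "over": "ADP",
    "the": "DET", "brown": "ADJ", "fox": "NOUN",
}

def toy_pos_tag(sentence):
    """Tag each word, defaulting unknown words to NOUN."""
    words = sentence.rstrip(".").split()
    return [(w, TOY_LEXICON.get(w.lower(), "NOUN")) for w in words]

print(toy_pos_tag("Jack jumps over the brown fox"))
# [('Jack', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'),
#  ('the', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN')]
```

A real project would swap the lookup table for a trained statistical tagger, which is exactly the "modify slightly and reuse" workflow described above.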
What were your challenges when text mining?
If I had to redo the NeuroElectro project today, knowing what I know now about text mining, I would have done things differently. One of the major challenges I encountered in this project is that different publishers have different HTML publishing standards. So an Elsevier article might call its methods section "materials and methods," but in Wiley or Highwire, they call their methods section "experimental procedures." That problem is least complicated for articles mined through ScienceDirect, as there is a really nice API and the documents have a consistent structure. I know that my code that, say, parses data from ScienceDirect will work just fine for articles from entirely different journals, as long as they are on ScienceDirect.
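One pragmatic way to cope with these publisher-specific heading conventions is a small alias table that maps each variant to a canonical section name before any further parsing. The aliases below are a hypothetical sketch, not an exhaustive mapping:

```python
# Map publisher-specific section headings to canonical names.
# (Hypothetical aliases; real articles use many more variants.)
SECTION_ALIASES = {
    "materials and methods": "methods",    # e.g. one publisher's style
    "experimental procedures": "methods",  # e.g. another publisher's style
    "methods": "methods",
    "results": "results",
    "discussion": "discussion",
}

def canonical_section(heading):
    """Return the canonical section name, or None if unrecognized."""
    return SECTION_ALIASES.get(heading.strip().lower())

print(canonical_section("Experimental Procedures"))  # methods
```

Downstream code can then ask for "methods" once, instead of branching on every publisher's wording.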
I also noticed that the ScienceDirect API returns XML, and again, if I were starting this project from the beginning, I would use only XML, because XML is a better markup language for this than HTML. But since other publishers don't export XML, I was forced to use HTML, and so I wrote all my algorithms assuming that the content I was mining was HTML.
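To illustrate why structured XML is pleasant to work with, here is a minimal sketch using Python's standard-library ElementTree on a made-up article fragment; the tag names and layout are invented for this example and do not reflect the actual ScienceDirect schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment; the real ScienceDirect XML differs.
doc = """
<article>
  <section title="methods">Cells were recorded at 34 C.</section>
  <section title="results">Input resistance was 120 MOhm.</section>
</article>
"""

root = ET.fromstring(doc)
# With explicit structure, finding a section is a simple query,
# not a heuristic scan over rendered HTML.
methods_text = next(
    sec.text for sec in root.iter("section")
    if sec.get("title") == "methods"
)
print(methods_text)  # Cells were recorded at 34 C.
```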
My ultimate dream is that there would be just a single API service, where all the content was delivered in a single format, like XML or HTML — I don't really care which one but just pick one — so we could write our algorithms to parse that single format rather than have to deal with different formats across different publishers.
Have you heard of CrossRef's text and data mining service?
Yes, and I think that would be perfect for people like me, so we don't have to deal with the differences between publishers. Every additional step where I have to stop and deal with the vagaries of different publishers is time spent not solving the problem that I initially set out to solve. Anything that publishers can do to reduce the barrier to text and data mining would help me as a researcher, because that would let me just do the research and not worry so much about the bureaucracy and licenses and all of this stuff to be able to start text and data mining.
Did you find that registration process difficult?
Let me start from the beginning, as it's helpful to describe what happened. I guess that was January last year (2013), when I realized I needed to get text mining access to the ScienceDirect API. It wasn't entirely clear who I needed to talk to, and ultimately, it was decided that I needed to go find my university librarian. Initially, I didn't know who I was supposed to email within my university so, of course, it's going to take some time to figure out who is the right person to contact about that. Honestly, it took about four to six months to first contact my librarians and have them contact their representative at Elsevier and for me to even have the OK to begin text and data mining.
But after all the legal stuff was in place, getting the actual API key was trivial – it took an hour or so.
Were you concerned about what you could and couldn't do with the text mining output?
That is a great question. To provide some context: oftentimes a scientist will publish a data table that nicely summarizes their study, rather than representing this in a figure or image or as free text. As you can imagine, it is easier to mine the information out of that nicely formatted data table than to apply some image-processing algorithm, or some very sophisticated natural language processing algorithm, to mine it out of the article text. When I re-display this information on neuroelectro.org, I would like to show the table from which I mined the information and what my algorithm identified as specific entities and concepts, so that the user can see the original table and then see that table marked up by the algorithm. Honestly, text mining algorithms always make mistakes, and by showing the context from which the data was extracted, I am allowing my users to provide feedback when information is shown incorrectly.
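As a rough illustration of why tables are such attractive mining targets, a few lines of standard-library Python can pull every cell out of a simple HTML table. The table below is invented for this sketch, and real article tables need extra care for row/column spans and nested markup:

```python
from html.parser import HTMLParser

class TableCellExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Invented example table of neuron properties.
html = """<table>
  <tr><th>Neuron type</th><th>Resting potential (mV)</th></tr>
  <tr><td>Pyramidal cell</td><td>-70</td></tr>
</table>"""

parser = TableCellExtractor()
parser.feed(html)
print(parser.rows)
# [['Neuron type', 'Resting potential (mV)'], ['Pyramidal cell', '-70']]
```

Extracting the same numbers from free text or from a figure image would require far heavier machinery, which is the asymmetry described above.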
Adjusting Elsevier's TDM services to better assist researchers
By working with researchers such as Dr. Tripathy, Elsevier has been able to understand the specific text mining needs of researchers and find ways to improve our technology to support those efforts. This has resulted in some key changes:
We are working with our academic customers to add text- and data-mining rights in our standard ScienceDirect subscription agreement. Further, our self-service developers' portal makes it easier for researchers to automatically gain access to the API for TDM without lengthy delays.
We provide content for text miners through our API, which standardizes the publishing format of the article, making text mining code work consistently and reliably for all journals on our platform.
We use XML format which is preferred by text miners.
We are one of the first publishers to participate in the CrossRef Text and Data Mining Services program, which will provide a single API to text miners across multiple publishers, helping to eliminate the differences in publishers' policies.
(With) all the publishers I have explicit agreements with, I have been very clear about showing the entire table on the website. When I asked explicitly for this (showing the entire data table), as in some cases I show well over 2,000 characters, no publisher has had a problem. I found people at Elsevier specifically to be incredibly accommodating of my requests. Whenever I asked something like, "Can I show this full table rather than a small snippet?" it has always been a "yes," or "yes, but please link," rather than a "no," a disagreement, or "you can only show 200 characters" – that is never set in stone. It is not clear to researchers that you are as flexible as you are.
Do you see more scientists in your field doing text mining?
Among neuroscientists, I see what I do as a very niche activity, with the majority of neuroscientists still doing traditional wet lab research. Although text mining is gaining in popularity, I still think that in a field as big as neuroscience, you only really need 500 text miners in a field of 100,000 people. It's not like we need 10 times more researchers doing text mining in the same exact domain, because text mining scales very well.
What are your thoughts about the future of text mining?
My observation with text and data mining is that it is always going to be imperfect, as the information that I can extract from an article, at least for me, is not quite the information I want. What scientists report in their articles are common summary measurements of the electrical properties of neurons. What I would love to have is the raw data: the actual measurements taken from the neuron as they come off the amplifier and the electrode. Effectively, that would give me a thousand times more information, which would be a richer resource for further analysis than these relatively simple summary measurements. I see text and data mining as a first step, but I would like to move toward raw data sharing. If more raw data sharing practices were in place, there wouldn't really be a need for text and data mining.
What does the future hold for NeuroElectro?
We are going to continue to develop the resource and keep it up to date. But an even more important part of NeuroElectro is actively utilizing the information we have compiled. It's about getting back to answering research questions like, "What new things can we learn from the database that we didn't already know before?" We want to demonstrate that this resource is not just a look-up tool for other scientists but that it is actively used to discover new information.
One-minute video: Dr Tripathy on NeuroElectro
NeuroElectro is a website established by Dr. Tripathy to organize information on the electrophysiological properties of neuron types in the brain. He explains more in this short video.
Elsevier Connect Contributors
Rachel Martin (@rachelcmartin) is the Universal Access Communications Manager at Elsevier, based in Amsterdam. She is responsible for helping to communicate Elsevier's progress in areas such as open access, philanthropic access programs and access technologies.
As Policy Director on the Policy and Access team at Elsevier, Gemma Hersh (@gemmahersh) is responsible for developing policy for open access, copyright and other areas that impact the scholarly research and publishing communities. Her current work includes looking at open access and copyright developments globally and emerging areas such as Massively Open Online Courses (MOOCs) and Open Education Resources.
Before joining Elsevier this year, Hersh was Head of Public Affairs for the UK Publishers Association and has worked in the creative industries, both in government and in industry, for the last six years. She holds an MPhil in Politics and Comparative Government from Oxford University, but her real love is History, in which she holds a First Class Degree from King's College London.