What if you could create an accurate summary of a lengthy article at the touch of a button? What if you could quickly scroll through a bibliography, filtered to show only the citations relevant to your needs? What if you could get your research out into the world faster, and have that knowledge built upon sooner?
Science and technology are generating more data than ever, faster than ever, so it’s getting harder and harder to keep up with and manage this information. Therefore, it’s crucial to find ways to automate the discovery and interpretation of the information we need – and only that information.
To this end, Elsevier and its academic partners are creating new solutions that involve big data, machine learning and natural language processing (“NLP”).
Intelligently enriched bibliographies
Dr. Angelo Di Iorio, Assistant Professor in the Department of Computer Science and Engineering at the University of Bologna, is taking a fresh look at one of the most used parts of any article: the bibliography.
“If we can annotate the citations and bibliographies in a way that can provide more information,” Dr. Di Iorio said, “these will become far more useful for the reader; more useful for the scholars in many of the tasks that they usually perform, like reviewing, reading, comparing information and so on.” He explained:
We are trying to characterize each citation not only in terms of objective information like the date of the publication or the venue, but also to go a bit deeper to understand why a specific paper is cited.
While the project, known as SCAR (“Semantic Coloring of Academic Reference”), is still in its early stages, Dr. Di Iorio has already developed tools that can extract information on citations using natural language processing and common ontologies (representations of concepts and their relationships) that can be openly accessed and connected to other sources of information.
Ultimately, he said, “the idea of the project is to enrich the bibliography in order to give more information about each single entry, instead of conceiving of a bibliography as a monolithic unit.”
“The collaboration with Elsevier is important,” he added, “because they can provide us a huge amount of data in many different fields, and some of these disciplines are things that are almost totally covered by Elsevier. So we can have a clear picture of the literature in one area, and we can compare different ways of using bibliographic information in different areas.”
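To make the idea of per-entry enrichment concrete, here is a minimal sketch of a bibliography in which each citation carries a label describing why it was cited, so a reader can filter on that reason. It is not the SCAR implementation: the `Citation` class, the `filter_by_function` helper, the label names and the placeholder entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    """One bibliography entry plus hypothetical enrichment labels."""
    title: str
    year: int
    venue: str
    # Why the citing paper refers to this work; label names are illustrative only.
    functions: set[str] = field(default_factory=set)


def filter_by_function(bibliography: list[Citation], wanted: str) -> list[Citation]:
    """Return only the entries cited for the given reason, in original order."""
    return [c for c in bibliography if wanted in c.functions]


# Placeholder entries, not real publications.
bibliography = [
    Citation("Paper A", 2012, "Journal X", {"uses_method_from"}),
    Citation("Paper B", 2015, "Conference Y", {"extends", "uses_method_from"}),
    Citation("Paper C", 2016, "Journal Z", {"critiques"}),
]

method_sources = filter_by_function(bibliography, "uses_method_from")
print([c.title for c in method_sources])
```

Treating each entry as its own annotated record, rather than the bibliography as one block of text, is what makes this kind of filtering possible.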
Teaching machines to summarize
The availability of large datasets is crucial for machine learning projects. Dr. Yi-Ke Guo, Professor of Computing Science at Imperial College London, is using similar Elsevier data but approaching the problem from a different direction. His team uses machine learning and NLP to create meaningful summaries of articles via neural networks. To accomplish this, a large set of documents is fed to the neural network, and then the network is trained to make inferences about any new document based on vocabularies and ontologies provided by Dr. Guo.
Using these tools, the system will be able to create cogent, meaningful summaries of research articles, reducing the time it takes to glean the essential information from an article nearly tenfold. “You may write a seventeen-page article, and the goal is to generate a two-page summary,” Dr. Guo laughed. “That would be great, right?”
Those tools, however, require extraordinary computing power. Mapping a complex document like a research article semantically is computationally intensive. “Elsevier’s contribution was more than just providing datasets,” Dr. Guo said. “(Via LexisNexis), they also provided the High Performance Computing Environment, and we use that environment along with our underlying computing engine.”
HPCC Systems is an open-source, data-intensive supercomputing platform designed to solve big data problems. According to HPCC Director Flavio Villanustre, “as part of its open-source community outreach efforts, Elsevier has numerous collaboration initiatives with higher education institutions and researchers around the world.”
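For readers who want a feel for the summarization task described above, the following is a deliberately simple stand-in, not Dr. Guo's neural approach. It is a frequency-based extractive summarizer that scores each sentence by how often its longer words occur in the whole text and keeps the top few. The function name `summarize`, the word-length filter and the sample text are assumptions chosen for illustration.

```python
import re
from collections import Counter


def summarize(text: str, max_sentences: int = 2) -> str:
    """Pick the highest-scoring sentences, keeping their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


example = (
    "Neural networks learn from data. The weather was nice. "
    "Neural networks need large data sets to learn well. I had lunch."
)
print(summarize(example, max_sentences=2))
```

A trained neural model can rephrase and condense content, whereas this sketch can only select existing sentences; it illustrates the goal rather than the method.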
Novel ways to assess novelty
It was in that spirit of collaboration that Elsevier partnered with Dr. Pushpak Bhattacharyya, Director of the Indian Institute of Technology Patna and Professor of Computer Science and Engineering at the Indian Institute of Technology Bombay, to create the Center of Excellence at the Indian Institute of Technology Patna, a facility dedicated to machine learning and NLP.
Dr. Bhattacharyya, along with his colleagues Dr. Asif Ekbal and Mr. Tirthankar Ghoshal, is working to stem the overflow of information on the publishing side by using machine learning to help journal editors manage submissions. Developing automated support for the article reviewing process is one of the central activities at the Center of Excellence.
Dr. Bhattacharyya said the project was envisioned “with the idea that the publishing industry has to assign submitted articles to the correct set of reviewers, and that the reviewers and the articles should match in terms of the topic and suitability of the article and the proficiency and experience of the reviewer.” Academic journals receive far more articles than they publish, and a system that can automatically review submissions allows editors to focus only on those that are most suited to the journal, thus reducing the time between submission and publication.
The system considers two major factors: scope and novelty. “First and foremost, the article has to be in the scope of the journal,” Dr. Bhattacharyya explained. “Say the journal is on artificial intelligence, and the submission is based on thermodynamics. Then clearly the article is out of scope, and the journal would not be able to accept it.”
Once an article has been deemed to be within the scope of the journal, however, it must also be considered novel.
“The detection of whether something is novel or non-novel is a machine-learning problem,” Dr. Bhattacharyya said. “No respectable journal would accept an article unless the article has something novel to say. We could simply compare words and phrases to see if these are present in existing articles, but that could be misleading; a novel text could appear similar to others on the surface. To solve this, we need to go deeper, using machine learning tools.”
Dr. Bhattacharyya explained the process they developed, trying it out with newspaper articles:
We started comparing documents based on lexical similarity, stylistic similarity, and finally, deep semantic similarity, using word, sentence, and paragraph vectors. We created semantic representations of these documents and then checked to see how similar they are to existing documents in the semantic space.
Once Dr. Bhattacharyya’s team is able to determine novelty in newspaper articles, they’ll tackle scientific articles. “We aim to employ the knowledge gained from these experiments on existing scientific data, via the Scopus database, thanks to Elsevier.”
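The layered comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not the team's system: `jaccard` measures lexical overlap, `cosine` compares simple term-count vectors as a stand-in for the learned word, sentence and paragraph vectors mentioned in the quote, and `novelty` is a hypothetical score defined here as one minus the highest similarity to any existing document.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Lexical overlap: shared vocabulary divided by combined vocabulary."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cosine(a: str, b: str) -> float:
    """Cosine similarity of term-count vectors (a stand-in for learned embeddings)."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def novelty(candidate: str, existing: list[str]) -> float:
    """Hypothetical score: 1 minus the highest similarity to any existing document."""
    if not existing:
        return 1.0
    return 1.0 - max(cosine(candidate, doc) for doc in existing)


archive = ["Storm closes coastal roads", "City council approves new budget"]
print(novelty("Council approves the city budget", archive))
print(novelty("Researchers sequence a rare orchid genome", archive))
```

Term counts only capture surface overlap, which is exactly the limitation Dr. Bhattacharyya points out; swapping `cosine` for a similarity over learned semantic vectors would be the step toward the deeper comparison he describes.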
Each of these teams is creating systems to perform tasks that only a few years ago would have been impossible for computers, but which are now becoming commonplace as a result of their collaborations with Elsevier.
Dr. Bhattacharyya reflected on his work:
It is important that our research is tied up with an actual problem. It is important that a live problem should drive the research. This collaboration with Elsevier has been extremely fruitful and useful in that we have got a real, live problem in front of us. We are from academia; academia is always on the lookout for problems that matter.