Evidence-based policy is increasingly important for scientists and policymakers as well as society. Scientists often need to show how their research is influencing the policy environment, and policymakers want to ensure that legislation is built on reliable, peer-reviewed research.
With today’s technology, it’s possible to make policy decisions based on real evidence, but with more and more research being published, it’s hard to track how the two are related – especially if the research is stored in a range of sites and repositories.
To tackle this problem, a team from Elsevier and LexisNexis Risk Solutions (both owned by Elsevier’s parent company, RELX Group) has been working with researchers in Prof. Tim Menzies’ RAISE (Real world AI for SE) lab at North Carolina State University. They are constructing a search engine that lets people find similar legislative and research documents using keywords. In a project assessing the impact of scientific articles on US legislation, they came up with a solution to easily find links between 100,000 legislative documents and 75,000 research articles.
Elsevier provided text mining access to the research articles in Scopus, and the team used the open source HPCC Systems supercomputing platform to carry out the work. The results revealed a strong link between research and legislation, though it’s not yet possible to determine whether research influences legislation or researchers pursue certain avenues because of legislation.
The magic of text mining
Analyzing and comparing the content of 175,000 text-based documents would be impossible to do manually. The key to this project is text mining – a method Dr. Tim Menzies, Professor of Computer Science at North Carolina State University, believes is vital for building our knowledge base:
Eighty percent of the world's data is not in a structured form, so if you really want to create the next generation of knowledge, you have to deal with text. Text survives, text is ubiquitous, text is low bandwidth – text is fantastic!
The human race evolved text for a really good reason: it’s a great way to share and preserve knowledge. One of the surprising results of the last 20 years of artificial intelligence is that there are regular patterns in natural language: human language has a bunch of high-frequency patterns that let people like you and me reason about language. Now we’ve got text-mining tools and cloud computing that let us find and exploit those patterns.
Elsevier actively collaborates with researchers and institutes to facilitate text and data mining by enabling access and by developing our platforms, tools and services to support researchers. Meanwhile, in 2014, LexisNexis Risk Solutions had been reaching out to universities looking for research partners in big data text mining and funded a small set of highly speculative pieces of work. Dr. Menzies and his team were the beneficiaries of the funding in 2014 and 2015, with the task of drawing links between legislative and research documents.
At LexisNexis Risk Solutions, Flavio Villanustre and Trish McCall put Dr. Menzies and his student, PhD candidate George Mathew, in touch with Ann Gabriel, VP of Academic and Researcher Relations at Elsevier. Elsevier’s initial goal was to find citations to the scientific literature in particular pieces of legislation. They started by looking at specific topics.
Mathew went through tens of thousands of documents about legislation on research.gov and found a pattern. He then used that pattern to query the research literature in Scopus and found 75,000 research documents:
We looked at the research articles using a clustering technique. We scrolled down all the science and technology documents on research.gov in the last 20 years and clustered them into 15 groups using latent Dirichlet allocation (LDA). This groups similar documents together based on how frequently similar words occur amongst them. For each of the 15 clusters, we identified the top 10 keywords.
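For readers who want to see the idea in code, here is a minimal sketch of LDA fitted with a collapsed Gibbs sampler, using only the Python standard library. It is an illustration and not the team's implementation: the toy corpus, the function names (`lda_gibbs`, `top_keywords`), the hyperparameters `alpha` and `beta`, and the choice of two topics and three keywords are all assumptions made for this example. The project itself used 15 clusters and the top 10 keywords per cluster.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    docs is a list of token lists. Returns (topic_word, doc_topic), where
    topic_word[k] counts the words assigned to topic k and doc_topic[d][k]
    counts the tokens of document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to
                # P(topic | document) * P(word | topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic


def top_keywords(topic_word, n):
    """Return up to n most frequent words for each topic."""
    return [
        [w for w, count in counts.most_common(n) if count > 0]
        for counts in topic_word
    ]


# Hypothetical toy corpus: three energy documents, three health documents.
docs = [
    "solar wind energy grid energy".split(),
    "wind energy turbine grid".split(),
    "solar panel energy storage".split(),
    "vaccine trial health patient".split(),
    "patient health hospital care".split(),
    "vaccine health care trial".split(),
]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
keywords = top_keywords(topic_word, n=3)
for topic_id, words in enumerate(keywords):
    print(topic_id, words)
```

On a real corpus one would normally remove stop words first and use an established library rather than a hand-written sampler; the sketch only shows how word co-occurrence drives the grouping.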
They used the keywords to construct search queries in Scopus: for each of the clusters, they used the 10 keywords to retrieve 5,000 documents. They clustered those documents too and saw that the clusters were innately similar for both the legislative documents and the Scopus documents. This was their first “sanity check,” Mathew noted. “This showed us that the method is not bogus; it actually makes sense.”
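The article does not say how the queries were phrased, so the following is only a sketch of one plausible way to turn each cluster's keywords into a search string. The `TITLE-ABS-KEY` field code comes from Scopus advanced search; joining keywords with `OR`, the helper name `build_query`, and the sample keywords are assumptions. At 5,000 documents for each of 15 clusters, this step is consistent with the 75,000 research documents mentioned above.

```python
def build_query(keywords, operator="OR"):
    """Combine one cluster's keywords into a single boolean search string."""
    quoted = [f'"{kw}"' for kw in keywords]
    return "TITLE-ABS-KEY(" + f" {operator} ".join(quoted) + ")"


# Hypothetical keywords for two clusters (the project had 15 clusters
# with 10 keywords each).
cluster_keywords = {
    0: ["energy", "solar", "grid"],
    1: ["health", "vaccine", "patient"],
}
queries = {cid: build_query(kws) for cid, kws in cluster_keywords.items()}
for cid, query in queries.items():
    print(cid, query)
```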
As Ann Gabriel explained:
Dr. Menzies and the RAISE lab at NC State offered a new way for us to examine the impact of science on legislation. Early attempts sought to identify journal mentions in the legislation, which are not uniformly cited. The clustering technique they developed provided evidence of a connection across topics in both domains – something we can build on with a view to individuals and institutions in future.
Finding the links between research and legislation
Confident the method was sound, Mathew could take the next step and convert all the legislative and research documents into vectors. In text mining, the vector space model is an algebraic model that represents each document as a vector of identifiers, such as index terms. Effectively, it puts the complex, unstructured text into a mathematical space where it’s possible to see clearly how the documents are related to one another and how closely they are related. “The closer the vectors are, the closer the documents are to each other in terms of their content,” he explained.
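As a rough illustration of the vector space model, the sketch below turns short texts into term-count vectors over a shared vocabulary. The article does not state which weighting the team used (raw counts, TF-IDF or something else), so plain term counts, the simple tokenizer and the function names are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vectors(texts):
    """Map each text to a term-count vector over one shared vocabulary."""
    token_lists = [tokenize(text) for text in texts]
    vocab = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # One dimension per vocabulary term, in a fixed order.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Hypothetical one-line "documents".
vocab, vectors = build_vectors(["clean energy act", "energy research energy"])
print(vocab)    # ['act', 'clean', 'energy', 'research']
print(vectors)  # [[1, 1, 1, 0], [0, 0, 2, 1]]
```

Because every document shares the same dimensions, any two of them can be compared numerically, whether they are legislative or scientific.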
He then validated the approach using a method called cosine similarity. The results showed that documents returned using the same search terms had a similarity of 97 percent, compared with 75 percent for randomly chosen documents.
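Cosine similarity measures the angle between two document vectors: the dot product divided by the product of the vector lengths. A minimal sketch follows; the example vectors are made up and are not the project's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for empty documents.
    return dot / (norm_a * norm_b)


# Hypothetical term-count vectors for two documents.
legislation = [1, 1, 1, 0]
research = [0, 0, 2, 1]
print(round(cosine_similarity(legislation, research), 3))  # 0.516
```

For non-negative count vectors the score runs from 0 (no shared terms) to 1 (identical term proportions), which is why it can be read as a percentage.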
John Holt, Senior Architect at LexisNexis Risk Solutions and a consultant on the project, helped Mathew do some of the analysis using legal data and worked with him on the text analytics:
It was really interesting – there was almost no end to it! The linkages revealed in the results were encouraging enough that this should be pursued with additional analysis.
The research documents Mathew selected using the pattern were very close to the legislation, which means the team had developed a way to find research related to legislation. But it’s not yet possible to identify a directional link – to show whether the science is influencing particular policies or vice versa. As Dr. Menzies explained:
It’s like two football teams running in opposite directions on the same field – we can't say if legislation is listening to the science. What we can say is they’re playing on the same field. The connection between, say, climate change and legislation documents is far more complicated than we thought. The communication paths are not one single ‘hero’ document changing the world. It's one of the reasons that society supports science – science is a large, diverse collection of research, and in time, that diversity leads to a better society.
Balancing automation and human intervention
While the team automated most of the process, it was important to jump in with manual sanity checks along the way, as Dr. Menzies said:
One tedious thing Mathew did was to manually check a random sample of the clusters. In the era of big data, it’s important not to go crazy on full automation; you have to inject some humans to validate your results for small randomly-selected samples. Pablo Picasso said, ‘Computers are stupid, they can only give us answers.’ So if you get lots of answers, make sure you're asking the right questions first.
But the process wasn’t nearly as labor intensive as it would have been without the support of Scopus. By working with Elsevier on the project, Mathew and Dr. Menzies could access and text mine the 75,000 documents they identified using the pattern-based search, to track the correlation between research and legislation.
As well as providing the actual content for the project, this also gave the team a methodological advantage, as Mathew explained:
Scopus is an innovative tool for exploring literature, and its features were very helpful for this project. The research documents are organized in a particular manner, and Scopus has an internal algorithm that identifies the particular keywords. I think that if I had to do that manually, it would be a lot of trouble; the internal rubrics of Scopus, the way it organizes data, and the metadata of the associated documents helped us a lot.
Take DOIs, for example. Mathew said DOIs were extremely useful in getting access to the 75,000 research documents used in the project. And Dr. Menzies noted that DOIs made de-duplication of the results trivial:
In other text-mining domains, de-duplication is a massive task. How do you check if you've got any duplicates in 75,000 documents? You could do some queries to get possible leads, but you'd still need to manually check thousands of documents. It would've been a monstrous task for this research.
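To see why a unique identifier makes this step easy, consider the sketch below: each record is reduced to its normalized DOI and only the first record per DOI is kept. The record layout (a dictionary with a `doi` key), the function names and the sample DOIs are hypothetical; the lowercasing relies on DOIs being case-insensitive.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' label."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def deduplicate(records):
    """Keep the first record seen for each DOI, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_doi(record["doi"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records with made-up DOIs; the first two are the same article.
records = [
    {"doi": "10.1000/ABC123", "title": "Grid storage study"},
    {"doi": "https://doi.org/10.1000/abc123", "title": "Grid storage study"},
    {"doi": "doi:10.1000/xyz789", "title": "Vaccine trial report"},
]
unique = deduplicate(records)
print(len(unique))  # 2
```

A single pass with a set lookup scales linearly with the number of records, so 75,000 documents is a small job once every record carries a DOI.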
The results have already led to Mathew winning third prize for his poster at the 2017 HPCC Systems Technical Poster Presentations Competition, hosted by LexisNexis Risk Solutions. The event brings together researchers and individuals from academia and industry to exchange ideas and share experiences on the advances in HPCC Systems, which the team used in this project.
And there’s more to come. The team’s plans include analyzing a larger body of research and legislation documents, and looking at the links in more detail, including across time. They are now in the process of securing funding to continue their work in this area, and to see whether they can produce an intelligent server of technical documents in the software engineering realm.
The research team at North Carolina State University would like to thank Ann Gabriel at Elsevier and Flavio Villanustre, Trish McCall, Satya Mukhopadhyay and Sanjay Narla at LexisNexis Risk Solutions “for their invaluable support.”