When a team of highly motivated students and recent graduate researchers joined together during an intense weekend hackathon to aid in drug discovery for an orphan disease, they knew the data was going to be a huge challenge. Massive amounts of data trapped in disparate formats were going to slow the investigative process down enormously. But over the course of the weekend, they helped move the discovery project forward and learned how to streamline similar projects in the future.
Managing the ‘data dump’
The Pistoia Alliance, a global nonprofit that encourages pre-competitive collaboration among pharmaceutical companies, biotechs, vendors, publishers and academic groups, sponsored the President’s Series – Deep Learning Hackathon, held one weekend in March 2017 at WallSpace in London.
Elsevier actively supports the alliance and the Hackathon initiative, which connects deep/machine learning with the life science and healthcare industries to form weekend teams to aid in drug discovery.
Elsevier provided some initial curation of data on disease targets and potential drug targets ahead of time; however, the “Find Me a Drug” team, sponsored by Elsevier and the UK-based nonprofit Findacure, was still presented with what amounted to a data dump.
Working their way out of that data quagmire proved the most difficult part. “We spent most of our time the first day just trying to get our heads around the data so we could start to find some solutions,” said team member Daniel Rhodes, a PhD candidate in drug repurposing at Queen Mary College London. The team was tasked with interpreting data to uncover leads to drugs that might be repurposed to treat Friedreich’s ataxia – a rare hereditary disease that causes damage to the nervous system and problems with movement.
When Rhodes works on his PhD project, he takes a good deal of time getting to know his data and deciding what to do with it. But the hackathon team didn’t have that option. “Even opening the files was tricky,” he said. They used various tools to try to extract data from the provided XML files, but it was slow going. “We wound up having to do a lot of things manually so we could at least read the files in plain text.”
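The kind of extraction Rhodes describes, getting readable plain text out of XML, can be sketched with Python's standard library. This is a minimal illustration, not the team's actual workflow: the article does not describe the schema of the provided files, so the sample document, its tag names and the function name xml_to_plain_text are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The real hackathon files and their schema
# are not described in the article, so these tags are assumptions.
SAMPLE_XML = """<article>
  <title>Frataxin deficiency</title>
  <body>
    <p>Friedreich's ataxia is a <i>rare</i> hereditary disease.</p>
    <p>It damages the nervous system.</p>
  </body>
</article>"""


def xml_to_plain_text(xml_string):
    """Return all text in an XML document as one whitespace-normalized string."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text and tail fragment in document order,
    # including text nested inside inline tags such as <i>.
    fragments = (fragment.strip() for fragment in root.itertext())
    # Drop fragments that were only indentation, then join with single spaces.
    return " ".join(fragment for fragment in fragments if fragment)


plain_text = xml_to_plain_text(SAMPLE_XML)
print(plain_text)
```

Joining fragments with a space keeps the sketch short, but it would also split text that an inline tag interrupts mid-word, so a real pipeline would likely need schema-aware handling.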
While grappling with the ataxia data, the team discussed the avenues they wanted to pursue, such as connecting the various datasets to create a comprehensive network graph. But they quickly realized those creative ideas were too broad to be accomplished in the allotted time.
“We ended up with a relatively simplified network graph that was a visualization of just one of the datasets, something that was definitely manageable,” Rhodes said. “By cutting through a lot of the noise of the large dataset, and keeping our approach simple, we were able to identify areas where researchers might want to apply machine learning in the future. So although we really didn’t do any machine learning ourselves, we definitely pointed to opportunities for it.”
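A simplified network graph built from a single dataset can be sketched in a few lines. The example below is only an illustration under stated assumptions: the article does not say which dataset the team visualized or which tools they used, so the compound and target names are placeholders and the functions build_graph and rank_by_degree are hypothetical.

```python
from collections import defaultdict

# Placeholder records standing in for one dataset of compound-target links.
# These names are invented for illustration and are not hackathon data.
COMPOUND_TARGET_PAIRS = [
    ("compound_A", "target_1"),
    ("compound_A", "target_2"),
    ("compound_B", "target_1"),
    ("compound_C", "target_1"),
    ("compound_C", "target_3"),
]


def build_graph(pairs):
    """Build an undirected graph as a mapping from node to its set of neighbors."""
    graph = defaultdict(set)
    for compound, target in pairs:
        graph[compound].add(target)
        graph[target].add(compound)
    return graph


def rank_by_degree(graph):
    """Order nodes from most to least connected, breaking ties alphabetically."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))


graph = build_graph(COMPOUND_TARGET_PAIRS)
ranking = rank_by_degree(graph)
print(ranking[0], len(graph[ranking[0]]))
```

Ranking nodes by how many connections they have is one simple way to cut through noise in a large dataset: the most connected nodes are natural starting points for closer inspection or later modeling.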
Reaping the rewards
At the end of the weekend, the hackathon teams presented their findings to a panel of judges from the Pistoia Alliance. As a runner-up, the Elsevier/Findacure team received £500, some of which was donated to Findacure (see breakout box below).
For Findacure CEO Dr. Rick Thompson, the most important benefit from the event was that it “instantly raised the level of awareness for the concept of repurposing in rare diseases.” Highlighting how the team explored and began to work with the ataxia data, he said, “helped everyone see how data-mining can very quickly begin to make a difference to patient groups for diseases that are otherwise not getting any attention or study.”
The hackathon team gave its dataset visualization to Findacure, which in turn showed it to scientific advisors at the nonprofit Ataxia UK. The visualization could be used to point to new research avenues or validate work currently underway, Dr. Thompson said.
Either way, he said, “knowing how much this team did in essentially 24 hours, starting from having no clue about ataxia beforehand, is something that can hook researchers, and that patient groups need to know about as well, because if this can be done, then anything is possible.”
Beyond providing additional research support for Findacure and its partners, the hackathon was valuable to Elsevier on two levels, according to Tim Hoctor, VP of Professional Services for R&D Solutions at Elsevier and a member of the Pistoia Alliance Board of Directors.
On the one hand, he said, it underscored the power of collaboration – one of Elsevier’s core values: “It was amazing to watch the teams roll up their sleeves and take creative steps to start pulling the data together to answer a common question, and to observe the frameworks they used for approaching the problem.”
On the other hand, it enabled Elsevier’s education and development teams “to see first-hand the absolute, most common challenge for any end-user researchers culling data for directional insight in developing a drug candidate: They start with a boatload of data somebody else put together as ‘relevant’ for them, and they’re asked to make sense of it.”
Where did that data come from? Is it truly relevant? Even if it is, does it shed light on the disease itself, or a similar disease? Or a possible mechanism underlying both? Or a potential therapeutic target? “Those are just some of the questions they’ll need to ask,” Hoctor said, “but they can’t do much until they can look at those EMBL-EBI datasets and data from next-generation sequencing, clinical trials and other sources, and collectively harmonize and integrate it. For Elsevier, that was a critical learning.”
“We assume that because we have so much full-text data, just by making it available, researchers will know how to use it,” he added. “But they often don’t, and when they don’t, they bypass it and go to the data they do understand. That leads at best to inefficiencies, and at worst to missing crucial input.”
“The ability to better understand how data is extracted from our publications, tagged with metadata and can best be utilized across the landscape of potential data consumers will inform what our business model looks like in 25 years,” Hoctor explained. “Now we know that model will undoubtedly include machine learning and artificial intelligence computational capabilities and visualizations, which are all two steps beyond what we provide today. And that’s why real-world initiatives such as hackathons are so important.”
The Pistoia Alliance President’s Series Hackathon
The hackathon was a series of five challenges that fostered collaboration among computer and life scientists to aid drug discovery. The challenges were:
1. Help Findacure accelerate treatment and clinical research for Friedreich’s ataxia using Elsevier’s heterogeneous sets of data related to the disease: biological pathway analysis, associated chemical compounds and bioactivities, potential candidates for drug repurposing, full-text scientific literature and clinical trials data.
2. Propose a novel, deep learning pipeline for a model that can accurately predict the in vitro activity of compounds based on their chemical structure using the ExCAPE consortium's machine learning-ready dataset.
3. Support surgeons treating thoracic aortic aneurysm by reconstructing a 3D model of the aorta from digital slices provided by Promeditec.
4. Train a decision tree model to predict the clinical significance of missense mutations using at least three features from Microsoft’s Azure platform, including the BLOSUM score, minor allele frequency and sequence alignment entropy.
5. Gain insights on the patient experience of a particular disease, such as asthma, using machine learning technologies on the Reddit social media platform.
The winning team
Team “In Too Deep” won first prize for its neural network tackling the ExCAPE dataset and its approach of building a prototype cloud implementation to distribute the neural network and data.
The runners-up
- Team “Find Me A Drug” tackled Elsevier’s datasets to facilitate the preliminary identification of compounds that may warrant further investigation for Friedreich’s ataxia.
- Team “Tuna Melt,” also taking on the ExCAPE challenge, used an autoencoder to embed data into a neural network and then implemented a multitask learning approach to tackle the unbalanced labels.
- “The CT Guys” created an initial prototype model aorta for a 3D reconstruction.
- Team “Cambridge Stars” built a preliminary mobile app to help people searching for asthma advice.
- The “Mutation Team” deployed a pathological genetic mutation classifier.