Worth its weight in gold: getting your legacy data in order
21 March 2024
By Ann-Marie Roche
Daniel Allan/Image Source via Getty Images
Your R&D-driven company is sitting on reams of research data stashed in countless silos and formats — and laced with ever-evolving jargon. Where do you begin when you want to set your data free?
The expression “data is the new gold” resonates with the people of Johnson Matthey, a global company that started as a gold assayer for the Bank of England in 1817. Now focused on sustainable technologies that are catalyzing the net zero transition, “JM” has just achieved a milestone in safeguarding its intellectual property and making it accessible to researchers and algorithms to trigger future innovation.
Using Elsevier’s SciBite, JM is taking control of its “unstructured data problem,” with data science and AI technologies unlocking and interconnecting all this knowledge. We spoke with three digital players at JM about the journey.
Webinar: Foundations for effective AI
Dive deeper into how Johnson Matthey is leveraging data, technology and tooling to pursue sustainable technologies using SciBite. Dr Nathan Barrow, Ed Wright and Owen Jones talk about how they got the foundational elements in place to drive more effective AI-driven outcomes.
Digital power trio
“Getting your data house in order is similar to what they say about growing a tree: the best time to start is 20 years ago,” said Principal Information Analyst Ed Wright. “The second-best time is right now.”
Ed, along with a number of other in-house digital champions, saw the potential, and the long-term value, of organizing the company’s data using the FAIR data principles of Findability, Accessibility, Interoperability and Reusability.
Leading the charge, Dr Nathan Barrow oversees the overall digital transformation of JM’s R&D space, seeking out the most suitable technologies while driving the cultural change required to make the most of them. In other words, he helps employees transition from legacy systems to more modern ones while keeping the data in the older systems accessible.
As Data Science Strategy Lead for one of JM’s teams, Owen Jones oversaw a specific use case. He worked to establish the pipeline that prepared and brought together data sources spanning a dizzying range of time periods, formats and terminologies.
Meanwhile, Ed was the driving force in untangling the complexities of the chemicals industry to build standardized vocabularies and ontologies to organize this disparate data into a unified — and FAIR — whole.
Making data part of the culture
While many were involved, these three represent the core challenges a company is likely to encounter when taking control of its data. And all three regard what they’ve already achieved as a career highlight.
“One of the things I am most proud of is getting the SciBite platform into Catalyst Technologies and raising the awareness that we had an unstructured data problem that we need to tackle,” said Owen.
“We managed to turn this into a thing and change the culture,” Ed added.
And now, with the proofs of concept and first use cases deemed a success, it’s onward and upward to other data deposits and other departments, while bringing the latest AI innovations into the loop.
Cutting-edge for two centuries
JM has been innovating for over 200 years. The company expanded from assaying gold to refining precious metals and beyond, pivoting as new challenges and opportunities arose. Today, it ranks as a global leader in sustainable technologies that help some of the world’s leading energy, chemicals and automotive companies decarbonize, reduce emissions and achieve their sustainability goals. “We’ve proven we’re not afraid to make these kinds of big changes to maintain a more focused strategy,” Nathan said.
“Our expertise in precious metals still underpins our technologies: it’s about making every milligram count,” Ed explained. “And with every application being quite niche, these all require their own development and innovation. And we’re still very tightly linked to this idea of questioning how we can make the most of everything. How can we adjust it, tweak it and keep moving it forward?”
In this way, employees were very open to trying something new if it meant streamlining their research.
However, SciBite still needed to prove its worth.
So much information, so little access
“Chemists have been filling up notebooks — both paper and digital versions — for decades and decades,” noted Nathan. “And since this was highly valuable intellectual property, they were sent off-site to a locked container. This, of course, makes it very difficult to actually go back and find the information that you’re looking for. And when your colleagues retire, it’s almost impossible to actually go back and find the important information they captured so diligently 20 years ago — but that’s still relevant for today.
“To avoid replicating all of that clever work that happened before, my job is to digitalize the chemistry and science that happens at JM. I am bringing in new tools and software so that the data is all captured not by the chemists but automatically by the instruments collecting the data. Then, the chemists can add extra information in terms of context and whether they thought the experiment worked or not.”
But while researchers switched to this new electronic lab notebook, there were still two obsolete electronic lab notebooks with legacy information on them. “These databases represent over 16 years and countless millions worth of research,” Nathan said. “So we needed a way for people to search and find the documents they needed — while going beyond a simple search on title or possibly abstract. Other solutions just didn’t do the job and lacked semantic search capabilities.”
The fallibility of human search engines
Meanwhile, the problem went beyond just two obsolete electronic lab notebook systems. “When it comes down to it, JM has got a huge wealth of knowledge stored in a few individuals,” said Ed. “And the way you work is you go and have a chat with that person. And then they'll point you in the right direction — perhaps towards a certain report in a certain filing cabinet.
“But when the (COVID) lockdown happened, either you could no longer get to that person, or they could not get into their filing cabinet. Suddenly, there was this realization: ‘Hold on, we can't work this way anymore. And, actually, these people are also moving steadily toward retirement. What happens then?’”
When a plan comes together
Happily, it soon became apparent that SciBite had the solution they needed. “I had already encountered SciBite at a conference around 2016, so quite a while ago,” said Ed. “And it’s been a progression since then with the pace picking up with COVID.”
“We could lay the foundation during the lockdown period — when people had more time on their hands. Everything seemed to converge nicely,” said Nathan. “As part of our proof of concept, we could put both notebooks into a single server, and since the information was now FAIR — and accessible for all — people had access to not only their information but more information. So suddenly, there was less barrier of letting go of their old system since they could all access their data.”
“Everyone was enthusiastic from the very first test when they could search for their own obscure terms that only existed in the JM universe,” said Ed. “People were very interested, and we realized this is absolutely the right tool and this gave us the confidence to deploy further.”
A very specific (but universal) use case
“I was actually convinced when I first saw SciBite’s demo video,” said Owen. “It was easy and straightforward to use. And I must say, since we have 1,300 researchers, it was appealing that we could license it for the whole department and not by user.”
“We in the Catalyst Technologies department had very much the same issues as the rest of the company,” said Owen. “Namely, collating legacy documents and figuring out how we can find all these old documents and bits of knowledge? Just do the math: many of those 1,300 scientists have been working here 20-30-40-50 years. That’s a lot of reports. And they are scattered throughout our digital infrastructure.”
“And with the central R&D problem around replacing old electronic lab notebook systems, we quickly realized SciBite could also solve our problems,” said Owen. “Now it’s been rolled out for six months with over 300,000 documents. People are using it and finding stuff they couldn’t find before. And we want to keep adding new data sources. We actually don’t even know what our 100% is. People are still coming and saying, ‘Hey, we have this library over here where we’ve been keeping documents for the past 20 years.’”
Owen is also eyeing the “orange notebooks” of lore — those notebooks that documented all the experiments from the pre-digital age. “As you can imagine, these handwritten notebooks can be a mess, but we’ve already done some extraction experiments, and I am hopeful we can get there.”
DIY ontologies
The project’s biggest challenge was, and remains, building the ontologies — the actual codification of all of JM’s facts. And while the SciBite team helped lay the groundwork for this process, it became a largely in-house effort. “SciBite is quite life-sciences focused, so a lot of the built-in ontologies are not applicable to catalyst technology,” Owen said. “This is something that really kept Ed busy.”
“It’s really part of my larger job as Principal Information Analyst,” Ed noted. “I work in a team that essentially provides intelligence for the company. My role is to see how we can use all of the new data becoming available through government and other open-source resources. I also see how we can use digital tools to better work with more conventional sources such as patents.”
In this case, the intelligence gathering is happening inside the company. “And to move forward, you need to do the standardization; that’s where you get into the ontologies, for which SciBite’s CENtree Ontology Manager is very useful,” Ed explained. “This in turn moves you into the world of knowledge graphs, where you can connect equivalent concepts across different data sources.
“And this is essential for a company like JM. We’re full of jargon. Every department has its own terminology. And we have 200 years of evolving jargon and 200 years of mergers, acquisitions and divestments — all with their own systems and nomenclature. So there’s a lot to sort out.”
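To make the idea of connecting equivalent concepts concrete, here is a minimal sketch of vocabulary-based tagging: several jargon variants are mapped to one canonical concept so that documents using different wording can be found together. It is an illustration only, not how SciBite or CENtree work internally; the vocabulary terms and the names `VOCABULARY`, `build_lookup` and `tag_text` are hypothetical.

```python
import re

# Hypothetical mini-vocabulary: canonical concept -> jargon variants.
# Illustrative terms only; not taken from JM's ontologies.
VOCABULARY = {
    "platinum group metal": ["pgm", "platinum group metals", "platinum-group metal"],
    "steam methane reforming": ["smr", "steam reforming of methane"],
}


def build_lookup(vocabulary):
    """Map every lower-cased term (canonical or variant) to its canonical concept."""
    lookup = {}
    for canonical, synonyms in vocabulary.items():
        for term in [canonical, *synonyms]:
            lookup[term.lower()] = canonical
    return lookup


def tag_text(text, lookup):
    """Return the sorted list of canonical concepts mentioned in the text."""
    found = set()
    lowered = text.lower()
    for term, canonical in lookup.items():
        # Word boundaries stop short terms such as "smr" matching inside longer words.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


LOOKUP = build_lookup(VOCABULARY)
```

Under these assumptions, a report that mentions “PGM loading” and one that mentions “platinum-group metal recovery” would both be tagged with the same concept, which is what lets a search on one term surface the other.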
Happily, once it’s sorted, it’s done.
Short-term drudgery for long-term payback
But how do you avoid this relative drudgery of embedding this metadata — all that data that organizes your data — for the future?
“People who are writing reports today need to think more about how someone reads and uses their report in the future,” said Owen. “How are these readers going to find it? How are they going to reuse your data and your knowledge?”
“That’s the tricky part,” said Nathan. “In our new system, we are asking our scientists to add more metadata and context to their experiments. Once we’ve got a critical mass of information in the system, we can start using that structured data and layering it with AI. And then their lives are going to be a lot faster and easier. But our researchers are not feeling this yet. But we know we’ll get there!”
Onward and upward
In fact, with success, JM can expand on its ambitions. “I’d like to see more documents, reports and data sources, with more parts of JM starting to adopt it,” said Owen. “Technically, it would also be nice to see generative AI put on top to make it even more accessible. Using SciBite as the retrieval piece of a retrieval-augmented generation (RAG) system helps us reuse all our semantic knowledge and document sets with new AI tools. And that’s something we’re planning to do internally.”
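For readers unfamiliar with the pattern, the sketch below shows the general shape of the retrieval step in a RAG system: find the documents most relevant to a question, then pass them to a language model as context. It is a toy under stated assumptions, not JM’s or SciBite’s implementation. Simple word overlap stands in for semantic search, the document IDs and texts in `DOCS` are invented, and no model is actually called.

```python
def tokenize(text):
    """Lower-case the text and split it into a set of words without punctuation."""
    return {w.strip(".,;:!?()").lower() for w in text.split()} - {""}


def retrieve(query, documents, top_k=2):
    """Rank documents by how many words they share with the query (toy scoring)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(body)), doc_id) for doc_id, body in documents.items()]
    # Keep only documents with some overlap; highest score first, ties broken by ID.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


def build_prompt(query, documents, top_k=2):
    """Assemble the text a generative model would receive: retrieved context plus question."""
    ids = retrieve(query, documents, top_k)
    context = "\n".join(f"[{i}] {documents[i]}" for i in ids)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical documents, invented for illustration.
DOCS = {
    "report-001": "Catalyst sintering observed above 900 C in the reformer trial.",
    "report-002": "Quarterly travel budget summary for the sales team.",
    "report-003": "Sintering of the catalyst was reduced by a new support material.",
}
```

In a real deployment the `retrieve` step would be replaced by a semantic search service, which is the role Owen describes for SciBite, while the prompt-assembly step stays much the same.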
“I hope all of the company’s main teams will have their taxonomies and ontologies in the next few years,” said Ed. “And I’d love for us to have a knowledge graph based on the tagging and everything we do in the back end. Then, we can start to put more of these AI approaches over the top end of it and really make all of this unstructured data readily accessible to the latest data science approaches.”
Advice to other legacy companies
“I would say: think big, but start small,” said Nathan. “Have a well-defined small use case that you can show success with, and then move from there. You can’t boil the ocean.”
Ed agreed. “I think it’s about accepting that it’s a journey, and that it’s better not to put it off. Plant that tree now!”
After all, like a gold mine, it won’t dig itself.