LLMs as a Jury: Bringing Quality to Quantity in GenAI-Aided R&D
July 18, 2025 | 15 min read
By Ann-Marie Roche

Image credit: JasonDoiy via Getty Images
As scientific workflows accelerate with AI, one truth remains constant: quality data and human expertise are irreplaceable.
For researchers sifting through millions of documents to find genuinely relevant insights, every tool matters—machine, human or otherwise. Generative AI technologies have revolutionized how we formulate queries in natural language and compile insights from vast datasets. But the ultimate question remains: Can you trust the results?
Ensuring the amazingly right over the amazingly wrong
“Everyone's aware of AI these days. But these models are not silver bullets or foolproof,” said Christian Druckenbrodt, manager of data informatics and quality standards at Elsevier. “You must be especially careful in areas related to science or public health. AI can do amazing things but also deliver surprisingly wrong results. It's our job to ensure that doesn't happen.”
Recently, Druckenbrodt and his team developed an elegant solution to this challenge: rather than trusting a single generative AI to evaluate another AI system’s output, they implemented a “jury” of multiple AIs, with humans serving as tiebreakers when disagreements arise.
Just as human juries work better than individual judges, multiple large language models (LLMs) evaluating together produce better results than a single model. While humans might initially disagree, they typically reach consensus with enough discussion—though this takes time, effort and money. LLMs work similarly: two models that agree on an evaluation are more reliable than one alone. The key advantage is that LLM “juries” scale effectively and can process much more content, much faster, than human evaluators alone.
The results speak for themselves. The new system has reduced the workload of human subject matter experts by over 80% for search evaluations, while maintaining rigorous quality standards across Elsevier tools such as the chemical database Reaxys and the medical research database Embase.
A career built on quality data
After earning his Ph.D. in chemistry from Technische Universität Carolo-Wilhelmina Braunschweig, Druckenbrodt joined Elsevier 25 years ago. He began as a subject matter expert focused on content quality and database management. “My interests have always revolved around data quality,” he said.
About a decade ago, he expanded his focus to include AI and machine learning, while still representing the voice of chemists. This work helped consolidate various data repositories into Reaxys, now the world’s largest chemical database, with access to 113 million documents and counting.
“Reaxys opened up the whole chemistry universe,” Druckenbrodt noted. “We made a major breakthrough by combining text extraction with vision models for identifying chemical structures. Now, we’re entering the world of large language models and moving toward agentic AI. Throughout this evolution, we must maintain focus on evaluation and quality assurance.”
Today, Druckenbrodt leads a five-person team that evaluates AI-driven tools across chemistry and biomedical domains. The team collaborates closely with product managers to define quality metrics and determine when solutions are ready for deployment or need further refinement. “There are always small adjustments we can make to improve results,” he said.
"Now, we’re entering the world of large language models and moving toward agentic AI. Throughout this evolution, we must maintain focus on evaluation and quality assurance.”

CD
Christian Druckenbrodt
Elsevier의 Manager of data informatics and quality standards
A whole new industry: LLM evaluations
The world of “LLM evaluations”—automated methods for assessing LLM output quality—is well established in the AI community. For these outputs to be valuable, they must be accurate. That’s why venture capitalists now evaluate AI startups not only on their technology but also on their evaluation infrastructure.
While 80% accuracy might be acceptable in some industries, it falls short in life sciences. That’s why Elsevier backs its LLMs with human subject matter experts—a resource-intensive but necessary approach.
“In early 2024, Elsevier wanted to compare four generative AI models,” Druckenbrodt recalled. “It required 45 subject matter experts working for four weeks. And since these models constantly evolve and new ones regularly emerge, this approach is unsustainable. Our experts’ time and bandwidth are limited, and their regular work becomes impossible when tied up in evaluations for weeks.”
The jury approach: a balance of people and machines
After experimenting with various evaluation methods without success, his team discovered colleagues using LLMs as judges. “We were skeptical about LLMs evaluating other LLMs but decided to experiment. The approach showed real promise,” Druckenbrodt said.
The breakthrough came when they switched from a single judge to a jury system, inspired by a 2024 paper titled “Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models.” “We tested various open-source and proprietary models to find the optimal mix,” Druckenbrodt explained. “GPT-4 and Gemma2 showed relatively good results.”
The jury system asks both models to rate responses on accuracy, relevance and other factors using a 1-5 scale. “When both models agree, they accept the rating. When they disagree or face complex questions, a human steps in to make the final decision,” Druckenbrodt said. “There is also human spot-checking on the agreed cases.”
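In code, that agree-or-escalate rule might look something like the minimal Python sketch below. It is an illustration rather than Elsevier’s implementation: ask_judge is a placeholder for a real LLM call, the criteria are taken from the description above, and exact agreement plus a small random spot-check sample are assumed as the acceptance rule.

```python
import random

# Hypothetical judge call: in practice this would send a rubric prompt to an
# LLM and parse a 1-5 rating from its reply.
def ask_judge(judge: str, question: str, answer: str, criterion: str) -> int:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def jury_verdict(question: str, answer: str,
                 judges: tuple = ("judge_a", "judge_b"),
                 criteria: tuple = ("accuracy", "relevance"),
                 spot_check_rate: float = 0.05) -> dict:
    """Collect 1-5 ratings from each judge; escalate disagreements to a human."""
    verdict = {}
    for criterion in criteria:
        ratings = {j: ask_judge(j, question, answer, criterion) for j in judges}
        agreed = len(set(ratings.values())) == 1
        verdict[criterion] = {
            "ratings": ratings,
            # Agreement: accept the shared rating, but keep a small random
            # sample of agreed cases for human spot-checks.
            # Disagreement: a subject matter expert makes the final call.
            "needs_human": (not agreed) or (random.random() < spot_check_rate),
            "score": next(iter(ratings.values())) if agreed else None,
        }
    return verdict
```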
The system checks at three stages:
The interpretation of the original question
The information retrieved to answer it
The final generated response
Beyond safeguarding against hallucinations, the system also checks for bias and harmful responses at two stages: it flags inappropriate prompts, bias and potential harm, and refuses to respond if issues are found.
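Putting the three checkpoints and the safety gate together, a hedged sketch (reusing the jury_verdict helper from the previous example) could look like the following. The stage names mirror the list above; the assumption that the two safety stages cover the incoming prompt and the generated response is an inference rather than a stated detail, and passes_safety is a hypothetical placeholder, not a named Elsevier component.

```python
STAGES = ("query_interpretation", "retrieved_evidence", "final_response")

def passes_safety(text: str) -> bool:
    """Placeholder for the bias/harm screen; could be a moderation model or rules."""
    raise NotImplementedError

def evaluate_search_result(query: str, artifacts: dict) -> dict:
    # Safety gate 1 (assumed): refuse outright if the prompt is inappropriate.
    if not passes_safety(query):
        return {"refused": True, "reason": "inappropriate prompt"}

    # Jury-score each of the three checkpoints described in the article.
    report = {"refused": False, "stages": {}}
    for stage in STAGES:
        report["stages"][stage] = jury_verdict(query, artifacts[stage])

    # Safety gate 2 (assumed): screen the answer for bias or potential harm.
    if not passes_safety(artifacts["final_response"]):
        return {"refused": True, "reason": "biased or harmful response"}
    return report
```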
Transparency and explainability are enhanced by requiring models to justify their judgments. “These models are now quite good at providing reasoning behind their decisions,” Druckenbrodt said.
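One way to elicit that reasoning is to build it into the judge prompt itself. The template below is a hypothetical example of such a rubric; the wording is an assumption, not a quotation of any Elsevier prompt.

```python
# Hypothetical rubric prompt for ask_judge(); asking for the rationale first
# makes the final rating easier to audit.
JUDGE_PROMPT = """You are evaluating a scientific search assistant's answer.

Question: {question}
Answer: {answer}

Rate the answer's {criterion} on a scale of 1 (poor) to 5 (excellent).
First give your reasoning in two or three sentences, then end with a line
of the form: RATING: <1-5>
"""
```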
Challenges and human oversight
The jury approach has rapidly expanded across Elsevier’s product lineup. “We implemented it for Reaxys R&D search, then moved to Embase, using it for the deep pre-launch evaluation of Embase AI. Next, we established it for PharmaPendium AI. Each new product requires new prompting and validation, but once set, we can operate in high-throughput mode,” Druckenbrodt explained.
Despite its success, the jury system faces challenges. “There’s increased engineering complexity to integrate different models, and prompt alignment is crucial since models may respond differently to the same prompt,” Druckenbrodt acknowledged.
He also noted that, due to the inherent nature of LLMs, the same model can respond differently to identical prompts—a fact he wishes more people understood when demonstrating the system.
As technology advances, so does the jury approach. “Every week or two, we evaluate new models and add better options to our list. This creates a continuous validation cycle,” Druckenbrodt said.
From RAG to agentic AI and beyond
The world is becoming more complex. While recent years have focused on retrieval-augmented generation (RAG) to improve LLM reliability, the toolkit continues to expand. “RAG was a significant step forward,” Druckenbrodt said. “It’s not just the LLM serving as the keeper of knowledge; it taps into other data sources, including knowledge graphs and vector databases. That’s crucial for Elsevier, which prides itself on trustworthy data.”
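As a rough illustration of the RAG pattern (not a description of how Reaxys or Embase work internally), the sketch below grounds the model in retrieved evidence instead of its built-in knowledge; retrieve and generate are placeholder functions.

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the k most relevant passages from a vector store
    or knowledge graph rather than relying on the LLM's parametric memory."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call whichever LLM sits behind the application."""
    raise NotImplementedError

def answer_with_rag(query: str) -> str:
    evidence = retrieve(query)
    prompt = (
        "Answer the question using only the evidence below, and cite the "
        "passages you rely on.\n\n"
        f"Question: {query}\n\nEvidence:\n"
        + "\n".join(f"- {passage}" for passage in evidence)
    )
    return generate(prompt)
```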
The next frontier involves agentic systems and multimodal data. “Incorporating different types of results—text, chemical structures, reactions, targets and other structured data—raises the complexity. We’re evolving our test models to include these components,” Druckenbrodt said.
Democratizing scientific discovery
The impact on search has been profound. “Previously, search was limited to those who knew specialized language. Now, anyone can use natural language. We’re democratizing access and making millions of documents, articles, books and chemical substances available to all researchers,” he said.
However, Druckenbrodt emphasizes that data quality remains paramount. “Data is still king. Whether used in training LLMs, RAG systems or agentic platforms, without trustworthy data, you cannot deliver trustworthy results.”
“Data is still king. Whether used in training LLMs, RAG systems or agentic platforms, without trustworthy data, you cannot deliver trustworthy results.”

Christian Druckenbrodt
Manager of Data Informatics and Quality Standards at Elsevier
Continuing to build trust
“This work is cutting-edge, and we’ve made great progress. It extends beyond data science and life sciences. Whether it’s ClinicalKey AI for clinicians or ScienceDirect AI for peer-reviewed literature, it’s all about helping researchers save time. The jury approach enables us to do this in a structured way while managing quality and effort,” he explained.
As AI continues to reshape scientific research, one thing remains certain: the most powerful systems combine artificial intelligence with human expertise, ensuring that the pursuit of speed never compromises the pursuit of truth.