Are LLMs Enough?

October 2025

Abstract

Are LLMs enough? Reflections from a panel on AI in life sciences

As part of the BioTechX EU 2025 conference, on Monday, 6 October 2025, our CTO, Artur Nowak, joined a panel discussion titled “Are LLMs enough?” in the Large Language Models (Theatre 5) track.

The session brought together leaders from across the life sciences and analytics ecosystem:

  • Lee Clewley, VP, AI & Informatics, Tangram Therapeutics
  • Abhishek Choudhary, Principal Data Engineer, Bayer AG
  • Anette Dalsgaard Jakobsen, Sr Business Solutions Manager, Global Health and Life Sciences Customer Advisory, SAS Software Limited
  • Calvin Chan, Product Manager, Medical Affairs Analytics, Roche

The session was moderated by Ioannis Spyroglou, Associate Director, Data Science, MRL QA Analytics & Insights, MSD.

Together, they explored what it really means to say that large language models are, or are not, “enough” for high-stakes scientific and medical work.

Enough for what? LLMs as components rather than silver bullets

Artur opened his remarks by reframing the seemingly simple question in the panel title: Enough for what?

On open-ended, general intelligence capable of handling any task, his view is cautious: LLMs used alone as all-purpose problem solvers are unlikely to satisfy the expectations many people attach to Artificial General Intelligence.

Where they shine is a very different role: as one component in a carefully designed system dedicated to a concrete problem. Artur highlighted architectures that combine LLMs with the following building blocks (a minimal sketch follows the list):

  • Retrieval-augmented generation (RAG) grounded in curated knowledge
  • Independent verifiers that can challenge or confirm model outputs
  • Agent-like workflows that break problems into smaller steps
  • Human experts in the loop at the right decision points
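
To make this composition concrete, here is a minimal, hypothetical sketch of such a system in Python. All names (retrieve, generate, verify, human_review) are illustrative stubs rather than a description of any specific product; a real system would use embedding-based retrieval and an actual model call.

```python
from dataclasses import dataclass

# Hypothetical sketch: the LLM is one component, flanked by retrieval,
# an independent verifier, and a human review gate.

@dataclass
class Answer:
    text: str
    sources: list[str]
    verified: bool = False

def retrieve(question: str, knowledge_base: dict[str, str]) -> list[str]:
    """Stand-in for RAG retrieval: return IDs of documents that share
    terms with the question (real systems use embeddings)."""
    terms = set(question.lower().split())
    return [doc_id for doc_id, text in knowledge_base.items()
            if terms & set(text.lower().split())]

def generate(question: str, sources: list[str]) -> Answer:
    """Stand-in for the LLM call, constrained to retrieved sources."""
    return Answer(text=f"Draft answer to {question!r}", sources=sources)

def verify(answer: Answer, knowledge_base: dict[str, str]) -> Answer:
    """Independent check: reject drafts that cite unknown sources."""
    answer.verified = all(s in knowledge_base for s in answer.sources)
    return answer

def human_review(answer: Answer) -> Answer:
    """Decision point for a domain expert; auto-approve stub here."""
    print(f"Review requested: {answer.text} (sources: {answer.sources})")
    return answer

kb = {"doc1": "systematic review methods", "doc2": "trial eligibility criteria"}
question = "How are systematic review methods applied?"
draft = generate(question, retrieve(question, kb))
final = human_review(verify(draft, kb))
```

The stubs matter less than the shape: the model call sits between retrieval and two independent checks, so no unverified output reaches a decision point.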

In this context, he pointed to evidence synthesis and Laser AI as a prime example. When a model is constrained to high-quality evidence, supported by robust retrieval and validation, and reviewed by domain experts, it becomes a powerful engine for accelerating systematic review and similar workflows. In such systems, LLMs look far closer to “enough” for the job they were actually designed to do.

Before judging LLMs, know your human baseline

Another key theme was the question of baseline performance.

We often ask whether LLMs are good enough without a clear picture of how well humans are doing on the same task. In many areas of scientific and regulatory practice, human performance is:

  • Highly variable
  • Poorly measured
  • Difficult to reproduce at scale

If we do not know the true quality and consistency of current human processes, it becomes very hard to say whether an LLM-enabled system is falling short, matching, or even exceeding the status quo.

Artur argued that better benchmarking of human workflows is not a nice-to-have. It is a prerequisite for serious conversations about the readiness of AI tools in science and medicine.
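
One way to make the human baseline measurable is to compute chance-corrected agreement (Cohen's kappa) between two human reviewers before comparing a model against either of them. The sketch below uses invented screening labels purely for illustration; it is not data from the panel.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented include/exclude decisions on the same 10 abstracts.
reviewer_1 = ["inc", "exc", "inc", "inc", "exc", "exc", "inc", "exc", "inc", "exc"]
reviewer_2 = ["inc", "exc", "exc", "inc", "exc", "inc", "inc", "exc", "inc", "exc"]
model      = ["inc", "exc", "inc", "inc", "exc", "exc", "inc", "inc", "inc", "exc"]

print(f"human vs human kappa: {cohens_kappa(reviewer_1, reviewer_2):.2f}")  # 0.60
print(f"model vs human kappa: {cohens_kappa(reviewer_1, model):.2f}")       # 0.80
```

In this invented example the model agrees with reviewer 1 more closely than the two reviewers agree with each other, which is exactly the situation where declaring the model "not good enough" without a human baseline would be misleading.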

Maybe they are enough, but perhaps too much

The panel also touched on an important paradox. Even if LLM-based systems become “good enough” in terms of accuracy and utility, they may be “too much” in terms of the risk surface.

Artur focused on three areas where LLMs can actually help raise the bar, provided they are used correctly:

  1. Security. Well-designed systems can keep sensitive data local, restrict model access, and log all interactions, helping organizations understand and control how information flows through AI tools.
  2. Compliance. Automated capture of prompts, retrieved sources, and outputs can support audit trails that are often missing from manual processes (a minimal logging sketch follows this list). This is particularly important in regulated industries.
  3. Scientific integrity. By insisting that models show their work through linked evidence and structured reasoning, teams can enforce higher transparency than many traditional narrative reports.
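
As a rough sketch of the compliance point only, the snippet below appends each interaction (prompt, retrieved sources, output) to a JSON Lines audit log with a content hash. The function names and file path are hypothetical, not an existing API.

```python
import hashlib
import json
import time

def audit_record(prompt: str, sources: list[str], output: str) -> dict:
    """Build one audit entry: who asked what, which evidence was
    retrieved, and what the model returned."""
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "sources": sources,
        "output": output,
    }
    # A content hash makes post-hoc tampering detectable.
    record["sha256"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

def log_interaction(path: str, record: dict) -> None:
    """Append as JSON Lines so the trail is easy to replay and audit."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

entry = audit_record(
    prompt="Summarize eligibility criteria for trial X",
    sources=["doc2"],
    output="Draft summary ...",
)
log_interaction("llm_audit.jsonl", entry)
```

Hashing each record is a simple design choice that makes later tampering detectable without any extra infrastructure.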

In these areas, LLMs do more than simply avoid making things worse. When implemented thoughtfully, they can actively improve governance and integrity compared to opaque, human-only workflows.

Hard questions that shaped the discussion

Other questions raised during the debate included:

  • If today’s model architectures were frozen, what extra ingredients would you need before trusting them with open-ended scientific reasoning? Artur pointed to tools, structured world models, persistent memory, and external verification as likely prerequisites before we ask models to handle complex, high-impact scientific questions without tight constraints.
  • What are the top worries regulators raise about AI? Concerns such as lack of traceability, unclear accountability, and difficulties in validating complex systems repeatedly came up. Addressing these is central if we want AI to support, rather than disrupt, regulatory trust.
  • How to design end-to-end systems where data pipelines, orchestration layers, metadata management, and evaluation frameworks are as important as the model itself
  • The tension between improving raw model reasoning and ensuring traceability and provenance when decisions affect patients and major investments
  • Whether the future lies in a single, ever-larger model or in ecosystems of specialized models and agents collaborating on complex tasks
  • The cultural and organizational changes needed so that GenAI can create synergy across business areas rather than staying trapped in silos
  • Why the data science mindset and solution design matter more than the hype cycle around the latest model release

Taken together, the conversation painted a picture of LLMs not as a magic wand, but as a versatile new component in scientific tooling that must be embedded in robust systems, teams, and governance structures.

Our takeaway

From Evidence Prime’s point of view, the answer to “Are LLMs enough?” is:

  • No, as standalone general problem solvers
  • Increasingly yes, as carefully governed components in systems designed for specific, well-understood problems

This is exactly how we approach AI for evidence synthesis and related workflows. If you are interested in how these ideas translate into concrete products and implementations, please reach out to us!
