Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?

September 2025
Damian Stachura, Joanna Konieczna and Artur Nowak

Abstract

Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs can effectively replace larger closed-source models. We are particularly interested in the context of biomedical question answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question-answering capabilities, we use various techniques, including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we employ ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones; in some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.
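To make the snippet-retrieval step concrete, the following is a minimal sketch, not the authors' implementation: given a question embedding and pre-computed snippet embeddings (how these vectors are produced is left out; any embedding model could supply them), it ranks snippets by cosine similarity and keeps the top k. All function and variable names here are illustrative, not taken from the paper's codebase.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_snippets(question_emb, snippet_embs, snippets, k=3):
    # Rank snippets by similarity to the question embedding; keep the k closest.
    scored = sorted(
        zip(snippets, snippet_embs),
        key=lambda pair: cosine_similarity(question_emb, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]
```

The selected snippets would then be placed into the prompt as context, alongside any few-shot examples used for in-context learning.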

Keywords

Biomedical Question Answering, Large Language Models, Zero-Shot Prompting, Few-Shot Prompting, In-Context Learning, GPT-4, Claude, Open-Weight LLM, Ensembling
