In collaboration with the QTIM Lab at the Athinoula A. Martinos Center/Harvard Medical School, our team conducted a comprehensive study comparing the capabilities of commercial and open-source large language models (LLMs) for annotating chest radiograph reports. We evaluated OpenAI's GPT-4 and GPT-3.5 Turbo against the open-source models Llama2-70B, Mixtral-8x7B, and Qwen1.5-72B on two independent datasets totaling 950 reports. The study used both zero-shot and few-shot prompting techniques to assess the models' ability to accurately extract relevant findings from the reports.
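To make the two prompting setups concrete, the sketch below shows how zero-shot and few-shot labeling prompts for a chest radiograph report might be assembled. The finding list, the in-context example, and the prompt wording are illustrative assumptions for this post, not the exact prompts used in the study.

```python
# Illustrative sketch of zero-shot vs. few-shot prompt construction for
# chest radiograph report labeling. The findings, example report, and
# wording below are hypothetical, not the study's actual prompts.

FINDINGS = ["cardiomegaly", "pleural effusion", "pneumothorax", "consolidation"]

INSTRUCTION = (
    "For each finding in [" + ", ".join(FINDINGS) + "], answer 1 if it is "
    "present in the report and 0 if it is absent. Respond as JSON."
)

# A single hand-labeled (report, labels) pair serving as the in-context example.
FEW_SHOT_EXAMPLES = [
    (
        "Heart size is enlarged. No effusion or pneumothorax. Lungs clear.",
        '{"cardiomegaly": 1, "pleural effusion": 0, '
        '"pneumothorax": 0, "consolidation": 0}',
    ),
]


def zero_shot_prompt(report: str) -> str:
    """Instruction plus the target report only; no worked examples."""
    return f"{INSTRUCTION}\n\nReport: {report}\nLabels:"


def few_shot_prompt(report: str) -> str:
    """Prepend labeled examples so the model can imitate the output format."""
    shots = "\n\n".join(
        f"Report: {text}\nLabels: {labels}" for text, labels in FEW_SHOT_EXAMPLES
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nReport: {report}\nLabels:"


if __name__ == "__main__":
    report = "Stable cardiomegaly. Small left pleural effusion."
    print(zero_shot_prompt(report))
    print("---")
    print(few_shot_prompt(report))
```

Either prompt string can then be sent to a commercial API or a locally hosted open-source model; the few-shot variant differs only in the labeled examples prepended before the target report.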
Our results, published in Radiology (https://doi.org/10.1148/radiol.241139), indicate that while GPT-4 demonstrated superior performance in zero-shot report labeling, the open-source models closely matched its accuracy in few-shot scenarios. This suggests that open-source LLMs can serve as a viable, cost-effective alternative to proprietary models while offering advantages in privacy, consistency, and reproducibility. These findings highlight the potential of open-source LLMs to improve clinical research and practice, particularly in structuring unstructured medical data.
