Capability of large language models in assisting GPs with diagnoses

Authors: Wang, R., Rehman, A., Li, T., Page, R., Li, H., Wang, X., Yang, X., Zhang, J.J.

Journal: Applied Intelligence

Publication Date: 01/04/2026

Volume: 56

Issue: 5

eISSN: 1573-7497

ISSN: 0924-669X

DOI: 10.1007/s10489-025-06827-1

Abstract:

Purpose: To explore a decision-support pathway for general practitioners (GPs) through automated referral-letter analysis, comprehensively evaluating the diagnostic roles of large language models (LLMs).

Methods: The in-context learning performance of ChatGPT and GPT-4 for diagnostic decision support was evaluated on referral letters. To address data scarcity, synthetic referral letters were generated with ChatGPT, and their distributional congruence with real letters was quantified via Kullback-Leibler (KL) divergence. Two fine-tuning frameworks were comparatively assessed: encoder-based pre-trained language models (PLMs) for diagnostic classification, and decoder-based LLMs adapted to a multiple-choice question-answering paradigm.

Results: GPT-4 showed suboptimal few-shot accuracy (0.544). Synthetic letters demonstrated high fidelity (KL divergence < 0.05). When fine-tuned with augmented data, encoder-based PLMs consistently outperformed decoder-based LLMs, with BERT achieving 0.977 accuracy in mixed-train-collect-test protocols; a complementary F1 score of 0.9707 confirmed negligible diagnostic bias.

Conclusion: LLMs exhibited insufficient diagnostic accuracy under both direct use (GPT-4 few-shot: 0.544) and fine-tuning (accuracy 0.723), establishing fundamental limitations for clinical deployment. Crucially, their text-generation capability was leveraged for structured data augmentation, producing synthetic referral letters with high distributional fidelity (KL divergence < 0.05). This validated methodology enabled superior diagnostic performance through encoder-based PLM fine-tuning, where BERT achieved near-clinical-utility accuracy (0.977), a 25.4% relative improvement over the best-performing LLMs. Implementation pathways consequently prioritize this hybrid framework: LLM-mediated data augmentation followed by resource-efficient PLM classifiers, currently undergoing neurologist-piloted validation before multicenter expansion.
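The abstract reports that the distributional congruence between real and synthetic referral letters was quantified via Kullback-Leibler divergence. The paper's actual feature representation is not described here, so the following is only a minimal sketch, assuming a unigram (word-frequency) distribution over each corpus with simple floor smoothing; the corpus snippets and the `eps` smoothing constant are illustrative choices, not the authors' method.

```python
from collections import Counter
import math


def kl_divergence(p_texts, q_texts, eps=1e-9):
    """Estimate KL(P || Q) between word-frequency distributions of two corpora.

    p_texts: reference corpus (e.g. real referral letters)
    q_texts: comparison corpus (e.g. synthetic referral letters)
    eps: floor probability so that unseen words do not produce log(0)
    """
    p_counts = Counter(w for t in p_texts for w in t.lower().split())
    q_counts = Counter(w for t in q_texts for w in t.lower().split())
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kl = 0.0
    for word in set(p_counts) | set(q_counts):
        p = p_counts[word] / p_total
        q = max(q_counts[word] / q_total, eps)
        if p > 0:  # terms with p == 0 contribute nothing to KL(P || Q)
            kl += p * math.log(p / q)
    return kl


# Toy example: two small "corpora" differing in one symptom term.
real = ["patient reports persistent headache and nausea"]
synthetic = ["patient reports persistent headache and dizziness"]
print(kl_divergence(real, synthetic))
```

Identical distributions give a divergence of exactly 0, and the abstract's reported threshold (KL divergence < 0.05) would correspond to near-identical word statistics under a representation like this one.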

Source: Scopus; Web of Science (the Web of Science record lists the publication date as 12/03/2026)