Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study

Arun James Thirunavukarasu*, Shathar Mahmood, Andrew Malem, William Paul Foster, Rohan Sanghera, Refaat Hassan, Sean Zhou, Shiao Wei Wong, Yee Ling Wong, Yu Jeat Chong, Abdullah Shakeel, Yin-Hsi Chang, Benjamin Kye Jyn Tan, Nikhil Jain, Ting Fang Tan, Saaeha Rauz, Daniel Shu Wei Ting, Darren Shu Jeng Ting*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs) underlie remarkable recent advances in natural language processing, and they are beginning to be applied in clinical contexts. We aimed to evaluate the clinical potential of state-of-the-art LLMs in ophthalmology using a more robust benchmark than raw examination scores. We trialled GPT-3.5 and GPT-4 on 347 ophthalmology questions before GPT-3.5, GPT-4, PaLM 2, LLaMA, expert ophthalmologists, and doctors in training were trialled on a mock examination of 87 questions. Performance was analysed with respect to question subject and type (first order recall and higher order reasoning). Masked ophthalmologists graded the accuracy, relevance, and overall preference of GPT-3.5 and GPT-4 responses to the same questions. The performance of GPT-4 (69%) was superior to that of GPT-3.5 (48%), LLaMA (32%), and PaLM 2 (56%). GPT-4 compared favourably with expert ophthalmologists (median 76%, range 64–90%), ophthalmology trainees (median 59%, range 57–63%), and unspecialised junior doctors (median 43%, range 41–44%). Low agreement between LLMs and doctors reflected idiosyncratic differences in knowledge and reasoning, with overall consistency across subjects and types (p > 0.05). All ophthalmologists preferred GPT-4 responses over GPT-3.5 and rated the accuracy and relevance of GPT-4 as higher (p < 0.05). LLMs are approaching expert-level knowledge and reasoning skills in ophthalmology. In view of their comparable or superior performance to trainee-grade ophthalmologists and unspecialised junior doctors, state-of-the-art LLMs such as GPT-4 may provide useful medical advice and assistance where access to expert ophthalmologists is limited. Clinical benchmarks provide useful assays of LLM capabilities in healthcare before clinical trials can be designed and conducted.
Original language: English
Article number: e0000341
Number of pages: 16
Journal: PLOS Digital Health
Volume: 3
Issue number: 4
Publication status: Published - 17 Apr 2024

Bibliographical note

Funding:
DSWT is supported by the National Medical Research Council, Singapore (NMRC/HSRG/0087/2018; MOH-000655-00; MOH-001014-00), Duke-NUS Medical School (Duke-NUS/RSF/2021/0018; 05/FY2020/EX/15-A58), and the Agency for Science, Technology and Research (A20H4g2141; H20C6a0032). DSJT is supported by a Medical Research Council/Fight for Sight Clinical Research Fellowship (MR/T001674/1). These funders were not involved in the conception, execution, or reporting of this study.
