ChatGPT is an AI chatbot that uses a deep learning model to recognize patterns and relationships among words in its vast training data, generating human-like responses to the prompts it receives.
However, because there is no source of truth in its training data, ChatGPT can generate responses that are factually incorrect.
“The use of large language models like ChatGPT is exploding and only going to increase,” remarks lead author Rajesh Bhayana.
This new work offers “insight into ChatGPT’s performance in a radiology context, highlighting the incredible potential of large language models, along with the current limitations that make it unreliable.”
ChatGPT recently became the fastest-growing consumer application in history, and its success has prompted major search engines such as Google and Bing to incorporate similar chatbot technology. Physicians and patients alike are already using these chatbots to search for medical information, Dr. Bhayana notes.
Dr. Bhayana and a team of researchers conducted a comprehensive evaluation of ChatGPT’s performance on radiology board exam questions. Specifically, they focused on GPT-3.5, the most widely used version of ChatGPT. The researchers devised a set of 150 multiple-choice questions that mirrored the style, content, and difficulty level of the Canadian Royal College and American Board of Radiology exams.
To gain insights into ChatGPT’s capabilities, the questions were categorized based on the type of thinking required to answer them. The categories included lower-order thinking, which encompassed knowledge recall and basic understanding, and higher-order thinking, which involved applying, analyzing, and synthesizing information. Within the higher-order thinking category, questions were further classified based on specific types such as describing imaging findings, clinical management, calculation and classification, and disease associations.
The evaluation of ChatGPT’s performance took into account its overall performance as well as its performance across different question types and topics. Additionally, the researchers assessed the confidence level of ChatGPT’s responses to gauge the reliability of the generated language.
According to the findings, the GPT-3.5-based ChatGPT answered 69% of questions correctly (104 of 150), just short of the 70% passing grade used by the Royal College in Canada. The model performed well on questions requiring lower-order thinking (84%, 51 of 61) but struggled with questions involving higher-order thinking (60%, 53 of 89). In particular, it had difficulty with higher-order questions involving the description of imaging findings (61%, 28 of 46), calculation and classification (25%, 2 of 8), and application of concepts (30%, 3 of 10). Its weaker performance on higher-order questions was not surprising given its lack of radiology-specific pretraining.
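As a quick sanity check, the accuracy percentages quoted above follow directly from the raw counts reported in the study (a minimal sketch; the category names are paraphrased from the article):

```python
# Verify the GPT-3.5 accuracy figures reported in the study.
# Each entry maps a question category to (correct answers, total questions).
results = {
    "overall": (104, 150),
    "lower-order thinking": (51, 61),
    "higher-order thinking": (53, 89),
    "describing imaging findings": (28, 46),
    "calculation and classification": (2, 8),
    "application of concepts": (3, 10),
}

for category, (correct, total) in results.items():
    pct = round(100 * correct / total)  # percentage, rounded as in the article
    print(f"{category}: {correct}/{total} = {pct}%")
```

Note that the subcategory counts are internally consistent: the lower-order (61) and higher-order (89) totals sum to 150, and the correct answers (51 + 53) sum to the overall 104.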
In March 2023, GPT-4 was released in limited form to paid users, with claims of improved advanced reasoning over GPT-3.5.
In a subsequent study, GPT-4 demonstrated an improved performance by answering 81% (121 out of 150) of the same questions correctly, surpassing both GPT-3.5 and the minimum passing threshold of 70%. GPT-4 exhibited a significant advancement in handling higher-order thinking questions, achieving an accuracy rate of 81%. Specifically, it excelled in questions involving the description of imaging findings, with an accuracy rate of 85%, and questions requiring the application of concepts, where it achieved an accuracy rate of 90%.
The results suggest that GPT-4's claimed improvements in advanced reasoning translate to better performance in radiology. They also point to improved contextual understanding of radiology-specific terminology, including imaging descriptions, which is critical for enabling future downstream applications.
The findings demonstrate “an impressive improvement in performance of ChatGPT in radiology over a short time period, highlighting the growing potential of large language models in this context,” Dr. Bhayana adds.
GPT-4, however, showed no improvement on lower-order thinking questions (80% vs. 84% for GPT-3.5) and answered 12 questions incorrectly that GPT-3.5 had answered correctly, raising concerns about its reliability for information gathering.
“We were initially surprised by ChatGPT’s accurate and confident answers to some challenging radiology questions, but then equally surprised by some very illogical and inaccurate assertions,” Dr. Bhayana adds. “Of course, given how these models work, the inaccurate responses should not be particularly surprising.”
Hallucinations, instances in which ChatGPT generates inaccurate or fabricated statements, are a potentially dangerous limitation. They are less frequent in GPT-4 but still prevent the tool from being widely used in medical education and practice at this time.
Both studies showed that ChatGPT consistently used confident language, even when it was wrong. Dr. Bhayana observes that this is particularly dangerous if the tool is relied on as a sole source of information, especially for novices who may not recognize that a confident response is inaccurate.
“To me, this is its biggest limitation. At present, ChatGPT is best used to spark ideas, help start the medical writing process and in data summarization. If used for quick information recall, it always needs to be fact-checked,” Dr. Bhayana comments.
The findings of the study were published in Radiology.
Image Credit: Stanislav Kogiku/SOPA Images/LightRocket via Getty Images