ChatGPT is mediocre at diagnosing medical conditions, with just 49% accuracy, according to a new study. The researchers say their findings show that AI should not be the sole source of medical knowledge and highlight the importance of preserving the human element in healthcare.
The ease of access to online technology means some people are skipping out on seeing a medical professional and choosing to Google their symptoms instead. While being proactive about one’s health is not a bad thing, ‘Dr. Google’ isn’t all that accurate. A 2020 Australian study examining 36 international mobile and web-based symptom checkers found that an accurate diagnosis was listed first only 36% of the time.
AI has certainly improved since 2020. OpenAI’s ChatGPT has made great strides; it can pass the US Medical Licensing Exam, after all. But does that make it better than Dr. Google in terms of diagnostic accuracy? That’s the question researchers at Western University in Canada set out to answer in a new study.
Using ChatGPT 3.5, a large language model (LLM) trained on a massive dataset of more than 400 billion words from across the Internet, drawn from sources such as books, articles, and websites, the researchers conducted a qualitative analysis of the medical information provided by the chatbot in response to the Medscape Case Challenges.
Medscape Case Challenges are complex clinical cases that test a medical professional’s knowledge and diagnostic skills. For each case, the professional must make a diagnosis or choose an appropriate treatment plan from four multiple-choice answers. The researchers selected Medscape’s Case Challenges because they are open source and freely available. To rule out the possibility of ChatGPT having prior knowledge of the cases, only those written after model 3.5’s August 2021 training cutoff were included.
A total of 150 Medscape cases were analyzed. With four multiple-choice answers per case, there were 600 possible answers in total, with only one correct answer per case. The analyzed cases covered a wide range of medical conditions, with titles such as “Beer, Aspirin Worsen Nasal Problems in 35-Year-Old Asthmatic,” “Gastro Case Challenge: 33-Year-Old Man Who Can’t Swallow His Own Saliva,” “27-Year-Old Woman With Constant Headaches and Too Tired from Partying,” “Pediatric Case Challenge: 7-Year-Old Boy With Limping and Obesity Who Fell on the Street,” and “Aerobics-Loving Accountant With Hiccups and Incoordination.” Cases with visual assets, such as clinical images, medical photographs, and charts, were excluded.
To ensure consistency in the input provided to ChatGPT, each case challenge was compiled into a single standardized prompt, including instructions on the format the chatbot’s output should take. All cases were evaluated by at least two independent raters, medical trainees who were blind to each other’s responses. They evaluated ChatGPT’s responses on the basis of diagnostic accuracy, cognitive load (i.e., the complexity and clarity of the information provided, from low to high), and the quality of medical information (including whether it was complete and relevant).
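The article doesn’t reproduce the prompt itself, but as a rough illustration of what a single standardized prompt with a scripted output format might look like, here is a hypothetical sketch; the wording and structure below are entirely illustrative, not taken from the study:

```python
# Purely illustrative sketch of a standardized case-challenge prompt.
# The template wording is hypothetical and not quoted from the study.
PROMPT_TEMPLATE = """You are assisting with a clinical case challenge.

Case description:
{case_description}

Answer choices:
A) {choice_a}
B) {choice_b}
C) {choice_c}
D) {choice_d}

Respond in exactly this format:
Diagnosis: <the single best answer, A-D>
Rationale: <a brief explanation of the reasoning>
Excluded options: <why each remaining option was rejected>
"""

def build_prompt(case_description: str, choices: list[str]) -> str:
    """Fill the template so every case is presented to the chatbot identically."""
    a, b, c, d = choices
    return PROMPT_TEMPLATE.format(
        case_description=case_description,
        choice_a=a, choice_b=b, choice_c=c, choice_d=d,
    )
```

Fixing the prompt structure in this way means any differences in the chatbot’s answers come from the cases themselves rather than from how each case happened to be phrased.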
Of the 150 Medscape cases analyzed, ChatGPT selected the correct answer in 49% of cases. Judged across all of the individual answer options, however, the chatbot achieved 74% overall accuracy, largely reflecting its ability to identify and reject incorrect multiple-choice options.
“This higher value is due to ChatGPT’s ability to identify true negatives (wrong choices), which contributes significantly to overall accuracy, increasing its usefulness in eliminating wrong choices,” the researchers explain. “This difference highlights ChatGPT’s high specificity, demonstrating its superior ability to eliminate misdiagnoses. However, it needs improvement in sensitivity and precision to reliably identify the correct diagnosis.”
Additionally, ChatGPT yielded false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool. Just over half of the answers (52%) were complete and relevant, while 43% were incomplete but still relevant. ChatGPT tended to produce answers with low (51%) to medium (41%) cognitive load, making them understandable to users. However, this comprehensibility, combined with the potential for incorrect or irrelevant information, could lead to “misunderstandings and a false sense of understanding,” especially if ChatGPT is used as a medical education tool, the researchers note.
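The article doesn’t spell out exactly how the study arrived at these figures, but a simple back-of-the-envelope sketch, assuming each of the 600 answer options is scored as a separate yes/no decision, comes close to reproducing the reported numbers (purely illustrative, not the paper’s actual analysis):

```python
# Rough reconstruction of the reported metrics, assuming every one of the
# 600 answer options (150 cases x 4 choices) is scored as a separate binary
# decision: "this option is the correct diagnosis" or not.
# These assumptions are ours; the study's exact scoring may differ.

cases = 150
options_per_case = 4
correct_cases = 73                    # roughly 49% of the 150 cases
wrong_cases = cases - correct_cases   # the remaining ~51%

# A correctly answered case yields 1 true positive and 3 true negatives.
# An incorrectly answered case yields 1 false positive (the chosen wrong
# option), 1 false negative (the missed right option), and 2 true negatives.
tp = correct_cases
tn = correct_cases * 3 + wrong_cases * 2
fp = wrong_cases
fn = wrong_cases

total = cases * options_per_case      # 600 option-level decisions

overall_accuracy = (tp + tn) / total  # ~74% under these assumptions
sensitivity = tp / (tp + fn)          # ~49%: picking out the right diagnosis
specificity = tn / (tn + fp)          # ~83%: rejecting the wrong ones
false_positive_rate = fp / total      # ~13% under these assumptions
false_negative_rate = fn / total      # ~13%

print(f"overall accuracy: {overall_accuracy:.0%}")
print(f"sensitivity: {sensitivity:.0%}, specificity: {specificity:.0%}")
print(f"false positives: {false_positive_rate:.0%}, "
      f"false negatives: {false_negative_rate:.0%}")
```

Counted this way, the many correctly rejected wrong options (true negatives) pull the overall accuracy well above the 49% rate at which ChatGPT actually landed on the right diagnosis, which is exactly the gap the researchers highlight.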
“ChatGPT also had difficulty distinguishing between diseases with different presentations, and the model occasionally produced incorrect or illogical information, known as AI hallucinations. This highlights the risk of relying solely on ChatGPT for medical guidance and the need for human expertise in the diagnostic process,” the researchers wrote.
Of course, as the researchers note among the study’s limitations, ChatGPT 3.5 is a single AI model that may not be representative of other models, and future iterations will likely improve, which could increase its accuracy. Also, the Medscape cases analyzed by ChatGPT focused primarily on differential diagnosis, where medical professionals need to distinguish between two or more conditions with similar signs or symptoms.
Although future research should evaluate the accuracy of different AI models using a wider range of case sources, the results of the current study are nevertheless instructive.
“The combination of high relevance and relatively low accuracy argues against relying on ChatGPT for medical advice, as it can present important information that may be misleading,” the researchers wrote. “While our results show that ChatGPT consistently delivers the same information to different users and demonstrates substantial inter-rater reliability, they also reveal the tool’s shortcomings in providing factually accurate medical information, which is evident in its low diagnostic accuracy.”
The study was published in the journal PLOS One.