GPT-4 performs close to the level of expert doctors in eye assessments

As large language models (LLMs) continue to advance, so do questions about how they can benefit society in areas such as medicine. A recent study from the University of Cambridge School of Clinical Medicine found that OpenAI’s GPT-4 performed almost as well as experts in the field on an ophthalmology assessment, the Financial Times first reported.

In the study, published in PLOS Digital Health, the researchers tested the LLM, its predecessor GPT-3.5, Google’s PaLM 2 and Meta’s LLaMA on 87 multiple-choice questions. Five expert ophthalmologists, three trainee ophthalmologists and two non-specialist junior doctors took the same mock examination. The questions came from a textbook used to test trainees on everything from light sensitivity to lesions. Its content is not publicly available, so the researchers believe it is unlikely the LLMs had previously been trained on it. ChatGPT, equipped with GPT-4 or GPT-3.5, was given three chances to answer definitively; otherwise, its response was scored as zero.
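
For illustration, here is a minimal Python sketch of that scoring rule as described above: each model gets up to three attempts to return a definitive answer, and the item scores zero otherwise. The function names, the single-letter answer format and the `ask_model` callback are assumptions for the sake of the example, not details taken from the paper.

```python
import re

def extract_choice(response: str) -> str | None:
    """Return a single option letter (A-D) if the response is definitive, else None."""
    match = re.fullmatch(r"([A-D])\.?", response.strip(), flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

def score_question(ask_model, question: str, correct: str, max_attempts: int = 3) -> int:
    """Score one multiple-choice item: 1 if a definitive answer matches the key, else 0."""
    for _ in range(max_attempts):
        choice = extract_choice(ask_model(question))
        if choice is not None:
            return int(choice == correct)
    return 0  # no definitive answer within the allowed attempts
```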

GPT-4 performed better than the trainees and junior doctors, answering 60 of the 87 questions correctly. While that figure is well above the junior doctors’ average of 37 correct answers, it only narrowly beats the trainees’ average of 59.7. Although one expert ophthalmologist answered just 56 questions correctly, the five experts averaged 66.4 correct answers, beating the machine. PaLM 2 scored 49 and GPT-3.5 scored 42. LLaMA scored the lowest at 28, falling below the junior doctors. Notably, these trials took place in mid-2023.
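
For reference, a small sketch that tallies the reported scores as percentages of the 87-question exam; the labels and grouping simply paraphrase the figures above.

```python
# Reported scores (out of 87) from the PLOS Digital Health study, as summarized above.
scores = {
    "Expert ophthalmologists (mean of 5)": 66.4,
    "GPT-4": 60,
    "Trainee ophthalmologists (mean of 3)": 59.7,
    "PaLM 2": 49,
    "GPT-3.5": 42,
    "Junior doctors (mean of 2)": 37,
    "LLaMA": 28,
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score}/87 ({score / 87:.0%})")
```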

While these findings point to potential benefits, they also come with plenty of risks and concerns. The researchers noted that the study included a limited number of questions, particularly in certain categories, meaning actual performance could vary. LLMs also tend to “hallucinate,” or make things up. It’s one thing to fabricate an irrelevant fact, but claiming a patient has a cataract or cancer is another story. And, as with many LLM use cases, the systems lack nuance, creating further opportunities for inaccuracy.
