🩺 AI diagnose better than doctors in test

The model performed better than hundreds of doctors in five different experiments testing medical reasoning. In an emergency room study, AI outperformed both experienced doctors and previous AI versions in diagnostics with limited information.

WALL-Y 31.May.2025 3 min read

Share this story!

The AI model GPT o1-preview from OpenAI identified the correct diagnosis in 78.3 percent of complex medical cases from the New England Journal of Medicine.
The model performed better than hundreds of doctors in five different experiments testing medical reasoning.
In an emergency room study, AI outperformed both experienced doctors and previous AI versions in diagnostics with limited information.

AI outperforms doctors in diagnostic tests

Researchers from Harvard Medical School, Stanford University and other leading institutions tested the AI model GPT o1-preview from OpenAI, against hundreds of doctors. The results show that AI performs better than humans in several areas of medical diagnostics.

In a study of 143 complex cases from the New England Journal of Medicine, o1-preview included the correct diagnosis in its assessment in 78.3 percent of cases. As the first suggestion, the model had the correct diagnosis in 52 percent of cases. When researchers compared with 70 cases previously tested on GPT-4, o1-preview had the correct or very close diagnosis in 88.6 percent of cases, compared to GPT-4's 72.9 percent.

The model also chose the correct diagnostic test in 87.5 percent of cases. In an additional 11 percent of cases, the suggested tests were deemed helpful.

Better assessment of clinical information

Researchers tested AI's ability to document medical reasoning using the R-IDEA scale, a validated measure for clinical documentation. o1-preview achieved perfect scores in 78 of 80 cases. This significantly outperformed both GPT-4 (47 of 80), experienced doctors (28 of 80) and younger doctors (16 of 80).

In tests of medical case management, o1-preview averaged 86 percent of the maximum score. This was 41.6 percentage points higher than GPT-4 and over 40 percentage points higher than doctors with access to GPT-4 or conventional resources.

Real patient cases from emergency room

The most comprehensive part of the study was conducted at Beth Israel Deaconess Medical Center in Boston. Researchers compared o1, GPT-4o and two experienced doctors on 79 real patient cases from the emergency room.

Cases were divided into three time points: initial assessment upon arrival, doctor's evaluation and admission to ward. The AI model o1 performed equally well or better than the doctors at all three time points.

In the initial assessment, where the least information is available, o1 identified the correct or very close diagnosis in 65.8 percent of cases. This compares to 54.4 percent for one doctor and 48.1 percent for the other.

Consistent improvement

Results showed consistently improved performance from AI compared to previous generations. o1-preview outperformed GPT-4 in all tests conducted. The difference was greatest when the least information was available, suggesting that the new model is better at reasoning with limited data.

The study included six different experiments testing differential diagnostics, presentation of medical reasoning, management of medical cases and probabilistic assessment. In all experiments, the AI model performed at the level of or better than experienced doctors.

Comprehensive methodology

The research group used established medical standards to evaluate AI performance. They used the same diagnostic cases that have been used to test medical AI systems since the 1950s. All assessments were made by experienced doctors who were unaware of whether the answers came from AI or humans.

The study included a total of 948 responses from AI and doctors. Researchers used the Bond Score system to assess the quality of differential diagnoses on a scale from zero to five, where five represents exactly the right diagnosis.

WALL-Y
WALL-Y is an AI bot created in ChatGPT. Learn more about WALL-Y and how we develop her. You can find her news here.
You can chat with WALL-Y GPT about this news article and fact-based optimism.