by Justin Jackson, Phys.org
Conceptual illustration: Benchmark scores suggest steady model improvement, but stress tests uncover hidden vulnerabilities; newer models may be equally or more brittle despite higher scores. Credit: arXiv (2025). DOI: 10.48550/arxiv.2509.18234
Robust performance under uncertainty, valid reasoning grounded in evidence, and alignment with real clinical need are prerequisites for trust in any health care setting.
Microsoft Research, Health & Life Sciences reports that top-scoring multimodal medical AI systems show brittle behavior under stress tests, including correct guesses without images, answer flips after minor prompt tweaks, and fabricated reasoning that inflates perceptions of readiness.
AI-based medical evaluations face a credibility and feasibility gap rooted in benchmarks that reward pattern matching over medical understanding. While the hope is that such systems will broaden access and lower the cost of care, accuracy in diagnostic evaluations is critical to making that possible.
Previous evaluations have allowed models to associate co-occurring symptoms with diagnoses without interpreting visual or clinical evidence. Systems that appear competent can fail when faced with uncertainty, incomplete data, or shifts in input structure. Each new benchmark cycle produces higher scores, but those scores can conceal fragilities that would be unacceptable in a clinical setting.
In the study, "The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks," posted on the pre-print server arXiv, researchers designed a suite of stress tests to expose shortcut learning and to assess robustness, reasoning fidelity, and modality dependence across widely used medical benchmarks.
Six flagship models were evaluated across six multimodal medical benchmarks, with analyses spanning filtered JAMA items (1,141), filtered NEJM items (743), a clinician-curated NEJM subset requiring visual input (175 items), and a visual-substitution set drawn from NEJM cases (40 items).
Model evaluation covered hundreds of benchmark items drawn from diagnostic and reasoning datasets in a tiered stress-testing protocol that probed modality sensitivity, shortcut dependence, and reasoning fidelity. Image inputs were removed on multimodal questions to quantify text-only accuracy relative to image+text.
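As a rough illustration of that modality ablation, a minimal sketch follows. It is not the authors' code: query_model is a hypothetical stand-in for whatever model API is used, and the item fields (question, options, image, answer) are assumed for illustration.

# Minimal sketch of the modality ablation described above (not the study's code).
# `query_model` is a hypothetical callable that takes a prompt and an optional
# image and returns a single answer letter.

def format_mcq(question, options):
    letters = "ABCDE"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines) + "\nAnswer with a single letter."

def accuracy(items, query_model, use_image):
    correct = 0
    for item in items:
        prompt = format_mcq(item["question"], item["options"])
        image = item["image"] if use_image else None  # withhold the image in the text-only run
        correct += int(query_model(prompt, image=image) == item["answer"])
    return correct / len(items)

# The gap between the two conditions estimates how much of the score actually
# depends on the image rather than on textual co-occurrence cues:
# gap = accuracy(items, query_model, use_image=True) - accuracy(items, query_model, use_image=False)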
A clinician-curated NEJM subset requiring visual input enabled tests of modality necessity by comparing performance to the 20% random baseline when images were withheld.
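For that check, the question is simply whether text-only accuracy exceeds what five-option guessing would produce. The sketch below uses a one-sided binomial test for this purpose; the test choice and the example counts (chosen to roughly match the ~37.7% text-only figure reported later for the 175-item subset) are illustrative assumptions, not the paper's procedure.

from scipy.stats import binomtest

# Sketch of the modality-necessity check: on items that genuinely require the
# image, a text-only run should not beat the 20% chance level of a
# five-option multiple-choice question.

def above_chance(n_correct, n_items, chance=0.20):
    result = binomtest(n_correct, n_items, chance, alternative="greater")
    return n_correct / n_items, result.pvalue

# Example counts approximating the ~37.7% text-only accuracy reported on the
# 175-item subset; a tiny p-value means the model is clearly exploiting
# text-side cues rather than the withheld image.
acc, pval = above_chance(66, 175)
print(f"text-only accuracy {acc:.1%}, p-value vs. 20% chance {pval:.2e}")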
Format manipulations disrupted surface cues. Answer options were randomly reordered without altering content. Distractors were progressively replaced with irrelevant choices from the same dataset, with a variant that substituted a single option with the token "Unknown." Visual substitution trials replaced original images with distractor-aligned alternatives while preserving question text and options.
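A schematic version of those perturbations might look like the following. The item structure (an "options" list plus the correct "answer" text) and the pool of irrelevant choices are assumptions made for illustration, not the study's implementation.

import random

# Schematic versions of the format manipulations described above.
# Each item is assumed to be a dict with "options" (list of strings) and
# "answer" (the correct option text).

def shuffle_options(item, rng):
    """Randomly reorder answer options without altering their content."""
    options = item["options"][:]
    rng.shuffle(options)
    return {**item, "options": options}

def replace_distractors(item, irrelevant_pool, n_replace, rng):
    """Swap n_replace incorrect options for irrelevant choices from the same dataset."""
    distractors = [o for o in item["options"] if o != item["answer"]]
    kept = distractors[: len(distractors) - n_replace]
    fillers = rng.sample(irrelevant_pool, n_replace)
    options = [item["answer"]] + kept + fillers
    rng.shuffle(options)
    return {**item, "options": options}

def add_unknown_option(item):
    """Variant that substitutes a single distractor with the token 'Unknown'."""
    distractors = [o for o in item["options"] if o != item["answer"]]
    return {**item, "options": [item["answer"]] + distractors[:-1] + ["Unknown"]}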
Across image-text benchmarks, removal of visual input produced marked accuracy drops on NEJM, with smaller shifts on JAMA. On NEJM, GPT-5 fell from 80.89% to 67.56%, Gemini-2.5 Pro from 79.95% to 65.01%, OpenAI-o3 from 80.89% to 67.03%, OpenAI-o4-mini from 75.91% to 66.49%, and GPT-4o from 66.90% to 37.28%.
On the JAMA benchmark, shifts were more modest, with GPT-5 moving from 86.59% to 82.91% and OpenAI-o3 from 84.75% to 82.65%.
On items that clinicians labeled as requiring visual input, text-only performance stayed above a 20% random baseline for most models. The 175-item NEJM subset yielded GPT-5 at 37.7%, Gemini-2.5 Pro at 37.1%, and OpenAI-o3 at 37.7%, while GPT-4o recorded 3.4% due to frequent refusals without the image.
Within format perturbations, random reordering of answer options reduced text-only accuracy while leaving image+text runs stable or slightly higher. GPT-5 shifted from 37.71% to 32.00% in text-only and from 66.28% to 70.85% in image+text; OpenAI-o3 shifted from 37.71% to 31.42% in text-only and from 61.71% to 64.00% in image+text.
Under distractor replacement, text-only accuracy declined toward chance as more options were substituted, while image+text accuracy rose. With all four distractors replaced, GPT-5 fell from 37.71% to 20.00% in text-only and rose from 66.28% to 90.86% in image+text. A single "Unknown" distractor increased text-only accuracy for several models, including GPT-5, which rose from 37.71% to 42.86%.
Across counterfactual visual substitutions that aligned images with distractor answers, accuracy collapsed. GPT-5 dropped from 83.33% to 51.67%, Gemini-2.5 Pro from 80.83% to 47.50%, and OpenAI-o3 from 76.67% to 52.50%. GPT-4o was the lone exception, improving from 36.67% to 41.67% under substitution.
Chain-of-thought prompting generally reduced accuracy on VQA-RAD and NEJM, with small gains for o4-mini. Audits documented correct answers paired with incorrect logic, hallucinated visual details, and stepwise image descriptions that did not guide final decisions.
Authors caution that medical benchmark scores do not directly reflect clinical readiness and that high leaderboard results can mask brittle behavior, shortcut use, and fabricated reasoning.
They recommend that medical AI evaluation include systematic stress testing, benchmark documentation detailing reasoning and visual demands, and reporting of robustness metrics alongside accuracy. Only through such practices, they argue, can progress in multimodal health AI be aligned with clinical trust and safety.
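In practice, that recommendation amounts to publishing perturbed-condition scores next to the headline number. A minimal sketch of such a report follows, assuming per-item correctness flags for each evaluation condition; the function names and data layout are illustrative, not drawn from the paper.

# Minimal sketch of reporting robustness alongside accuracy.
# `results` is assumed to map condition names to per-item correctness flags
# (1 = correct, 0 = wrong) on the same benchmark items.

def accuracy(flags):
    return sum(flags) / len(flags)

def robustness_report(results, baseline="original"):
    base = accuracy(results[baseline])
    report = {baseline: {"accuracy": base}}
    for condition, flags in results.items():
        if condition == baseline:
            continue
        acc = accuracy(flags)
        report[condition] = {"accuracy": acc, "drop_vs_baseline": base - acc}
    return report

# A model that looks strong on the original items but degrades sharply under
# option shuffling or distractor replacement shows a large drop_vs_baseline.
example = {
    "original":         [1, 1, 1, 0, 1],
    "shuffled_options": [1, 0, 0, 0, 1],
}
print(robustness_report(example))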
© 2025 Science X Network
More information: Yu Gu et al, The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks, arXiv (2025). DOI: 10.48550/arxiv.2509.18234
Journal information: Journal of the American Medical Association, arXiv, New England Journal of Medicine




