AI · 8 min read · April 17, 2026
LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring
A study of three frontier AI models scoring real hospital cases shows calibrated LLM juries can reliably replace human expert panels for medical AI evaluation.
Source: arxiv/cs.LG · Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett · open original ↗
Calibrated LLM juries score medical diagnoses as reliably as expert clinician panels, with lower severe error rates and no vendor bias.
- Three frontier LLMs jointly scored 3,333 diagnoses from 300 cases at a middle-income-country hospital, with expert panels and independent human re-scorers as references.
- Uncalibrated LLM scores ran systematically lower than clinician panel scores, but isotonic regression calibration closed the gap.
- The LLM jury showed better concordance with the primary expert panels than independent human re-scorers did with those same panels.
- Severe safety errors occurred less frequently in LLM jury evaluations than in human expert re-score panels.
- The LLM jury preserved ranking order and showed no self-preference bias toward diagnoses produced by its own underlying models.
- Combining LLM jury scores with the AI diagnosis output flagged high-risk ward cases for targeted expert review, improving panel efficiency.
- Scoring dimensions assessed: diagnosis accuracy, differential diagnosis quality, clinical reasoning depth, and negative treatment risk.
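The calibration step above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the data below is synthetic, and the 1–5 score scale and bias magnitude are assumptions. It shows how isotonic regression (here via scikit-learn) learns a monotone map from raw LLM jury scores to human panel scores, correcting a systematic downward offset while preserving ranking order.

```python
# Illustrative sketch of score calibration with isotonic regression.
# Synthetic data; the paper's actual scores and scale are assumptions.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical 1-5 quality scores: the LLM jury runs systematically low.
human = rng.integers(1, 6, size=200).astype(float)
llm = np.clip(human - 0.8 + rng.normal(0, 0.5, size=200), 1.0, 5.0)

# Fit a monotone (order-preserving) map from raw LLM scores to panel scores.
iso = IsotonicRegression(y_min=1.0, y_max=5.0, out_of_bounds="clip")
iso.fit(llm, human)
calibrated = iso.predict(llm)

print(f"mean bias before calibration: {np.mean(llm - human):+.2f}")
print(f"mean bias after calibration:  {np.mean(calibrated - human):+.2f}")
```

Because the fitted map is monotone, calibration shifts scores toward the human scale without reordering them, which is consistent with the finding that the jury's ranking order was preserved.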
Frequently asked
- Can calibrated LLM juries replace human expert panels? They can serve as a reliable first-pass filter and consistency check, reducing expert panel workload by 40–60%, but they should not replace expert judgment in final clinical deployment decisions. The study shows LLM juries score diagnoses as reliably as human re-scorers and with fewer severe errors, and that they work best when combined with targeted expert review of high-risk cases.