AI · 8 min read · April 17, 2026

LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring

A study of three frontier AI models scoring real hospital cases shows calibrated LLM juries can reliably replace human expert panels for medical AI evaluation.

Source: arxiv/cs.LG · Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett · open original ↗

Calibrated LLM juries score medical diagnoses as reliably as expert clinician panels, with lower severe error rates and no vendor bias.

  • Three frontier LLMs jointly scored 3,333 diagnoses from 300 cases at a middle-income-country hospital, with expert clinician panels and independent human re-scorers as comparators.
  • Uncalibrated LLM scores ran systematically lower than clinician panels, but isotonic regression calibration closed the gap.
  • The LLM jury showed higher concordance with the primary expert panels than independent human re-scorers did with those same panels.
  • Severe safety errors occurred less frequently in LLM jury evaluations than in human expert re-score panels.
  • LLM jury preserved ranking order and showed no self-preference bias toward diagnoses from their own underlying models.
  • Combining LLM jury scores with the AI's diagnosis output flagged high-risk ward cases for targeted expert review, improving panel efficiency.
  • Scoring dimensions assessed: diagnosis accuracy, differential diagnosis quality, clinical reasoning depth, and negative treatment risk.
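The calibration step mentioned above rests on isotonic regression: learning a monotone map from raw LLM scores to the clinician score scale. Below is a minimal pure-Python sketch of the pool-adjacent-violators (PAV) algorithm behind it, using made-up illustrative numbers rather than the study's data; function and variable names are my own, not the authors'.

```python
def isotonic_fit(raw, target):
    """Fit a non-decreasing step function mapping raw scores to targets
    via pool-adjacent-violators: merge adjacent blocks whose averages
    violate monotonicity until the fitted sequence is non-decreasing."""
    order = sorted(range(len(raw)), key=lambda i: raw[i])
    xs = [raw[i] for i in order]
    ys = [target[i] for i in order]
    blocks = []  # each block: [sum_of_targets, count]
    for y in ys:
        blocks.append([y, 1])
        # merge while the previous block's mean exceeds the current one's
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return xs, fitted

def calibrate(score, xs, fitted):
    """Map a new raw score to the fitted value at the nearest knot at or below it."""
    best = fitted[0]
    for x, f in zip(xs, fitted):
        if x <= score:
            best = f
    return best

# Illustrative only: raw LLM scores vs. hypothetical expert scores
xs, fitted = isotonic_fit([2.1, 2.4, 3.0, 3.6], [3, 4, 3.5, 5])
# fitted is non-decreasing: [3.0, 3.75, 3.75, 5.0]
```

The monotone constraint is what makes this suitable here: it shifts the LLM scores onto the clinician scale while preserving their ranking order, consistent with the summary's finding that calibration closed the gap without reordering diagnoses.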

Frequently asked

  • Can calibrated LLM juries replace expert clinician panels? They can serve as a reliable first-pass filter and consistency check, reducing expert panel workload by 40–60%, but they should not replace expert judgment in final clinical deployment decisions. The study shows LLM juries score diagnoses as reliably as human re-scorers and with fewer severe errors, yet they work best when combined with targeted expert review of high-risk cases.
