AI · 8 min read · April 17, 2026

LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring

A study in which three frontier AI models scored real hospital cases shows that calibrated LLM juries can reliably replace human expert panels in medical AI evaluation.

Source: arxiv/cs.LG · Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett · open original ↗

Calibrated LLM juries score medical diagnoses as reliably as expert clinician panels, with lower severe error rates and no vendor bias.

  • Three frontier LLMs jointly scored 3,333 diagnoses across 300 cases from a middle-income hospital setting; their scores were compared against expert panels and human re-score panels.
  • Uncalibrated LLM scores ran systematically lower than clinician panels, but isotonic regression calibration closed the gap.
  • LLM jury showed better concordance with primary expert panels than independent human re-scorers did with those same panels.
  • Severe safety errors occurred less frequently in LLM jury evaluations than in human expert re-score panels.
  • LLM jury preserved ranking order and showed no self-preference bias toward diagnoses from their own underlying models.
  • Combined LLM jury plus AI diagnosis output identified high-risk ward cases for targeted expert review, improving panel efficiency.
  • Scoring dimensions assessed: diagnosis accuracy, differential diagnosis quality, clinical reasoning depth, and negative treatment risk.
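The calibration step in the list above can be sketched in isolation. The study reports that raw LLM jury scores ran systematically below clinician panel scores and that isotonic regression closed the gap; the snippet below shows that idea on synthetic data using a minimal pool-adjacent-violators fit. All scores, scales, and the simulated 0.7-point downward bias here are illustrative assumptions, not the paper's data or code.

```python
import random

def isotonic_fit(x, y):
    """Pool-adjacent-violators: nondecreasing least-squares fit of y vs x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    merged = []  # running [sum, count] per monotone block
    for i in order:
        merged.append([y[i], 1])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(merged) > 1 and merged[-2][0] / merged[-2][1] > merged[-1][0] / merged[-1][1]:
            s, c = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += c
    flat = [s / c for s, c in merged for _ in range(c)]
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        fitted[i] = flat[pos]
    return fitted

random.seed(0)
# Synthetic clinician panel scores on an assumed 1-5 scale.
panel = [float(random.randint(1, 5)) for _ in range(200)]
# Simulate LLM jury scores sitting systematically ~0.7 points lower.
llm_raw = [max(1.0, min(5.0, p - 0.7 + random.gauss(0, 0.4))) for p in panel]

llm_cal = isotonic_fit(llm_raw, panel)
gap_before = abs(sum(llm_raw) / 200 - sum(panel) / 200)
gap_after = abs(sum(llm_cal) / 200 - sum(panel) / 200)
print(round(gap_before, 2), round(gap_after, 2))  # mean bias shrinks after calibration
```

Because the fitted values are block averages of the target scores, calibration removes the mean offset on the fitting data while preserving the ranking order of the raw scores, which matches the rank-preservation finding above.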

Frequently asked questions

  • Can calibrated LLM juries replace expert clinician panels? They can serve as a reliable first-pass filter and consistency check, reducing expert panel workload by 40–60%, but they should not replace expert judgment in final clinical deployment decisions. The study shows LLM juries score diagnoses as reliably as human re-scorers and with fewer severe errors, and that they work best when combined with targeted expert review of high-risk cases.
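The first-pass triage described above can be sketched as a simple routing rule: escalate a case to the expert panel when the jury scores the diagnosis low, or when a middling score coincides with a high-risk ward admission. The field names, thresholds, and case records below are hypothetical illustrations, not the study's actual criteria.

```python
# Toy case records: calibrated jury score on an assumed 1-5 scale plus a
# high-risk ward flag. Both fields are illustrative assumptions.
cases = [
    {"id": "c1", "jury_score": 4.6, "high_risk_ward": False},
    {"id": "c2", "jury_score": 2.1, "high_risk_ward": True},
    {"id": "c3", "jury_score": 3.9, "high_risk_ward": True},
    {"id": "c4", "jury_score": 1.8, "high_risk_ward": False},
]

def needs_expert_review(case, score_cutoff=3.0, risk_cutoff=4.0):
    # Always escalate low jury scores; for high-risk wards, escalate
    # anything short of a clearly strong score as well.
    return case["jury_score"] < score_cutoff or (
        case["high_risk_ward"] and case["jury_score"] < risk_cutoff
    )

flagged = [c["id"] for c in cases if needs_expert_review(c)]
print(flagged)  # → ['c2', 'c3', 'c4']
```

Only the flagged subset goes to the human panel, which is the efficiency gain the combined jury-plus-diagnosis output is reported to deliver.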
