AI · 4 min read · April 17, 2026
Retrieval beats memorization for clinical code selection
A two-stage retrieval-then-classify method outperforms direct LLM generation for assembling clinical value sets from large standardized vocabularies.
Source: arxiv/cs.LG · Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons · open original ↗
Retrieve similar existing value sets, then classify candidates to build clinical code lists more accurately than direct LLM generation.
- Clinical value set authoring identifies all codes that define a medical concept in a standardized vocabulary.
- LLMs cannot reliably recall large, versioned clinical vocabularies from pretraining alone.
- RASC retrieves the K most similar existing value sets, then applies a classifier to rank the pooled candidate codes.
- A cross-encoder built on SAPBert achieves AUROC 0.852 and F1 0.298, beating zero-shot GPT-4o's F1 of 0.105.
- Retrieve-then-classify reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4.
- The performance gap widens as value set size grows, confirming the theoretical advantage of shrinking the output space.
- The benchmark, built from 11,803 publicly available VSAC value sets, is the first large-scale dataset for this task.
- Gains replicate across the SAPBert cross-encoder, LightGBM, and other classifier architectures.
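The two-stage pipeline described above can be sketched in a few lines. Everything here is illustrative: the toy embeddings, the `score_fn` stand-in for a classifier, and the threshold are assumptions for demonstration, not the paper's SAPBert cross-encoder or its actual parameters.

```python
import numpy as np

def retrieve_then_classify(query_emb, valueset_embs, valueset_codes,
                           score_fn, k=2, threshold=0.5):
    """Sketch of retrieve-then-classify: find the K existing value sets
    most similar to the query concept, pool their codes as candidates,
    then keep candidates the classifier scores above a threshold."""
    # Stage 1: cosine similarity between the query and each value set embedding.
    norms = np.linalg.norm(valueset_embs, axis=1) * np.linalg.norm(query_emb)
    sims = (valueset_embs @ query_emb) / norms
    top_k = np.argsort(sims)[::-1][:k]
    # Pool candidate codes from the K most similar value sets; this is the
    # shrunken output space that gives the method its advantage.
    candidates = sorted({c for i in top_k for c in valueset_codes[i]})
    # Stage 2: the classifier filters/ranks candidates for the query concept.
    return [c for c in candidates if score_fn(query_emb, c) >= threshold]

# Toy demo: three 3-dim "embeddings" for existing value sets and their codes.
sets = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
codes = [["I10", "I11"], ["I10", "I12"], ["E11"]]
query = np.array([1.0, 0.05, 0.0])  # close to the first two value sets
picked = retrieve_then_classify(
    query, sets, codes,
    score_fn=lambda q, c: 0.9 if c.startswith("I1") else 0.1)
# → ["I10", "I11", "I12"]: codes pooled from the two nearest value sets
```

In the paper's setup the `score_fn` role is played by a trained classifier such as the SAPBert cross-encoder or LightGBM; the key point the sketch preserves is that generation never happens over the full vocabulary, only over retrieved candidates.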
Frequently asked
- Why retrieve instead of generating codes directly? LLMs do not reliably memorize large, versioned clinical vocabularies. RASC retrieves similar existing value sets to form a candidate pool, shrinking the effective output space, and a classifier then ranks those candidates, avoiding hallucinated codes. This two-stage approach reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4.