AI · 4 min read · April 17, 2026
Retrieval beats memorization for clinical code selection
A two-stage retrieval-then-classify method outperforms direct LLM generation for assembling clinical value sets from large standardized vocabularies.
Source: arxiv/cs.LG · Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons · open original ↗
Retrieve similar existing value sets, then classify candidates to build clinical code lists more accurately than direct LLM generation.
- Clinical value set authoring identifies all codes that define a medical concept in a standardized vocabulary.
- LLMs cannot reliably recall large, versioned clinical vocabularies from pretraining alone.
- RASC retrieves the K most similar existing value sets, then applies a classifier to rank the pooled candidate codes.
- A cross-encoder built on SAPBert achieves AUROC 0.852 and F1 0.298, beating zero-shot GPT-4o's F1 of 0.105.
- Retrieve-then-classify reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4.
- The performance gap widens as value set size grows, confirming the theoretical advantage of shrinking the output space.
- The benchmark, built from 11,803 publicly available VSAC value sets, is the first large-scale dataset for this task.
- Gains replicate across the SAPBert cross-encoder, LightGBM, and other classifier architectures.
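The two-stage pipeline described above can be sketched in a few lines. Everything here is illustrative: the toy embeddings, the `score_fn` stand-in for a classifier, and the threshold are assumptions for demonstration, not the paper's SAPBert cross-encoder or its actual parameters.

```python
import numpy as np

def retrieve_then_classify(query_emb, valueset_embs, valueset_codes,
                           score_fn, k=2, threshold=0.5):
    """Sketch of retrieve-then-classify: find the K existing value sets
    most similar to the query concept, pool their codes as candidates,
    then keep candidates the classifier scores above a threshold."""
    # Stage 1: cosine similarity between the query and each value set embedding.
    norms = np.linalg.norm(valueset_embs, axis=1) * np.linalg.norm(query_emb)
    sims = (valueset_embs @ query_emb) / norms
    top_k = np.argsort(sims)[::-1][:k]
    # Pool candidate codes from the K most similar value sets; this is the
    # shrunken output space that gives the method its advantage.
    candidates = sorted({c for i in top_k for c in valueset_codes[i]})
    # Stage 2: the classifier filters/ranks candidates for the query concept.
    return [c for c in candidates if score_fn(query_emb, c) >= threshold]

# Toy demo: three 3-dim "embeddings" for existing value sets and their codes.
sets = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
codes = [["I10", "I11"], ["I10", "I12"], ["E11"]]
query = np.array([1.0, 0.05, 0.0])  # close to the first two value sets
picked = retrieve_then_classify(
    query, sets, codes,
    score_fn=lambda q, c: 0.9 if c.startswith("I1") else 0.1)
# → ["I10", "I11", "I12"]: codes pooled from the two nearest value sets
```

In the paper's setup the `score_fn` role is played by a trained classifier such as the SAPBert cross-encoder or LightGBM; the key point the sketch preserves is that generation never happens over the full vocabulary, only over retrieved candidates.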
Frequently asked
- Why retrieve instead of generating codes directly? LLMs do not reliably memorize large, versioned clinical vocabularies. RASC retrieves similar existing value sets to form a candidate pool, shrinking the effective output space, and a classifier then ranks those candidates, avoiding hallucinated codes. This two-stage approach reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4.