Artificial Intelligence · 4 min read · 17 April 2026
Retrieval-Augmented Set Completion for Clinical Code Authoring
A two-stage approach retrieves similar clinical value sets, then classifies candidates, outperforming direct LLM generation on standardized medical vocabularies.
Source: arxiv/cs.LG · Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons · open original ↗
Retrieve similar clinical value sets, then classify candidates with a fine-tuned model, reducing hallucination and improving code selection accuracy.
- Clinical value set authoring identifies all codes representing a medical concept in standardized vocabularies.
- Direct LLM prompting fails because vocabularies are large, versioned, and not reliably memorized.
- RASC retrieves the K most similar existing sets from a corpus, then applies a classifier to each candidate code.
- A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852, outperforming an MLP (0.799) and GPT-4o zero-shot (F1 0.105).
- 48.6% of the codes GPT-4o returns are absent from the official vocabulary, indicating hallucination.
- A retrieval-only baseline produces 12.3 irrelevant codes per true positive; the classifiers reduce this to 3.2–4.4.
- The performance gap widens as value set size increases, confirming the theoretical advantage of shrinking the output space.
- A benchmark dataset of 11,803 VSAC value sets enables reproducible evaluation.
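The two-stage retrieve-then-classify flow can be sketched as below. This is a minimal toy illustration, not the paper's method: the bag-of-words similarity stands in for a neural retriever, and the vote-based `score` stands in for the fine-tuned SAPBert cross-encoder; the corpus, concept names, and codes are all hypothetical.

```python
# Toy sketch of the RASC pipeline: (1) retrieve the K most similar
# existing value sets, (2) pool their codes as candidates and score each.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a neural encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rasc_complete(concept: str, corpus: dict, k: int = 2,
                  threshold: float = 0.75) -> set:
    # Stage 1: retrieve the K value sets whose names are most similar.
    ranked = sorted(corpus,
                    key=lambda name: cosine(embed(concept), embed(name)),
                    reverse=True)[:k]
    # Candidate pool = union of codes from the retrieved sets.
    candidates = set().union(*(corpus[name] for name in ranked))
    # Stage 2: 'classify' each candidate; here the score is simply the
    # fraction of retrieved sets that contain the code.
    def score(code: str) -> float:
        return sum(code in corpus[name] for name in ranked) / k
    return {c for c in candidates if score(c) >= threshold}

corpus = {
    "diabetes mellitus type 2": {"E11.9", "E11.65"},
    "diabetes mellitus type 1": {"E10.9", "E11.9"},
    "essential hypertension":   {"I10"},
}
print(sorted(rasc_complete("type 2 diabetes mellitus", corpus)))  # → ['E11.9']
```

Because candidates are drawn only from existing value sets, every output code is guaranteed to exist in the corpus, which is the structural reason retrieval-based completion cannot hallucinate out-of-vocabulary codes.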
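The "irrelevant codes per true positive" figure quoted above is a ratio of false positives to true positives. A minimal sketch, with toy code sets rather than the paper's data:

```python
# Irrelevant codes per true positive: false positives divided by true
# positives, used to compare retrieval-only vs. classifier-filtered output.
def irrelevant_per_tp(predicted: set, truth: set) -> float:
    tp = len(predicted & truth)
    return len(predicted - truth) / tp if tp else float("inf")

truth = {"E11.9", "E11.65"}
retrieval_only = {"E11.9", "E11.65", "E10.9", "I10", "Z79.4"}  # toy example
print(irrelevant_per_tp(retrieval_only, truth))  # → 1.5
```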
Frequently asked questions
- **Why can't LLMs generate clinical codes directly?** Large language models are not reliably trained on the full, versioned clinical vocabularies (e.g., SNOMED CT, ICD-10). They hallucinate codes that do not exist in the official systems, creating compliance and data integrity risks. Retrieval-augmented approaches ground the model in a curated corpus of real codes, eliminating out-of-vocabulary hallucinations.
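Even without retrieval, out-of-vocabulary hallucinations can be detected with a membership check against the official code list. A minimal sketch, assuming access to such a list; `OFFICIAL_CODES` is a tiny hypothetical stand-in for a full vocabulary release:

```python
# Hypothetical post-hoc guard: any generated code not present in the
# official vocabulary release is flagged as a hallucination.
OFFICIAL_CODES = {"E11.9", "E11.65", "E10.9", "I10"}  # toy stand-in

def filter_hallucinations(generated):
    valid = [c for c in generated if c in OFFICIAL_CODES]
    hallucinated = [c for c in generated if c not in OFFICIAL_CODES]
    return valid, hallucinated

valid, bad = filter_hallucinations(["E11.9", "E99.99", "I10"])
print(valid, bad)  # → ['E11.9', 'I10'] ['E99.99']
```

A check like this can flag hallucinated codes, but unlike the retrieval-based pipeline it cannot recover the valid codes the model failed to produce.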