AI · 8 min read · May 1, 2026
LLMs Need Feedback Loops to Keep Code and Theory Aligned
Researchers propose Comet-H, a system that orchestrates language models through iterative cycles to prevent hallucination and desynchronization in research software development.
LLMs drift when code, theory, and claims evolve separately; Comet-H couples them via iterative prompting and workspace state tracking.
- LLMs generate code and text well but struggle when specifications change mid-project.
- Hallucination accumulation: unsupported claims propagate across sessions without grounding.
- Desynchronization: code, theory, and the model's internal world model fall out of sync.
- Comet-H uses a contextual bandit approach to select prompts based on workspace deficits.
- A controller tracks unfinished work with a decay function and re-validates docs against code.
- The A3 static-analysis tool, built entirely within Comet-H, reached F1 = 0.768 versus a 0.364 baseline.
- Audit-and-contraction passes dominate successful project trajectories in later phases.
- Transparent scoring and fading work records make each prompt choice legible and bounded.
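The bandit-style prompt selection and the decaying record of unfinished work can be sketched together. This is a minimal illustration, not the Comet-H implementation: the arm names, the linear scoring rule, the epsilon-greedy choice, and the exponential half-life decay are all assumptions made here for clarity.

```python
import random

# Hypothetical arm names: each "arm" is a prompt type the controller can issue.
PROMPT_ARMS = ["expand_code", "expand_theory", "audit", "contract"]

class WorkItem:
    """An unfinished-work record whose weight fades over time (assumed exponential decay)."""
    def __init__(self, description, created_step, half_life=5.0):
        self.description = description
        self.created_step = created_step
        self.half_life = half_life

    def weight(self, step):
        # Older unresolved items fade unless renewed, bounding their influence.
        age = step - self.created_step
        return 0.5 ** (age / self.half_life)

def score_arm(arm, deficits, open_items, step):
    """Transparent linear score: workspace-deficit signal plus backlog pressure."""
    base = deficits.get(arm, 0.0)
    backlog = sum(w.weight(step) for w in open_items
                  if w.description.startswith(arm))
    return base + backlog

def choose_prompt(deficits, open_items, step, epsilon=0.1, rng=random):
    # Epsilon-greedy contextual bandit: usually exploit the top-scoring arm,
    # occasionally explore a random one.
    if rng.random() < epsilon:
        return rng.choice(PROMPT_ARMS)
    return max(PROMPT_ARMS,
               key=lambda a: score_arm(a, deficits, open_items, step))
```

Because every score is a readable sum of named terms, each prompt choice stays legible, and the decay keeps any single stale work record from dominating forever.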
Frequently asked
- What is the difference between hallucination accumulation and desynchronization? Hallucination accumulation occurs when unsupported claims an LLM makes in one session are treated as fact in later sessions, propagating errors. Desynchronization happens when the code, the mathematical theory, and the model's internal understanding of the project fall out of alignment, so the model generates inconsistent or contradictory outputs. Both arise because LLMs lack persistent workspace state across sessions.