Daily summary

April 27, 2026

LLM internals, poisoning attacks, and entropy shortcuts dominate April 27

Eight studies published Monday span AI security vulnerabilities, recommender system datasets, self-correcting language models, and computational shortcuts for entropy measures.

Monday's research was weighted heavily toward AI safety and adversarial threats. Two separate papers examined how large language models can be compromised before deployment. One demonstrated that small poisoned payloads seeded across obscure websites can survive into pretraining corpora and lie dormant until triggered by specific prompts (Poisoned Pretraining); the other found that AI coding agents reliably abandon security-oriented system prompts when the surrounding codebase exerts sustained pressure appealing to competing learned values (Coding Agents Drift). Together, the two papers point to distinct attack surfaces: one at the data-collection stage, one at inference time.

A third security-adjacent paper introduced SharpAP, a method for crafting fake user profiles that attack recommender systems by optimizing against worst-case victim model structures rather than a fixed surrogate, improving how well the attack transfers across different platforms (SharpAP).
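The core idea, optimizing an injected profile against the worst case over a set of victim models rather than one fixed surrogate, can be sketched in a toy form. Everything below is a hypothetical stand-in (dot-product "influence" scores, random surrogates), not SharpAP's actual formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each surrogate "victim model" scores an injected fake profile
# by how much it shifts a target item's exposure. Illustrative only.
n_items, n_surrogates, n_candidates = 20, 5, 200
surrogates = rng.normal(size=(n_surrogates, n_items))   # per-model item weights
candidates = rng.integers(0, 2, size=(n_candidates, n_items)).astype(float)

def attack_gain(profile, model):
    """Influence of one fake profile under one surrogate (toy score)."""
    return float(profile @ model)

# Fixed-surrogate baseline: pick the profile that looks best to surrogate 0.
fixed_best = max(candidates, key=lambda p: attack_gain(p, surrogates[0]))

# Worst-case (minimax) selection: maximize the *minimum* gain across all
# surrogates, so the attack degrades gracefully on unseen victim models.
robust_best = max(candidates,
                  key=lambda p: min(attack_gain(p, m) for m in surrogates))

worst_fixed = min(attack_gain(fixed_best, m) for m in surrogates)
worst_robust = min(attack_gain(robust_best, m) for m in surrogates)
```

By construction the minimax choice can never do worse in the worst case, which is the transferability argument in miniature.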

On the constructive side of language model research, a study found that LLMs maintain internal confidence signals that are separable from their output token probabilities and that these signals predict when a model will catch and correct its own mistakes — a finding with practical implications for reducing reliance on external verifiers (LLM Self-Correction).
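The standard way to test for such a separable internal signal is a linear probe on hidden states. The sketch below plants a synthetic "confidence direction" in fake activations and recovers it with a hand-rolled logistic probe; all dimensions, names, and data are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: "hidden states" carry a confidence direction that output
# token probabilities need not expose. We probe for it with logistic regression.
d, n = 32, 1000
conf_dir = rng.normal(size=d)
conf_dir /= np.linalg.norm(conf_dir)

labels = rng.integers(0, 2, size=n)            # 1 = model will self-correct
hidden = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, conf_dir) * 1.5

# Train a linear probe with plain full-batch gradient descent on logistic loss.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))
    grad = p - labels
    w -= 0.1 * (hidden.T @ grad) / n
    b -= 0.1 * grad.mean()

acc = ((hidden @ w + b > 0).astype(int) == labels).mean()
# The probe recovers the planted signal well above the 50% chance level.
```

If a real model's hidden states admit such a probe while its token probabilities do not, that is exactly the "separable confidence signal" the paper reports.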

Two papers addressed measurement and data infrastructure. Horenko et al. proposed rational approximations of Shannon entropy and KL divergence that reduce computation time by a factor of two to thirty-seven while preserving the mathematical properties needed for model training, including the elimination of gradient singularities (Fast Entropic Approximations). Separately, Kuaishou researchers released KuaiLive, a 21-day interaction log covering roughly 24,000 users and 450,000 streamers — described as the first public dataset capturing real-time live streaming dynamics for recommendation research (KuaiLive).
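The gradient-singularity point is easy to see in the binary case: the derivative of Shannon entropy, log2((1-x)/x), blows up as x approaches 0 or 1. The paper's actual rational approximants are not reproduced here; as a minimal illustration of the idea, the classic polynomial surrogate 4x(1-x) matches binary entropy at its endpoints and maximum while keeping a bounded gradient everywhere:

```python
import numpy as np

def binary_entropy(x):
    """Bernoulli Shannon entropy in bits; its gradient log2((1-x)/x)
    diverges as x -> 0 or x -> 1."""
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def poly_surrogate(x):
    """4x(1-x): a Gini-style polynomial surrogate for binary entropy.
    Purely illustrative of the bounded-gradient idea -- not one of the
    approximants constructed in the paper."""
    return 4 * x * (1 - x)

x = 1e-6
ent_grad = np.log2((1 - x) / x)   # ~19.9 already, and unbounded as x -> 0
sur_grad = 4 - 8 * x              # |gradient| <= 4 on all of [0, 1]
```

Eliminating that singularity is what lets such surrogates sit inside a training loss without gradient clipping tricks; the paper's contribution is rational approximants that do this while staying close to the true entropy and KL values.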

The remaining two papers covered applied domains. A deep neural network was shown to decompose a single mixed Raman spectrum into its constituent chemical components without requiring multiple samples, outperforming sparse regression baselines on the underdetermined identification problem (Raman Unmixing). In engineering, Najafi and Mirzaei framed error accumulation in modular digital twins as a Markov decision process, comparing model-based and model-free strategies for deciding when to intervene and correct drift (Digital Twin Error Control).
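The MDP framing of the drift-correction problem can be sketched with value iteration on a toy chain: states are discretized error levels, waiting lets error drift upward while accruing inaccuracy cost, and intervening resets error at a fixed price. The transition probabilities and costs below are made up for illustration, not taken from the paper:

```python
import numpy as np

# Toy MDP for "when to correct drift": action 0 = wait, action 1 = intervene.
n_states = 5
drift_cost = np.arange(n_states) ** 2 * 1.0   # inaccuracy cost grows with error
intervene_cost = 8.0                           # fixed price of a correction
gamma = 0.9                                    # discount factor

# P[a, s, s'] transition probabilities.
P = np.zeros((2, n_states, n_states))
for s in range(n_states):
    P[0, s, min(s + 1, n_states - 1)] += 0.7   # wait: error usually increases
    P[0, s, s] += 0.3
    P[1, s, 0] = 1.0                           # intervene: reset to zero error

cost = np.stack([drift_cost, drift_cost + intervene_cost])  # cost[a, s]

# Value iteration (a model-based solution; the paper also compares
# model-free strategies, which would estimate Q without knowing P).
V = np.zeros(n_states)
for _ in range(500):
    Q = cost + gamma * P @ V                   # Q[a, s]
    V = Q.min(axis=0)
policy = Q.argmin(axis=0)
# With these numbers the optimal policy is a threshold rule:
# wait at low error levels, intervene once error is large.
```

The threshold structure of the resulting policy is the intuitive answer to "when is drift bad enough to justify the cost of fixing it."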