How does Stealth Pretraining Seeding differ from standard data poisoning?

Standard poisoning injects obvious malicious examples into training data. SPS distributes tiny, benign-looking payloads across many obscure websites, relying on web crawlers to absorb them into pretraining corpora. Each payload is individually undetectable, but collectively they embed dormant logic that activates only when triggered by specific prompts. This makes the attack harder to catch during dataset construction.

What are the geometric diagnostics used to detect latent poisoning?

The paper introduces three diagnostic methods: Thermodynamic Length measures how the model's behavior changes under adversarial perturbations; Spectral Curvature examines the curvature of the loss landscape to identify hidden decision boundaries; and the Infection Traceback Graph traces which training examples or data sources contributed to suspicious model outputs. Together, they help reveal vulnerabilities that standard evaluations miss.

Can standard safety alignment defenses prevent PermaFrost attacks?

No. The paper shows that common alignment techniques (RLHF, constitutional AI, adversarial training) often fail to catch dormant poisoning because they evaluate the model on standard benchmarks and prompts. Since poisoned logic only activates via specific triggers, it remains invisible during typical safety testing. Defenders must adopt new evaluation methods that probe for conditional or latent behaviors.

← Content

AI · 8 min read · April 27, 2026

Poisoned Pretraining: Hidden Attacks Embedded in LLM Training Data

Researchers demonstrate how adversaries can plant dormant malicious logic in large language models by seeding poisoned content across obscure websites, evading detection until triggered.

Source: arxiv/cs.AI · Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das · open original ↗ ↗

Share: X LinkedIn

Adversaries can inject tiny poisoned payloads into web-crawled pretraining data, creating latent vulnerabilities that activate only when triggered by specific prompts.

— Stealth Pretraining Seeding distributes small malicious content across obscure websites exposed to web crawlers.
— Individual payloads remain small, diffuse, and superficially harmless to evade dataset filtering.
— Poisoned logic lies dormant during standard evaluation but activates via precise alphanumeric triggers.
— PermaFrost-Attack framework uses geometric diagnostics to detect latent model behavior and vulnerabilities.
— Attack succeeds across multiple model families and scales while often bypassing alignment defenses.
— Threat exploits the dependency of LLMs on web-scale pretraining from sources like Common Crawl.
— Standard safety evaluations miss dormant threats because they do not probe for hidden conditional logic.

Frequently asked

Standard poisoning injects obvious malicious examples into training data. SPS distributes tiny, benign-looking payloads across many obscure websites, relying on web crawlers to absorb them into pretraining corpora. Each payload is individually undetectable, but collectively they embed dormant logic that activates only when triggered by specific prompts. This makes the attack harder to catch during dataset construction.

#llm #security #poisoning #pretraining #adversarial

Poisoned Pretraining: Hidden Attacks Embedded in LLM Training Data

Frequently asked

Synthetic Computers Enable Agent Training at Scale

ActiNet: Self-Supervised Model Improves Wrist Activity Classification

Mixed Precision Training Stabilizes Neural ODEs