Engineering · 4 min read · April 21, 2026
Kernel-Level LLM Safety via Logit Inspection
ProbeLogits is a kernel primitive that reads token logits before generation to classify and block unsafe outputs at the OS level, without learned parameters, achieving parity with learned classifiers at 2.5x speed.
- ProbeLogits performs one forward pass and reads specific token logits to detect unsafe agent actions.
- Achieves a 97–99% block rate on HarmBench and F1=0.812 on ToxicChat, matching or exceeding Llama Guard 3.
- Runs 2.5x faster than token-generation classifiers; bare-metal latency is 65 ms.
- Uses a calibration strength, alpha, as a deployment-time policy knob instead of learned weights.
- Implemented in Anima OS (86k lines of Rust); operates below the WASM sandbox, making it harder to circumvent.
- Contextual calibration corrects asymmetric verbalizer bias across model and prompt pairs.
- Tested on Qwen 2.5-7B, Llama 3 8B, and Mistral 7B against three external benchmarks.
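The single-pass probe described above can be sketched as follows. This is a minimal illustration, not the Anima OS implementation: the verbalizer token pair, the two-way softmax, and the 0.5 threshold are all assumptions for the sake of the example.

```python
import math

def probe_unsafe_prob(logit_safe: float, logit_unsafe: float) -> float:
    """Turn two probed verbalizer logits into P(unsafe) via a 2-way softmax.

    Hypothetical interface: the real choice of verbalizer tokens is
    model-specific and not specified here.
    """
    m = max(logit_safe, logit_unsafe)  # subtract max for numerical stability
    e_safe = math.exp(logit_safe - m)
    e_unsafe = math.exp(logit_unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)

def should_block(logit_safe: float, logit_unsafe: float,
                 threshold: float = 0.5) -> bool:
    """Block the action when the probed P(unsafe) crosses a policy threshold."""
    return probe_unsafe_prob(logit_safe, logit_unsafe) >= threshold
```

Because the decision needs only one forward pass and a handful of logit reads, there is no token-by-token generation loop, which is where the claimed 2.5x speedup over generation-based classifiers comes from.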
Frequently asked
- How does ProbeLogits differ from Llama Guard 3? ProbeLogits reads a single logit value before token generation, while Llama Guard 3 generates a full classification token sequence. ProbeLogits is 2.5x faster and requires no learned parameters, only a calibration scalar (alpha). On ToxicChat, ProbeLogits achieves F1=0.812 versus Llama Guard 3's baseline, with some model-verbalizer pairs exceeding it by 4.4 percentage points.
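One way to picture alpha as a policy knob is the following calibration sketch. It assumes a common form of contextual calibration: each class probability is discounted by the model's content-free prior raised to alpha, then renormalized. The function name and the exact discounting rule are illustrative assumptions, not the published method.

```python
def calibrated_unsafe_prob(p_unsafe: float, prior_unsafe: float,
                           alpha: float) -> float:
    """Illustrative contextual calibration with strength alpha.

    Discount each class probability by the model's content-free prior
    raised to alpha, then renormalize. alpha=0 leaves the raw probability
    unchanged; alpha=1 fully corrects the verbalizer bias.
    """
    w_unsafe = p_unsafe / (prior_unsafe ** alpha)
    w_safe = (1.0 - p_unsafe) / ((1.0 - prior_unsafe) ** alpha)
    return w_unsafe / (w_unsafe + w_safe)
```

For example, a model whose verbalizer is heavily biased toward "unsafe" (content-free prior 0.9) can emit a raw P(unsafe) of 0.88 on a benign action; with alpha=1 the calibrated probability falls below a 0.5 block threshold, while alpha=0 would leave the action blocked. Tuning alpha therefore shifts policy strictness at deployment time without retraining anything.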