Engineering · 4 min read · April 21, 2026

Kernel-Level LLM Safety via Logit Inspection

ProbeLogits reads token probabilities before generation to enforce safety policies at the OS level, achieving parity with learned classifiers at 2.5x speed.

Source: arXiv cs.LG · Daeyeon Son

A kernel primitive that inspects LLM logits before token generation to classify and block unsafe outputs without learned parameters.

  • ProbeLogits performs one forward pass and reads specific token logits to detect unsafe agent actions.
  • Achieves 97–99% block rate on HarmBench and F1=0.812 on ToxicChat, matching or exceeding Llama Guard 3.
  • Runs 2.5x faster than token-generation classifiers; bare-metal latency is 65 ms.
  • Uses calibration strength alpha as a deployment-time policy knob instead of learned weights.
  • Implemented in Anima OS (86k lines of Rust); operates below the WASM sandbox, making it harder to circumvent than userspace filters.
  • Contextual calibration corrects verbalizer bias asymmetry across model and prompt pairs.
  • Tested on Qwen 2.5-7B, Llama 3 8B, and Mistral 7B with three external benchmarks.
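The core mechanism above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the Anima OS implementation: it assumes the probe reads the logits of two verbalizer tokens ("safe"/"unsafe") after a single forward pass, subtracts an alpha-scaled bias term measured on a content-free prompt (one common form of contextual calibration), and blocks when the calibrated probability of "unsafe" exceeds 0.5. The function names `p_unsafe` and `should_block` are hypothetical.

```rust
/// P(unsafe) from two verbalizer logits via a numerically stable 2-way softmax.
fn p_unsafe(logit_unsafe: f64, logit_safe: f64) -> f64 {
    let m = logit_unsafe.max(logit_safe);
    let eu = (logit_unsafe - m).exp();
    let es = (logit_safe - m).exp();
    eu / (eu + es)
}

/// Contextual calibration sketch (assumed form): shift the "unsafe" logit by
/// alpha times the log-odds measured on a content-free prompt, then threshold.
/// `alpha` is the deployment-time policy knob; no learned weights are involved.
fn should_block(
    logit_unsafe: f64,
    logit_safe: f64,
    cf_unsafe: f64, // "unsafe" logit on a content-free prompt
    cf_safe: f64,   // "safe" logit on a content-free prompt
    alpha: f64,
) -> bool {
    let bias = cf_unsafe - cf_safe; // verbalizer bias as log-odds
    p_unsafe(logit_unsafe - alpha * bias, logit_safe) > 0.5
}

fn main() {
    // Hypothetical logits from one forward pass over the agent's pending action.
    let blocked = should_block(3.0, 1.0, 2.0, 0.5, 1.0);
    println!("block = {blocked}"); // prints "block = true"
}
```

Because the decision is a threshold on a single calibrated value, tightening or loosening policy at deploy time means changing `alpha`, with no retraining or weight updates.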

Frequently asked

  • How does ProbeLogits differ from Llama Guard 3? ProbeLogits reads a single logit value before token generation, while Llama Guard 3 generates a full classification token sequence. ProbeLogits is 2.5x faster and requires no learned parameters, only a calibration scalar (alpha). On ToxicChat, ProbeLogits achieves F1=0.812, matching Llama Guard 3's baseline, with some model-verbalizer pairs exceeding it by 4.4 percentage points.
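The speed gap follows from a simple cost model, which is an assumption of this sketch rather than a figure from the paper: a logit probe costs one forward pass, while a generative classifier costs roughly one pass per emitted classification token.

```rust
/// A probe reads logits after a single forward pass.
fn probe_latency_ms(pass_ms: f64) -> f64 {
    pass_ms
}

/// A generative classifier runs ~one forward pass per classification token
/// (assumed cost model; ignores KV-cache effects for simplicity).
fn generative_latency_ms(pass_ms: f64, tokens: u32) -> f64 {
    pass_ms * tokens as f64
}

fn main() {
    let pass_ms = 65.0; // the reported bare-metal probe latency
    let speedup = generative_latency_ms(pass_ms, 3) / probe_latency_ms(pass_ms);
    println!("speedup = {speedup}x"); // prints "speedup = 3x"
}
```

Under this model, the reported 2.5x speedup is consistent with a classifier that emits only a handful of tokens per verdict; the gap would widen for longer classification outputs.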
