Engineering · 4 min read · April 21, 2026

Kernel-Level LLM Safety via Logit Inspection

ProbeLogits reads token probabilities before generation to enforce safety policies at the OS level, achieving parity with learned classifiers at 2.5x speed.

Source: arXiv cs.LG · Daeyeon Son

A kernel primitive that inspects LLM logits before token generation to classify and block unsafe outputs without learned parameters.

  • ProbeLogits performs one forward pass and reads specific token logits to detect unsafe agent actions.
  • Achieves 97–99% block rate on HarmBench and F1=0.812 on ToxicChat, matching or exceeding Llama Guard 3.
  • Runs 2.5x faster than token-generation classifiers; bare-metal latency is 65 ms.
  • Uses calibration strength alpha as a deployment-time policy knob instead of learned weights.
  • Implemented in Anima OS (86k lines of Rust); operates below the WASM sandbox, making it harder to circumvent than userspace filters.
  • Contextual calibration corrects verbalizer bias asymmetry across model and prompt pairs.
  • Tested on Qwen 2.5-7B, Llama 3 8B, and Mistral 7B with three external benchmarks.
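The core mechanism above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the Anima OS implementation: it assumes the probe reads the logits of two verbalizer tokens ("safe"/"unsafe") after a single forward pass, subtracts an alpha-scaled bias term measured on a content-free prompt (one common form of contextual calibration), and blocks when the calibrated probability of "unsafe" exceeds 0.5. The function names `p_unsafe` and `should_block` are hypothetical.

```rust
/// P(unsafe) from two verbalizer logits via a numerically stable 2-way softmax.
fn p_unsafe(logit_unsafe: f64, logit_safe: f64) -> f64 {
    let m = logit_unsafe.max(logit_safe);
    let eu = (logit_unsafe - m).exp();
    let es = (logit_safe - m).exp();
    eu / (eu + es)
}

/// Contextual calibration sketch (assumed form): shift the "unsafe" logit by
/// alpha times the log-odds measured on a content-free prompt, then threshold.
/// `alpha` is the deployment-time policy knob; no learned weights are involved.
fn should_block(
    logit_unsafe: f64,
    logit_safe: f64,
    cf_unsafe: f64, // "unsafe" logit on a content-free prompt
    cf_safe: f64,   // "safe" logit on a content-free prompt
    alpha: f64,
) -> bool {
    let bias = cf_unsafe - cf_safe; // verbalizer bias as log-odds
    p_unsafe(logit_unsafe - alpha * bias, logit_safe) > 0.5
}

fn main() {
    // Hypothetical logits from one forward pass over the agent's pending action.
    let blocked = should_block(3.0, 1.0, 2.0, 0.5, 1.0);
    println!("block = {blocked}"); // prints "block = true"
}
```

Because the decision is a threshold on a single calibrated value, tightening or loosening policy at deploy time means changing `alpha`, with no retraining or weight updates.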

Frequently asked

  • How does ProbeLogits differ from Llama Guard 3? ProbeLogits reads a single logit value before token generation, while Llama Guard 3 generates a full classification token sequence. ProbeLogits is 2.5x faster and requires no learned parameters, only a calibration scalar (alpha). On ToxicChat, ProbeLogits achieves F1=0.812, matching Llama Guard 3's baseline, with some model-verbalizer pairs exceeding it by 4.4 percentage points.
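The speed gap follows from a simple cost model, which is an assumption of this sketch rather than a figure from the paper: a logit probe costs one forward pass, while a generative classifier costs roughly one pass per emitted classification token.

```rust
/// A probe reads logits after a single forward pass.
fn probe_latency_ms(pass_ms: f64) -> f64 {
    pass_ms
}

/// A generative classifier runs ~one forward pass per classification token
/// (assumed cost model; ignores KV-cache effects for simplicity).
fn generative_latency_ms(pass_ms: f64, tokens: u32) -> f64 {
    pass_ms * tokens as f64
}

fn main() {
    let pass_ms = 65.0; // the reported bare-metal probe latency
    let speedup = generative_latency_ms(pass_ms, 3) / probe_latency_ms(pass_ms);
    println!("speedup = {speedup}x"); // prints "speedup = 3x"
}
```

Under this model, the reported 2.5x speedup is consistent with a classifier that emits only a handful of tokens per verdict; the gap would widen for longer classification outputs.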
