AI · 8 min read · April 21, 2026

Simpler Optimizers Make LLM Unlearning More Robust

Research shows that using lower-order optimization methods during LLM unlearning produces forgetting that resists post-training attacks better than sophisticated gradient-based approaches.

Source: arxiv/cs.LG · Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu

Downgrading from advanced to simpler optimizers during LLM unlearning strengthens resistance to post-training manipulations.

  • LLM unlearning removes unwanted knowledge but remains fragile against weight quantization and fine-tuning.
  • Optimizer 'grade' (zeroth, first, second-order) directly affects how robust the unlearning becomes.
  • Zeroth-order and gradient-sign methods produce noisier updates that converge to harder-to-disturb loss landscape regions.
  • Noisy, imprecise updates paradoxically create more resilient forgetting than precise gradient-based methods.
  • Zeroth-order optimizers connect naturally to randomized smoothing, a known robustness technique.
  • A hybrid optimizer combining first- and zeroth-order updates preserves unlearning quality while improving resilience.
  • Validation on MUSE and WMDP benchmarks across multiple unlearning algorithms confirms the approach.
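To make the optimizer "grades" concrete, here is a minimal toy sketch (not the paper's implementation) of a two-point zeroth-order gradient estimate and a hybrid update that mixes it with an exact first-order gradient. The quadratic loss, the mixing weight `lam`, and all step sizes are illustrative assumptions.

```python
import numpy as np

def loss(w):
    # Toy stand-in for an unlearning objective (minimum at w = 1).
    return float(np.sum((w - 1.0) ** 2))

def grad(w):
    # Exact first-order gradient of the toy loss.
    return 2.0 * (w - 1.0)

def zo_grad(loss_fn, w, u, mu=1e-3):
    # Two-point zeroth-order estimate: a finite difference of the loss
    # along a random direction u, requiring no backpropagation.
    return (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2.0 * mu) * u

rng = np.random.default_rng(0)
w = np.zeros(4)
lam = 0.5  # hypothetical mixing weight between first- and zeroth-order terms
for _ in range(500):
    u = rng.standard_normal(w.shape)
    g = lam * grad(w) + (1.0 - lam) * zo_grad(loss, w, u)
    w -= 0.02 * g

print(f"final loss: {loss(w):.2e}")
```

The zeroth-order term injects direction-dependent noise into every step, which is the mechanism the paper credits with steering optimization toward flatter, harder-to-disturb basins.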

Frequently asked

  • Why do noisy updates produce more robust forgetting? Noisy updates converge to flatter, more stable regions of the loss landscape. These basins are harder to perturb because small changes in weights do not significantly alter the model's behavior. In contrast, precise gradient-based methods can converge to sharp minima that are easily disrupted by post-training fine-tuning or quantization.
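The flat-versus-sharp intuition can be shown with a one-dimensional toy example (an illustration, not from the paper): two basins share the same minimum, but the same weight perturbation costs far more loss in the sharp one. The curvature values and perturbation size are arbitrary choices.

```python
# Two toy loss basins with identical minima but different curvature.
sharp = lambda w: 10.0 * w ** 2   # where precise optimizers can land
flat = lambda w: 0.1 * w ** 2     # where noisy optimizers tend to land

# Simulate a post-training perturbation, e.g. quantization rounding error.
delta = 0.5
print(f"sharp basin loss after perturbation: {sharp(delta):.3f}")  # 2.500
print(f"flat basin loss after perturbation:  {flat(delta):.3f}")   # 0.025
```

A 100x curvature gap translates directly into a 100x gap in how much the perturbation degrades the unlearned state, which is why flatter basins resist quantization and fine-tuning attacks.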

Related