AI · 5 min read · April 17, 2026
Rejection-Gated Policy Optimization replaces importance weighting with learned gates
RGPO, a new reinforcement learning method, selects trustworthy samples via differentiable gates during policy updates rather than reweighting all samples, reducing gradient variance and improving RLHF alignment of language models.
- Replaces importance sampling ratios with learned acceptance gates that filter samples during gradient computation (see the sketch after this list).
- Provides a unified framework in which TRPO, PPO, and REINFORCE emerge as special cases of particular gate-function choices.
- Bounds gradient variance even when importance ratios are heavy-tailed, a regime where standard importance sampling fails.
- Achieves higher reward and lower KL divergence than PPO-RLHF in Qwen2.5 fine-tuning experiments.
- Uses dual-ratio gating anchored to both the previous policy and the reference model for preference alignment.
- Matches PPO's computational cost and requires no second-order optimization.
- Incurs only a bounded, controllable bias while providing an approximate monotonic-improvement guarantee.
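The article does not reproduce RGPO's loss, so the PyTorch sketch below only illustrates the gating idea under stated assumptions: the sigmoid gate shape, the `gate_temp` temperature, and the way the two log-ratios enter the gate are hypothetical choices, not the published objective. The dual-ratio structure (previous policy plus reference model) follows the key points above.

```python
import torch

def rgpo_loss(logp_new, logp_old, logp_ref, advantages, gate_temp=5.0):
    """Illustrative gated policy-gradient loss (assumed form, not the paper's).

    logp_new:   log-probs of sampled actions under the current policy (requires grad)
    logp_old:   log-probs under the previous (behavior) policy, treated as constants
    logp_ref:   log-probs under the frozen reference model (the RLHF anchor)
    advantages: per-sample advantage estimates
    """
    # Dual importance ratios: one anchored to the previous policy,
    # one anchored to the reference model.
    ratio_old = torch.exp(logp_new - logp_old)

    # Differentiable acceptance gate in (0, 1): near 1 when both log-ratios
    # are near 0, decaying smoothly as either drifts. A sigmoid of the
    # squared log-ratios is one simple choice; the paper may use another,
    # and gate_temp could itself be a learnable nn.Parameter.
    drift = (logp_new - logp_old) ** 2 + (logp_new - logp_ref) ** 2
    gate = torch.sigmoid(gate_temp * (1.0 - drift))

    # The gate multiplies the whole per-sample term and participates in the
    # gradient, unlike PPO's clip, which is a fixed non-learned operation.
    return -(gate * ratio_old * advantages).mean()

# Toy usage with random per-sample log-probs:
logp_new = torch.randn(8, requires_grad=True)
loss = rgpo_loss(logp_new, torch.randn(8), torch.randn(8), torch.randn(8))
loss.backward()
```

In an RLHF loop this term would stand in for the PPO surrogate, with the gate parameters trained jointly with the policy.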
Frequently asked questions
- How does RGPO's learned gate differ from PPO's clipping? PPO clips importance ratios to a fixed range (e.g., [0.8, 1.2]) uniformly across all samples. RGPO learns a differentiable gate function that varies per sample based on its importance ratio, letting the optimizer adaptively decide which samples to trust. The gate participates in gradient computation, whereas PPO's clipping is a static heuristic applied to the ratio before gradients are taken.
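To make the "special cases" framing above concrete, here is a minimal sketch of how fixed gate choices recover familiar algorithms. The specific shapes (the [0.8, 1.2] interval, the sigmoid with temperature `tau`) are illustrative assumptions, not the paper's definitions, and TRPO, whose trust region would correspond to a gate on the policy KL rather than the per-sample ratio, is omitted.

```python
import torch

def gated_surrogate(ratio, advantages, gate):
    # Generic gated objective: the gate scales each sample's
    # ratio-weighted contribution to the policy gradient.
    return (gate(ratio) * ratio * advantages).mean()

# REINFORCE: on-policy, so the ratio is identically 1 and
# every sample is fully trusted.
reinforce_gate = lambda r: torch.ones_like(r)

# PPO (approximately): the clip is not literally a multiplicative gate,
# but its gradient behaves like a hard 0/1 gate dropping samples whose
# ratio leaves the trust interval (this ignores the advantage-sign
# asymmetry of the true min/clip objective).
ppo_gate = lambda r: ((r >= 0.8) & (r <= 1.2)).float()

# RGPO (assumed form): a smooth per-sample gate in (0, 1) that stays
# differentiable, so the degree of trust itself carries gradient signal.
rgpo_gate = lambda r, tau=10.0: torch.sigmoid(tau * (1.0 - torch.log(r) ** 2))
```

Viewed this way, the design choice is where the gate sits on the spectrum from fully open (REINFORCE) through hard and fixed (PPO) to smooth and learned (RGPO).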