AI · 5 min read · April 17, 2026
Rejection-Gated Policy Optimization replaces importance weighting with learned gates
RGPO, a new reinforcement learning method, selects trustworthy samples via differentiable gates during policy updates rather than reweighting all samples, reducing gradient variance and improving RLHF alignment of language models.
- Replaces importance sampling ratios with learned acceptance gates that filter samples during gradient computation (see the sketch after this list).
- Provides a unified framework in which TRPO, PPO, and REINFORCE emerge as special cases of particular gate-function choices.
- Bounds gradient variance even when importance ratios are heavy-tailed, a regime where standard importance sampling fails.
- Achieves higher reward and lower KL divergence than PPO-RLHF in Qwen2.5 fine-tuning experiments.
- Uses dual-ratio gating anchored to both the previous policy and the reference model for preference alignment.
- Matches PPO's computational cost and requires no second-order optimization.
- Incurs only a bounded, controllable bias while providing an approximate monotonic-improvement guarantee.
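The article does not reproduce RGPO's loss, so the PyTorch sketch below only illustrates the gating idea under stated assumptions: the sigmoid gate shape, the `gate_temp` temperature, and the way the two log-ratios enter the gate are hypothetical choices, not the published objective. The dual-ratio structure (previous policy plus reference model) follows the key points above.

```python
import torch

def rgpo_loss(logp_new, logp_old, logp_ref, advantages, gate_temp=5.0):
    """Illustrative gated policy-gradient loss (assumed form, not the paper's).

    logp_new:   log-probs of sampled actions under the current policy (requires grad)
    logp_old:   log-probs under the previous (behavior) policy, treated as constants
    logp_ref:   log-probs under the frozen reference model (the RLHF anchor)
    advantages: per-sample advantage estimates
    """
    # Dual importance ratios: one anchored to the previous policy,
    # one anchored to the reference model.
    ratio_old = torch.exp(logp_new - logp_old)

    # Differentiable acceptance gate in (0, 1): near 1 when both log-ratios
    # are near 0, decaying smoothly as either drifts. A sigmoid of the
    # squared log-ratios is one simple choice; the paper may use another,
    # and gate_temp could itself be a learnable nn.Parameter.
    drift = (logp_new - logp_old) ** 2 + (logp_new - logp_ref) ** 2
    gate = torch.sigmoid(gate_temp * (1.0 - drift))

    # The gate multiplies the whole per-sample term and participates in the
    # gradient, unlike PPO's clip, which is a fixed non-learned operation.
    return -(gate * ratio_old * advantages).mean()

# Toy usage with random per-sample log-probs:
logp_new = torch.randn(8, requires_grad=True)
loss = rgpo_loss(logp_new, torch.randn(8), torch.randn(8), torch.randn(8))
loss.backward()
```

In an RLHF loop this term would stand in for the PPO surrogate, with the gate parameters trained jointly with the policy.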
Frequently asked questions
- How does RGPO's learned gate differ from PPO's clipping? PPO clips importance ratios to a fixed range (e.g., [0.8, 1.2]) uniformly across all samples. RGPO learns a differentiable gate function that varies per sample based on its importance ratio, letting the optimizer adaptively decide which samples to trust. The gate participates in gradient computation, whereas PPO's clipping is a static heuristic applied to the ratio before gradients are taken.
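To make the "special cases" framing above concrete, here is a minimal sketch of how fixed gate choices recover familiar algorithms. The specific shapes (the [0.8, 1.2] interval, the sigmoid with temperature `tau`) are illustrative assumptions, not the paper's definitions, and TRPO, whose trust region would correspond to a gate on the policy KL rather than the per-sample ratio, is omitted.

```python
import torch

def gated_surrogate(ratio, advantages, gate):
    # Generic gated objective: the gate scales each sample's
    # ratio-weighted contribution to the policy gradient.
    return (gate(ratio) * ratio * advantages).mean()

# REINFORCE: on-policy, so the ratio is identically 1 and
# every sample is fully trusted.
reinforce_gate = lambda r: torch.ones_like(r)

# PPO (approximately): the clip is not literally a multiplicative gate,
# but its gradient behaves like a hard 0/1 gate dropping samples whose
# ratio leaves the trust interval (this ignores the advantage-sign
# asymmetry of the true min/clip objective).
ppo_gate = lambda r: ((r >= 0.8) & (r <= 1.2)).float()

# RGPO (assumed form): a smooth per-sample gate in (0, 1) that stays
# differentiable, so the degree of trust itself carries gradient signal.
rgpo_gate = lambda r, tau=10.0: torch.sigmoid(tau * (1.0 - torch.log(r) ** 2))
```

Viewed this way, the design choice is where the gate sits on the spectrum from fully open (REINFORCE) through hard and fixed (PPO) to smooth and learned (RGPO).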