arxiv:2502.16944

Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

Published on Feb 24 · Submitted by keanudicap on Feb 28

Abstract

Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires jointly training an actor and a critic with a pretrained, fixed reward model for guidance, which increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show that DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
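
The GVM pretraining described in the abstract can be illustrated with a short sketch. The snippet below is a hypothetical PyTorch illustration, not the authors' code: it assumes a fixed reward model that assigns a single scalar reward at the end of each response, so the token-level return-to-go target is just the discounted terminal reward; the class and function names are invented for this example.

```python
# Hypothetical sketch of token-level GVM pretraining (not the authors' implementation).
# Assumption: a fixed reward model scores each full response with one scalar reward r,
# so the return-to-go target at token t of a length-T response is gamma**(T-1-t) * r.
import torch
import torch.nn as nn

class GlobalValueModel(nn.Module):
    """Token-level value head over trajectory hidden states (backbone omitted here)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) encoding of a policy trajectory
        return self.value_head(hidden_states).squeeze(-1)  # token-level values (batch, seq_len)

def return_to_go_targets(seq_reward: torch.Tensor, seq_len: int, gamma: float = 1.0) -> torch.Tensor:
    """Return-to-go targets under the single-terminal-reward assumption above."""
    discounts = gamma ** torch.arange(seq_len - 1, -1, -1, dtype=seq_reward.dtype)
    return seq_reward.unsqueeze(-1) * discounts  # (batch, seq_len)

def gvm_pretraining_step(gvm, hidden_states, seq_reward, optimizer):
    """One regression step: fit predicted token values to return-to-go with an MSE loss."""
    targets = return_to_go_targets(seq_reward, hidden_states.size(1))
    loss = nn.functional.mse_loss(gvm(hidden_states), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```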

Community

Paper author and submitter

We tackle the computational and stability challenges of traditional PPO-based RLHF by introducing Decoupled Value Policy Optimization (DVPO). Our approach pretrains a Global Value Model (GVM) to predict token-level return-to-go values from policy trajectories, eliminating the need for joint actor-critic training while preserving fine-grained reward supervision. Theoretically, we show that without new reward feedback, pretraining a reward model is equivalent to pretraining a GVM. Experiments on benchmarks such as MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches state-of-the-art performance while reducing GPU memory usage and training time by approximately 40% and 35%, respectively.
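
For the policy side, a similarly hedged sketch of what a decoupled update could look like: the pretrained GVM is frozen and its token-level values drive a PPO-style clipped objective, with no critic loss and no critic update. The centered-value advantage used here is an illustrative choice, not a detail from the paper.

```python
# Hypothetical sketch of a DVPO-style policy step (not the authors' implementation).
# Assumptions: `gvm_values` are token-level values from the frozen, pretrained GVM;
# advantages are formed by centering those values, which is an illustrative choice only.
import torch

def dvpo_policy_step(logprobs_new, logprobs_old, gvm_values, optimizer, clip_eps: float = 0.2):
    """PPO-style clipped update driven by frozen GVM values; no value-function loss term."""
    with torch.no_grad():
        advantages = gvm_values - gvm_values.mean(dim=-1, keepdim=True)  # (batch, seq_len)
    ratio = torch.exp(logprobs_new - logprobs_old)                       # (batch, seq_len)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()  # only the policy is optimized
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```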
