arxiv:2502.16944

Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

Published on Feb 24 · Submitted by keanudicap on Feb 28

Abstract

Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires jointly training an actor and a critic with a pretrained, fixed reward model for guidance, which increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show that DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
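
The GVM pretraining described in the abstract can be illustrated with a short sketch. The snippet below is a hypothetical PyTorch illustration, not the authors' code: it assumes a fixed reward model that assigns a single scalar reward at the end of each response, so the token-level return-to-go target is just the discounted terminal reward; the class and function names are invented for this example.

```python
# Hypothetical sketch of token-level GVM pretraining (not the authors' implementation).
# Assumption: a fixed reward model scores each full response with one scalar reward r,
# so the return-to-go target at token t of a length-T response is gamma**(T-1-t) * r.
import torch
import torch.nn as nn

class GlobalValueModel(nn.Module):
    """Token-level value head over trajectory hidden states (backbone omitted here)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) encoding of a policy trajectory
        return self.value_head(hidden_states).squeeze(-1)  # token-level values (batch, seq_len)

def return_to_go_targets(seq_reward: torch.Tensor, seq_len: int, gamma: float = 1.0) -> torch.Tensor:
    """Return-to-go targets under the single-terminal-reward assumption above."""
    discounts = gamma ** torch.arange(seq_len - 1, -1, -1, dtype=seq_reward.dtype)
    return seq_reward.unsqueeze(-1) * discounts  # (batch, seq_len)

def gvm_pretraining_step(gvm, hidden_states, seq_reward, optimizer):
    """One regression step: fit predicted token values to return-to-go with an MSE loss."""
    targets = return_to_go_targets(seq_reward, hidden_states.size(1))
    loss = nn.functional.mse_loss(gvm(hidden_states), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```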

Community

Paper author and submitter

We tackle the computational and stability challenges of traditional PPO-based RLHF by introducing Decoupled Value Policy Optimization (DVPO). Our approach pretrains a Global Value Model (GVM) to predict token-level return-to-go values from policy trajectories, eliminating the need for joint actor-critic training while preserving fine-grained reward supervision. Theoretically, we show that without new reward feedback, pretraining a reward model is equivalent to pretraining a GVM. Experiments on benchmarks such as MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches state-of-the-art performance while reducing GPU memory usage and training time by approximately 40% and 35%, respectively.
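
For the policy side, a similarly hedged sketch of what a decoupled update could look like: the pretrained GVM is frozen and its token-level values drive a PPO-style clipped objective, with no critic loss and no critic update. The centered-value advantage used here is an illustrative choice, not a detail from the paper.

```python
# Hypothetical sketch of a DVPO-style policy step (not the authors' implementation).
# Assumptions: `gvm_values` are token-level values from the frozen, pretrained GVM;
# advantages are formed by centering those values, which is an illustrative choice only.
import torch

def dvpo_policy_step(logprobs_new, logprobs_old, gvm_values, optimizer, clip_eps: float = 0.2):
    """PPO-style clipped update driven by frozen GVM values; no value-function loss term."""
    with torch.no_grad():
        advantages = gvm_values - gvm_values.mean(dim=-1, keepdim=True)  # (batch, seq_len)
    ratio = torch.exp(logprobs_new - logprobs_old)                       # (batch, seq_len)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()  # only the policy is optimized
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```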
