DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge

Community Article Published February 7, 2025

1. Introduction

In Reinforcement Learning (RL), simply knowing “how many points you score” often isn’t enough. Pursuing high scores alone can lead to various side effects, such as excessive exploration, instability in the model, or even “shortcutting” behaviors that deviate from reasonable policies. To address these challenges, RL incorporates several mechanisms—such as the Critic (value function), Clip operation, Reference Model, and the more recent Group Relative Policy Optimization (GRPO).

To make these concepts more intuitive, let’s draw an analogy: think of the RL training process as an elementary school exam scenario. We (the model being trained) are like students trying to get high grades, the teacher who grades our exams is like the reward model, and our dad, who hands out pocket money based on our grades, plays a role similar to the Critic. Next, let’s walk step by step through why final scores alone are insufficient, how the Critic, Clip, and Reference Model come into play, and finally how GRPO extends these ideas.


2. The Naive Approach of Only Using Reward: What’s the Problem?

Suppose my younger brother and I are in the same elementary school class. The teacher grades our exams and gives an “absolute score.” I typically score above 80 out of 100, while my brother often gets around 30. We then take these scores directly to our dad to ask for pocket money—meaning our “reward” (in RL terms) is simply our raw exam score. Whoever gets a higher score receives more pocket money.

At first glance, that seems fine. But two big issues quickly arise:

  • Unfairness: If my brother improves from 30 to 60 points through a lot of hard work, he still pales in comparison to my usual 80+. He doesn’t get the encouragement he deserves.
  • Instability: Chasing higher scores myself could lead me to extreme study methods (e.g., cramming at all hours, staying up very late). Sometimes I might get 95, other times only 60, so my score—and hence the reward signal—fluctuates dramatically.

As a result, using absolute scores as Reward causes large reward fluctuations, and my brother ends up feeling it’s not worth trying to improve in small increments.

Mathematical Correspondence

In RL, if we simply do:

$$\mathcal{J}_{\text{naive}}(\theta) = \mathbb{E}_{(q, o) \sim (\text{data},\, \pi_{\theta})}\big[r(o)\big],$$

which means “optimize only the final reward,” we can run into high variance and insufficient incentives for partial improvements. In other words, the Actor lacks a baseline that matches its own current level, and that hinders training efficiency.
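To make this concrete, here is a minimal sketch of what "optimize only the final reward" looks like in code. It assumes a PyTorch-like setting; the function name and tensor shapes are illustrative, not from the original article.

```python
import torch

def naive_policy_loss(logprobs, rewards):
    """REINFORCE-style loss that scales each sampled output's log-probability
    by its raw reward, with no baseline at all.

    logprobs: (batch,) summed log-probs of each sampled output o under pi_theta
    rewards:  (batch,) raw reward r(o) for each output (the "absolute exam score")
    """
    # Maximizing E[r(o) * log pi_theta(o)] == minimizing its negative.
    return -(rewards * logprobs).mean()
```

Because every output is weighted by its raw score, a consistently high scorer and a slowly improving one receive very different gradient signals, and the variance of those signals can be large.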


3. Introducing the Critic: Using a “Predicted Score Line” to Improve Rewards

Recognizing this problem, Dad realizes that “it’s not just about the absolute score; it’s about how much you’ve improved relative to your current level.”

So he decides:

  • Set my “predicted score line” at 80 points and my brother’s at 40. If we exceed these lines on an exam, we get more pocket money; if not, we get very little or none.

Hence, if my brother works hard and jumps from 30 to 60, he’s 20 points above his “predicted score line,” which translates into a hefty reward. Meanwhile, if I remain around the 80s, the incremental gain is smaller, so I won’t necessarily receive much more than he does. This arrangement encourages each person to improve from their own baseline instead of purely comparing absolute scores.

Of course, Dad is busy, and a score line can’t simply be set once and left static—he needs to keep readjusting it as we progress. If my brother levels up to the 60-point range, a 40-point baseline is no longer fair. Likewise, if I consistently hover around 85, Dad might need to tweak my line as well. In other words, Dad also has to learn, specifically about the pace at which my brother and I are improving.

Mathematical Correspondence

In RL, this “score line” is known as the value function, $V_{\psi}(s)$. It acts as a baseline. Our training objective evolves from “just reward” to “how much we outperform that baseline,” expressed by the Advantage:

$$A_t = r_t - V_{\psi}(s_t).$$

For a given state $s_t$ and action $o_t$, if the actual reward exceeds the Critic’s expectation, it means the action performed better than predicted. If it’s lower, that action underperformed. In the simplest formulation, we optimize something like:

$$\mathcal{J}_{\text{adv}}(\theta) = \mathbb{E}\big[A(o)\big], \quad \text{where } A(o) = r(o) - V_{\psi}(o).$$

By subtracting this “score line,” we reduce variance in training, giving higher gradient signals to actions that exceed expectations and penalizing those that fall short.
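A minimal sketch of this baseline-subtracted update, again assuming a PyTorch-like setting; the names `advantage_policy_loss`, `values`, etc. are illustrative placeholders, not the article’s own implementation.

```python
import torch

def advantage_policy_loss(logprobs, rewards, values):
    """Policy loss with a Critic baseline: A = r - V(s).

    logprobs: (batch,) log-probs of the sampled outputs under pi_theta
    rewards:  (batch,) observed rewards (exam scores)
    values:   (batch,) Critic predictions V_psi(s) (the "predicted score line")
    """
    # Stop gradients through the baseline when weighting the policy update.
    advantages = rewards - values.detach()
    policy_loss = -(advantages * logprobs).mean()
    # The Critic is trained separately to track observed rewards,
    # i.e. Dad keeps readjusting the score line as we improve.
    value_loss = torch.nn.functional.mse_loss(values, rewards)
    return policy_loss, value_loss
```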

4. Adding Clip and Min Operations: Preventing Over-Updates

Even with the “score line,” new problems can emerge. For instance:

  • If I suddenly break through on a test and score 95 or 100, Dad might give me a huge reward, pushing me to adopt overly aggressive study patterns before the next exam. My grades might swing between extremes (95 and 60), causing massive reward volatility.

Thus, Dad decides to moderate how drastically I can update my study strategy in each step—he won’t give me exponentially more pocket money just because of one good test. If he gives too much, I might veer into extreme exploration; if too little, I won’t be motivated. So he must find a balance.

Mathematical Correspondence

In PPO (Proximal Policy Optimization), this balance is achieved through the “Clip” mechanism. The core of the PPO objective includes:

$$\min \Big(r_t(\theta)\, A_t,\ \text{clip}\big(r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, A_t\Big),$$

where

$$r_t(\theta) = \frac{\pi_{\theta}(o_t \mid s_t)}{\pi_{\theta_{\text{old}}}(o_t \mid s_t)},$$

represents the probability ratio between the new and old policies for that action. If the ratio deviates too far from 1, it’s clipped within $[1-\varepsilon,\ 1+\varepsilon]$, which limits how much the policy can shift in one update.

In simpler terms:

  • Scoring 100 gets me extra rewards, but Dad imposes a “ceiling” so I don’t go overboard. He’ll then reassess on the next exam, maintaining a steady approach rather than fueling extreme fluctuations.
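A short sketch of the clipped surrogate described above, under the same illustrative PyTorch assumptions as the earlier snippets (the function name and the default `eps=0.2` are choices for the example, not prescribed by the article).

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps=0.2):
    """PPO clipped surrogate: limits how far one update can move the policy."""
    ratio = torch.exp(new_logprobs - old_logprobs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the more pessimistic of the two terms, then negate (we minimize).
    return -torch.min(unclipped, clipped).mean()
```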

5. Reference Model: Preventing Cheating and Extreme Strategies

Even so, if I’m solely fixated on high scores, I might resort to questionable tactics—for instance, cheating or intimidating the teacher into awarding me a perfect score. Clearly, that breaks all rules. In the realm of large language models, an analogous scenario is producing harmful or fabricated content to artificially boost some reward metric.

Dad, therefore, sets an additional rule:

  • “No matter what, you can’t deviate too much from your original, honest approach to studying. If you’re too far off from your baseline, even with a high score, I’ll disqualify you and withhold your pocket money.”

That’s akin to marking down a “reference line” from the start of the semester (i.e., after initial supervised fine-tuning). You can’t stray too far from that original strategy or you face penalties.

Mathematical Correspondence

In PPO, this is reflected by adding a KL penalty against the Reference Model (the initial policy). Concretely, we include something like:

$$-\beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\text{ref}}\big)$$

in the loss. This keeps the Actor from drifting too far from the original, sensible policy, avoiding “cheating” or other drastically out-of-bounds behaviors.
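As a rough illustration of how such a penalty might be computed in practice, here is a sketch using the common sample-based approximation of the KL term; the function name, the approximation itself, and the default `beta` are assumptions for this example.

```python
import torch

def kl_penalty(policy_logprobs, ref_logprobs, beta=0.02):
    """Penalty keeping pi_theta close to the frozen reference policy.

    Uses the sample-based approximation KL ~= log pi_theta(o) - log pi_ref(o),
    averaged over sampled tokens; beta controls how hard drifting is punished.
    """
    approx_kl = (policy_logprobs - ref_logprobs).mean()
    return beta * approx_kl
```

This term is added to the loss alongside the clipped surrogate, so a policy that scores highly by drifting far from the reference still pays a price.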

6. GRPO: Replacing the Value Function with “Multiple Simulated Averages”

One day, Dad says, “I don’t have time to keep assessing your learning progress and draw new score lines all the time. Why not do five sets of simulated tests first, then take their average score as your expected score? If you surpass that average on the real test, it shows you did better than your own expectations, so I’ll reward you. Otherwise, you won’t get much.” My brother and I, and potentially more classmates, can each rely on a personal set of simulated tests rather than an external “value network” that Dad would have to constantly adjust.

Up until now, we saw that PPO relies on the Actor + Critic + Clip + KL penalty framework. However, in large language model (LLM) scenarios, the Critic (value function) often needs to be as large as the Actor to accurately evaluate states, which can be costly and sometimes impractical—especially if you only have a single final reward at the end (like a final answer quality).

Hence, Group Relative Policy Optimization (GRPO) steps in. Its core idea:

  • No separate value network for the Critic,
  • Sample multiple outputs from the old policy for the same question or state,
  • Treat the average reward of these outputs as the baseline,
  • Anything above average yields a “positive advantage,” anything below yields a “negative advantage.”

Meanwhile, GRPO retains PPO’s Clip and KL mechanisms to ensure stable, compliant updates.

Mathematical Correspondence

According to DeepSeekMath’s technical report, the GRPO objective (omitting some symbols) is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\Bigg[\sum_{i=1}^{G}\Bigg(\min\Bigg(\frac{\pi_{\theta}(o_{i})}{\pi_{\theta_{\text{old}}}(o_{i})}\, A_{i},\ \text{clip}\Big(\frac{\pi_{\theta}(o_{i})}{\pi_{\theta_{\text{old}}}(o_{i})},\, 1-\varepsilon,\, 1+\varepsilon\Big) A_{i}\Bigg) - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\text{ref}}\big)\Bigg)\Bigg],
$$

where

$$A_{i} = \frac{r_{i} - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}$$

computes a “relative score” by normalizing each output’s reward against the mean and standard deviation of the group of outputs sampled for the same question. In this way, we no longer need a dedicated value function, yet we still get a dynamic “score line,” which simplifies training and conserves resources.
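A minimal sketch of this group-relative advantage, under the same illustrative PyTorch assumptions as before (the function name and the small `eps` added for numerical safety are choices for the example):

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages A_i = (r_i - mean(r)) / std(r).

    group_rewards: (G,) rewards for the G outputs sampled from the old policy
                   for the same question; no value network is needed.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)   # A_i, the "relative score"

# These A_i then plug into the same clipped surrogate (plus the KL penalty)
# sketched in the PPO sections above, one term per sampled output o_i.
```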

7. Conclusion: Reflection and Future Prospects

Using the elementary school exam analogy, we’ve moved step by step from raw absolute scores to PPO’s full mechanism (Critic, Advantage, Clip, Reference Model), and finally to GRPO (leveraging multiple outputs’ average scores to eliminate the value function). Below are some key takeaways:

  • Role of the Critic: Provides a “reasonable expectation” for each state, significantly reducing training variance.
  • Clip & min Mechanism: Constrains the update magnitude, preventing overreacting to a single “breakthrough” exam.
  • Reference Model: Discourages “cheating” or extreme deviations, ensuring the policy remains reasonably aligned with its initial state.
  • Advantages of GRPO: In large language models, it removes the need for a separate value network, reducing memory and compute costs while aligning well with “comparative” Reward Model designs.

Much like how Dad switched to “let the kids simulate multiple exams themselves, then treat their average as the baseline,” GRPO avoids maintaining a massive Critic while still offering a relative reward signal. It preserves the stability and compliance features of PPO but streamlines the process.

I hope this article helps you naturally grasp PPO and GRPO. In practice, if you’re interested in topics like Process Supervision or Iterative RL, keep an eye on my blog for more updates.
