Seeking Clarification on GRPO's Core Mechanisms (for Independent Implementation)

#18 — opened by bird-of-paradise

Hi everyone! I'm working on implementing mathematical reasoning in LLMs from scratch, and I've been studying the DeepSeek-Math paper to learn from their approach. Their GRPO method is brilliant, and I have some questions about its core mechanisms that I'm hoping the community can help clarify:

  1. Policy Model Interactions:
    In Algorithm 1, outputs $\{o_i\}$ are sampled from $\pi_{old}$ (Step 7), but the gradient equation (20) and the GRPO objective (21) involve three different policies:
  • $\pi_{old}$ for sampling
  • $\pi_{ref}$ for KL divergence
  • $\pi_\theta$ for optimization

Q: How do these different policy roles interact during the $\mu$ inner iterations in Step 10? Since we're not generating new outputs during those iterations, how do we ensure the advantages are still driving improvement? (My current reading is sketched below.)
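
To make my mental model explicit, here is a minimal PyTorch-style sketch of how I currently read the per-token objective in (21). The function name, argument shapes, and default hyperparameters are my own assumptions, not anything from the paper or its code:

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """Per-token GRPO surrogate, as I currently understand Eq. (21).

    logp_new   : log pi_theta(o_t | q, o_<t)  -- requires grad, recomputed each inner step
    logp_old   : log pi_old(o_t | q, o_<t)    -- fixed, from the sampling policy (Step 7)
    logp_ref   : log pi_ref(o_t | q, o_<t)    -- fixed, frozen reference model
    advantages : group-relative A_hat_{i,t}, fixed during the inner iterations
    """
    ratio = torch.exp(logp_new - logp_old)                   # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)

    # KL penalty against pi_ref, using the unbiased estimator quoted in the paper:
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    return (surrogate - beta * kl).mean()
```

In this reading, only `logp_new` carries gradients; `logp_old`, `logp_ref`, and the advantages are computed once from the sampled outputs and stay fixed across the $\mu$ inner iterations, which is part of what I'd like confirmed.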

  2. Group Relativity Implementation:
    The algorithm name emphasizes "Relative," but:
  • Rewards are normalized within groups of outputs from the same policy ($\pi_{old}$)
  • No direct comparison between old and new policy rewards
  • KL divergence is computed against $\pi_{ref}$, not $\pi_{old}$

Q: What specifically makes this approach "relative" compared to standard PPO? Is it just the group-wise reward normalization (as in the sketch below)?
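
For concreteness, this is the group-wise normalization I'm referring to (outcome supervision case); a minimal sketch, where the epsilon guard is my own addition:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: shape (G,), one scalar reward per output o_1..o_G sampled
    from pi_old for the same question. eps is a numerical-stability guard."""
    normalized = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Outcome supervision: every token of output i shares the same advantage value.
    return normalized
```

For example, `group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))` gives positive advantages to the rewarded outputs and negative advantages to the others, so each output is only judged relative to its own group.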

  3. Token vs Step-level Advantage:
    The paper maintains token-level notation $\hat{A}_{i,t}$ throughout, but particularly for process supervision (4.1.3):

$$\hat{A}_{i,t} = \sum_{\mathrm{index}(j) \ge t} \tilde{r}_i^{\mathrm{index}(j)}$$

Given that:

  1. Mathematical reasoning naturally occurs at the step level
  2. The formula actually sums over steps (index(j))
  3. Understanding math requires comprehending complete steps rather than individual tokens

Q: What was the reasoning behind maintaining token-level notation ($\hat{A}_{i,t}$) rather than step-level notation ($\hat{A}_{i,\text{step}}$)? Does this choice have practical implications for implementation, or is it primarily notational? (My current token-level implementation is sketched below.)
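
To show where the question comes from, this is how I currently turn step-level rewards into the token-level $\hat{A}_{i,t}$ above; the way I map $\mathrm{index}(j)$ to token positions is my own assumption:

```python
import torch

def process_supervision_advantages(step_rewards_norm, step_end_indices, seq_len):
    """Token-level advantages from step-level rewards (process supervision).

    step_rewards_norm : list of normalized step rewards r~ for the K steps of output i
    step_end_indices  : list of the token index at which each step ends, i.e. index(j)
    seq_len           : number of tokens in output i
    """
    adv = torch.zeros(seq_len)
    for r, end in zip(step_rewards_norm, step_end_indices):
        # Step j contributes r~^{index(j)} to every token t with t <= index(j).
        adv[: end + 1] += r
    return adv
```

Written this way, the advantage is constant within a step and only changes at step boundaries, which is why I'm wondering whether the token-level notation is essentially equivalent to a step-level one.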

  4. Implementation Details:
    For Step 10's multiple iterations (my assumed loop structure is sketched after this list):
  • What components of the model are actually being updated?
  • How do you determine the optimal number of inner iterations ($\mu$)?
  • How do you validate improvement without generating new outputs?
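
For context on what I assume is being updated, here is the loop structure I'm currently imagining for Steps 7–10. The `log_probs` helper, the `advantages` field, and the optimizer handling are hypothetical names of mine, and `grpo_objective` refers to the sketch under my first question:

```python
import copy
import torch

def grpo_inner_loop(policy, ref_model, batch, optimizer, mu=4):
    """My assumed structure for one outer step: sample once, then mu gradient updates.

    `batch` is assumed to already hold the outputs sampled from pi_old in Step 7.
    Only the parameters of `policy` (pi_theta) are updated; pi_old is a frozen
    snapshot taken at sampling time, and pi_ref never changes at all.
    """
    old_model = copy.deepcopy(policy).eval()      # pi_old: frozen sampling policy

    with torch.no_grad():
        logp_old = old_model.log_probs(batch)     # hypothetical helper: per-token log-probs
        logp_ref = ref_model.log_probs(batch)
    advantages = batch["advantages"]              # group-relative advantages, held fixed

    for _ in range(mu):                           # Step 10: no new sampling here
        logp_new = policy.log_probs(batch)        # recomputed with the current pi_theta
        loss = -grpo_objective(logp_new, logp_old, logp_ref, advantages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

If this structure is roughly right, then "validating improvement" inside the inner loop would have to rely on the surrogate objective rather than on fresh rewards, which is exactly what I'd like to confirm.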

I'm particularly interested in understanding these aspects as they're crucial for implementing similar approaches, even if at a smaller scale. Any insights would be greatly appreciated!
