Questions about Verifier Development, Search as a Data Generation Tool, and Model Family Alignment
Really fascinated by your research on test-time optimization! After reading through the implementation details, I have a few technical questions:
- Regarding PRMs and verifier strength:
  - How do current PRMs handle problems with multiple valid solution approaches where initial steps might look very different?
  - How are stronger verifiers currently being developed - is it primarily through diverse training data, or are there other architectural/methodological approaches?
- Could you expand on using search as a data generation tool? I'm particularly interested in:
  - How the search process might generate more diverse/higher-quality training examples
  - Whether the search-generated data could help improve verifier robustness
- Regarding model architecture:
  - How important is it for the PRM and base model to share the same architecture family?
  - Would using a PRM from a different model family significantly impact performance or model interaction?
Thanks for sharing such detailed insights into your approaches!
Hello @bird-of-paradise, thank you for the questions! Here are some partial answers:
How do current PRMs handle problems with multiple valid solution approaches where initial steps might look very different?
If you're scoring complete solutions with a PRM, you will generally find that solutions with both correct answers and correct reasoning are scored higher than solutions that reach a correct answer via incorrect reasoning (e.g. due to hallucinations). As long as the initial steps are valid, the PRM should score them highly, even if different approaches look quite different at the start.
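To make the scoring step concrete, here's a minimal sketch of how per-step PRM scores are typically aggregated into a single solution-level score. `prm_step_score` is a hypothetical stand-in for whatever your PRM checkpoint actually exposes (real PRMs return step probabilities in model-specific ways), and the product/min/last aggregations are the ones commonly compared in test-time compute experiments:

```python
from typing import Callable, List

def score_solution(
    question: str,
    steps: List[str],
    prm_step_score: Callable[[str, List[str]], float],
    aggregation: str = "prod",
) -> float:
    """Aggregate per-step PRM scores into a single solution-level score.

    `prm_step_score(question, steps_so_far)` is assumed to return the PRM's
    probability that the latest step in `steps_so_far` is correct.
    """
    step_scores = [prm_step_score(question, steps[: i + 1]) for i in range(len(steps))]
    if aggregation == "prod":
        # Product of step probabilities: any single weak step drags the score down.
        score = 1.0
        for s in step_scores:
            score *= s
        return score
    if aggregation == "min":
        # Minimum step score: the solution is only as strong as its weakest step.
        return min(step_scores)
    # "last": score of the final step, common when the PRM judges whole prefixes.
    return step_scores[-1]
```

With either aggregation, two structurally different but valid derivations can both end up with high scores, as long as each of their steps is individually supported.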
How are stronger verifiers currently being developed
This is still an under-explored topic, but the current best recipe is Math-Shepherd's, which uses MCTS to generate the stepwise annotations.
@plaguss has implemented this in distilabel if you're interested: https://distilabel.argilla.io/dev/sections/pipeline_samples/papers/math_shepherd/
Aside from that, better domain-specific base models are likely the key to having improved annotations.
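For anyone who hasn't read the Math-Shepherd paper yet, here's a hedged sketch of its automatic step-annotation idea (this is not the distilabel API; `sample_completions` and `extract_answer` are hypothetical stand-ins for a completer model and an answer parser):

```python
from typing import Callable, List

def annotate_steps(
    question: str,
    steps: List[str],
    gold_answer: str,
    sample_completions: Callable[[str, List[str], int], List[str]],
    extract_answer: Callable[[str], str],
    n_rollouts: int = 8,
) -> List[float]:
    """Math-Shepherd-style automatic step labels via Monte Carlo rollouts.

    For each prefix of the solution, a 'completer' model finishes the problem
    n_rollouts times; the fraction of completions that reach the gold answer
    becomes the step label (soft estimation). Thresholding at > 0 recovers
    the hard-estimation variant from the paper.
    """
    labels: List[float] = []
    for i in range(len(steps)):
        completions = sample_completions(question, steps[: i + 1], n_rollouts)
        n_correct = sum(extract_answer(c) == gold_answer for c in completions)
        labels.append(n_correct / n_rollouts)
    return labels
```

The resulting labels are then used to train the PRM, which is why a stronger completer / base model translates fairly directly into better annotations.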
How the search process might generate more diverse/higher-quality training examples
The basic idea here is that people typically generate synthetic datasets with a Best-of-N approach, so if you have a method that obtains better solutions (e.g. beam search), you get higher-quality data that can be used for SFT etc.
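As a simplified picture of that pipeline, the sketch below keeps the best of N sampled solutions (as judged by a verifier) as an SFT target; `generate` and `verifier_score` are hypothetical stand-ins. Swapping plain sampling for a verifier-guided search such as beam search is where the quality gain would come from:

```python
from typing import Callable, List, Tuple

def best_of_n_sft_example(
    prompt: str,
    generate: Callable[[str, int], List[str]],
    verifier_score: Callable[[str, str], float],
    n: int = 16,
) -> Tuple[str, str]:
    """Build one SFT (prompt, solution) pair via Best-of-N selection.

    `generate(prompt, n)` samples n candidate solutions and
    `verifier_score(prompt, solution)` scores each one (e.g. with a PRM);
    the highest-scoring candidate becomes the training target.
    """
    candidates = generate(prompt, n)
    best = max(candidates, key=lambda solution: verifier_score(prompt, solution))
    return prompt, best
```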
Whether the search-generated data could help improve verifier robustness
I'm not sure about this since the existing methods like Math-Shepherd already use search in the form of MCTS.
How important is it for the PRM and base model to share the same architecture family?
I don't think this matters too much, unless you're doing something like online training with RL, where it is common to unify the policy and reward model under the same architecture (and often to initialise them from the same weights).
Would using a PRM from a different model family significantly impact performance or model interaction?
The model family matters less; the more important factors are the quality of the base model and the training data.
Great questions @bird-of-paradise and thanks for the answers + helpful resources @lewtun
I would like to be a part of the HF community and previously contributed to BigScience. Please suggest similar initiatives @lewtun
Hi @lewtun, thank you for pointing to the Math-Shepherd paper! And thank you @plaguss for the implementation!
Reading through it, I've noticed some fascinating patterns that made me wonder about broader implications:
Both test-time optimization and Math-Shepherd achieve better results through more thorough exploration (beam search / multiple completions) rather than by requiring more training data. They developed PRMs without human-annotated intermediate steps, and their experiments suggest that the quality of reasoning might matter more than the quantity of examples (as shown in the completer experiments with different training sets). Is this move towards 'compute over data' a promising direction for improving reasoning and mathematical tasks?
The paper demonstrates MATH-SHEPHERD succeeding as both a reward model and a verifier. This made me wonder: could we push this further and use similar frameworks to improve a model's generation capabilities? I'm thinking of how AlphaGo used self-play, learning by evaluating its own moves. While there might be technical challenges (balancing multiple objectives, architecture constraints, training dynamics), could this kind of self-evaluation approach help develop stronger mathematical reasoning capabilities?
The core idea is whether focusing on developing strong evaluation/critical thinking capabilities might naturally lead to better problem-solving abilities, similar to how humans often learn mathematics through understanding why solutions work rather than just memorizing more examples.
Has there been any research exploring these directions? Or am I thinking about this in the wrong way?
Thank you again for your insights!
[edit]
Update after further reading:
I noticed the authors use the Math-Shepherd PRM as a reward model within a PPO training framework to improve LLMs' mathematical reasoning capabilities.
This made me wonder: Could we push this approach further by having the LLM internalize the PRM capabilities, similar to how humans develop both problem-solving and self-verification skills? While having separate models for solving and verification might prevent "cheating", it might also limit the development of true mathematical reasoning capabilities.
Some potential advantages of this approach:
- More efficient (single model vs two models)
- More similar to human learning processes
- Could lead to more robust mathematical understanding
- Potential for continuous self-improvement
Has anyone explored this direction of integrating PRM capabilities directly into the main model? What would be the main technical challenges to overcome?
Building on our earlier discussion about MATH-SHEPHERD and mathematical reasoning, I've developed these ideas further and would appreciate any technical insights or feedback on this direction.
Technical Framework:
- Internal Process Reward Model using PPO optimization
- Multiple completion mechanism for verification robustness
- Hierarchical curriculum learning for stable training progression
- Reward function based on completion success rates
The completion mechanism would (a rough sketch follows this list):
- Generate partial solutions
- Use multiple completions to assess solution quality
- Calculate rewards based on completion success patterns
- Update model through PPO training
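Here's a hedged sketch of what that reward could look like, mirroring the Math-Shepherd rollout idea but applied online as a reward rather than offline as a label. `sample_completions` and `extract_answer` are hypothetical helpers, and the same policy is assumed to produce both the partial solution and its completions, which is exactly where the solver-as-verifier questions below come from:

```python
from typing import Callable, List

def completion_success_reward(
    question: str,
    partial_solution: List[str],
    gold_answer: str,
    sample_completions: Callable[[str, List[str], int], List[str]],
    extract_answer: Callable[[str], str],
    k: int = 8,
) -> float:
    """Reward = fraction of the policy's own completions that reach the gold answer.

    This scalar would be attached to the PPO rollout that produced
    `partial_solution`. Because the policy also generates the k completions,
    it is effectively grading itself, so reward hacking and training
    instability need separate safeguards (e.g. a frozen completer or a
    held-out verifier).
    """
    completions = sample_completions(question, partial_solution, k)
    successes = sum(extract_answer(c) == gold_answer for c in completions)
    return successes / k
```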
Questions for the community:
- What challenges might arise when using MATH-SHEPHERD's completion-based verification approach for training the same model that generates the solutions?
- How can we ensure training stability when the model acts as both solver and verifier?
- Thoughts on curriculum structure for building mathematical reasoning capabilities?