PRM-Math-7B-Reasoner - Process Reward Model

PRMs: Process Reward Models identify and mitigate intermediate errors in reasoning processes.

PRM-Math-7B-Reasoner is a fully reproducible model, fine-tuned on the Qwen2.5-Math-7B-PRM800K data and designed to identify erroneous steps in mathematical reasoning. The model computes step-level rewards: after each reasoning step, a special token "<extra_0>" is inserted, and the probability of that token being classified as positive is extracted, yielding a reward value between 0 and 1. It is primarily used for solution reformatting in mathematically driven tasks and as a Long Context Full Reasoner.
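As a minimal illustration of this scoring rule (toy logits, not real model output; the full runnable pipeline appears in the Transformers section below), the reward at one separator position is simply the positive-class softmax probability:

import torch
import torch.nn.functional as F

# Hypothetical two-class logits at a single "<extra_0>" position.
logits_at_separator = torch.tensor([-1.2, 2.3])  # [negative, positive]

# The step reward is the positive-class probability, always in (0, 1).
reward = F.softmax(logits_at_separator, dim=-1)[1].item()
print(f"step reward: {reward:.3f}")  # ~0.971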

ProcessBench: Paper

ProcessBench: Identifying Process Errors in Mathematical Reasoning (arXiv): https://arxiv.org/pdf/2412.06559

Reformatting Reasoning Intermediates

Example of Solution Reformatting

This example demonstrates the reformatting of a solution for finding the foci of an ellipse. The original solution is generated by Qwen2-7B-Instruct, and the reformatted version is presented for clarity.

Problem statement: The ellipse \(\frac{(x-6)^{2}}{25}+\frac{(y-3)^{2}}{9}=1\) has two foci. Find the one with the larger x-coordinate. Enter your answer as an ordered pair, like \((2,1)\).

Original solution: provided in a less readable format, with some syntax errors and unclear notation.

Reformatted solution: the steps are clearly laid out:
1. Identify the center of the ellipse, \((6,3)\).
2. Calculate the distance from the center to each focus using \(\sqrt{a^2 - b^2}\) (see the worked calculation after this list).
3. Determine the foci locations at \((2,3)\) and \((10,3)\).
4. Identify the focus with the larger x-coordinate as \((10,3)\).
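The arithmetic behind steps 2 and 3 follows directly from the ellipse equation: \(a^2 = 25\) and \(b^2 = 9\), so the focal distance is \(c = \sqrt{a^2 - b^2} = \sqrt{25 - 9} = \sqrt{16} = 4\). Since \(a^2\) sits under the \((x-6)^2\) term, the major axis is horizontal, and the foci lie at \((6 \pm 4, 3)\), i.e. \((2,3)\) and \((10,3)\).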

Hugging Face Transformers

import torch
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F


def make_step_rewards(logits, token_masks):
    # Softmax over the two classification labels at every sequence position.
    probabilities = F.softmax(logits, dim=-1)
    # Zero out every position that is not a step-separator token. (bs, seq_len, num_labels)
    probabilities = probabilities * token_masks.unsqueeze(-1)

    all_scores_res = []
    for i in range(probabilities.size(0)):
        sample = probabilities[i]  # (seq_len, num_labels)
        # Keep only the separator positions, reshape to (num_steps, 2),
        # and take the positive-class probability as each step's reward.
        positive_probs = sample[sample != 0].view(-1, 2)[:, 1]  # (num_steps,)
        all_scores_res.append(positive_probs.cpu().tolist())
    return all_scores_res


model_name = "prithivMLmods/PRM-Math-7B-Reasoner"
device = "auto"  # let device_map place the weights automatically

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()


data = {
    "system": "Please reason step by step, and put your final answer within \\boxed{}.",
    "query": "Sue lives in a fun neighborhood.  One weekend, the neighbors decided to play a prank on Sue.  On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard.  On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard.  Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?",
    "response": [
      "To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.",
      "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, (1/3 \\times 18 = 6) flamingos are taken back. So, they have (18 - 6 = 12) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has (12 + 6 = 18) pink flamingos and 6 white flamingos.",
      "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has (18 + 18 = 36) pink flamingos and still 6 white flamingos.",
      "To find the difference, subtract the number of white flamingos from the number of pink flamingos: (36 - 6 = 30). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is (\\boxed{30})."
    ]
}

# Join the solution steps with the step-separator token "<extra_0>";
# the model scores each step at its separator position.
messages = [
    {"role": "system", "content": data['system']},
    {"role": "user", "content": data['query']},
    {"role": "assistant", "content": "<extra_0>".join(data['response']) + "<extra_0>"},
]
conversation_str = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=False
)

input_ids = tokenizer.encode(
    conversation_str, 
    return_tensors="pt", 
).to(model.device)

outputs = model(input_ids=input_ids)

# Locate every "<extra_0>" separator in the encoded conversation and
# extract the per-step rewards at those positions.
step_sep_id = tokenizer.encode("<extra_0>")[0]
token_masks = (input_ids == step_sep_id)
step_reward = make_step_rewards(outputs[0], token_masks)
print(step_reward)  # [[0.9921875, 0.2333984375, 0.6796875, 0.94140625]]
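Beyond inspecting a single solution, step rewards can be used to rerank multiple candidate answers to the same problem. The sketch below is one possible aggregation, not part of this model's documented API: it scores a solution by its minimum step reward, so a single weak step (such as the 0.233 above) drags the whole solution down. The second candidate's rewards are hypothetical values for illustration.

def solution_score(step_rewards):
    # Score a solution by its weakest step; product or mean are
    # common alternatives.
    return min(step_rewards)

candidates = {
    "solution_a": [0.9921875, 0.2333984375, 0.6796875, 0.94140625],
    "solution_b": [0.98, 0.95, 0.97, 0.99],  # hypothetical second sample
}

best = max(candidates, key=lambda name: solution_score(candidates[name]))
print(best)  # solution_b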