Qwen-0.5B-GRPO: A Fine-Tuned Math Reasoner

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct trained with GRPO (Group Relative Policy Optimization). It was trained on the GSM8K math dataset to improve its ability to generate step-by-step reasoning for math problems, following a structured output format with explicit <reasoning> and <answer> sections.
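
For illustration, a well-formed response is expected to look roughly like the following (the wording here is invented for this card; only the tag structure is prescribed by training):

<reasoning>
Natalia sold 48 clips in April and half as many, 48 / 2 = 24, in May, so she sold 48 + 24 = 72 clips in total.
</reasoning>
<answer>
72
</answer>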

Model Details

Model Description

Qwen-0.5B-GRPO is designed to serve as a lightweight math reasoning assistant. By fine-tuning with reinforcement learning using GRPO, the model learns to produce responses that include both intermediate reasoning and final answers. Key adaptations include:

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct

  • Fine-Tuning Method: GRPO (reinforcement learning with custom reward functions; a training sketch follows this list)

  • Dataset: GSM8K – a collection of challenging grade-school math problems

  • Generation Engine: Utilizes vLLM for faster inference on a single GPU setup

  • Precision: BF16 training for efficiency on Colab GPUs

  • Developed by: Davut Emre Taşar

  • License: Please refer to the license of the base model on its Hugging Face Hub page
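
The exact training script is not published in this card. The sketch below shows how a comparable GRPO setup could be written with trl's GRPOTrainer; the reward functions, hyperparameters, and dataset preprocessing are illustrative assumptions, not the recipe used to train this model.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K solutions end with "#### <number>"; keep only that final number as the target.
def extract_gsm8k_answer(solution):
    return solution.split("####")[-1].strip()

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {
    "prompt": x["question"],
    "target": extract_gsm8k_answer(x["answer"]),
})

# Illustrative correctness reward: 1.0 when the text inside <answer> exactly matches the target.
def correctness_reward(completions, target, **kwargs):
    rewards = []
    for completion, reference in zip(completions, target):
        predicted = completion.split("<answer>")[-1].split("</answer>")[0].strip()
        rewards.append(1.0 if predicted == reference else 0.0)
    return rewards

# Illustrative format reward: small bonus when both tag pairs are present.
def format_reward(completions, **kwargs):
    return [
        0.5 if all(tag in c for tag in ("<reasoning>", "</reasoning>", "<answer>", "</answer>")) else 0.0
        for c in completions
    ]

training_args = GRPOConfig(
    output_dir="Qwen-0.5B-GRPO",
    num_train_epochs=1,          # a single epoch, as noted under Limitations
    per_device_train_batch_size=8,
    num_generations=8,           # completions sampled per prompt (illustrative)
    max_completion_length=256,   # illustrative
    bf16=True,                   # BF16 training
    use_vllm=True,               # vLLM-backed generation on a single GPU
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[format_reward, correctness_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()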

Uses

Intended Use

This model is intended for educational and research purposes, particularly to demonstrate and support math problem solving with clear, step-by-step reasoning. It is well-suited for:

  • Generating structured explanations for math problems.
  • Serving as a lightweight assistant in educational applications focused on math reasoning.

Out-of-Scope Use

  • High-Stakes Decision Making: This model is not designed for critical decision making.
  • Non-Math Domains: Its performance is tailored to math problems; performance on other domains may be limited.
  • Over-Reliance on Automated Reasoning: The reward functions used during fine-tuning (e.g., exact string matching) may not capture all nuances, so human oversight is recommended.

Bias, Risks, and Limitations

  • Model Size: With only 0.5B parameters, it may not perform as robustly as larger models.
  • Training Duration: Fine-tuning was performed for a single epoch; further training might be needed for more challenging tasks.
  • Reward Function Limitations: The custom reward functions (checking for correct formatting and numerical correctness) are heuristic and may occasionally miss subtleties in reasoning; a concrete example follows this list.
  • Generalization: The structured format (with <reasoning> and <answer> tags) is enforced during training and may require adaptation for other use cases.
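
As an illustration of that limitation, an exact string comparison on the extracted answer (the kind of check described above) rejects outputs that are numerically correct but formatted differently:

reference = "72"
for predicted in ["72", "72.0", "$72", "72 clips"]:
    matched = predicted == reference
    print(predicted, "->", "rewarded" if matched else "not rewarded")
# Only the first form is rewarded, although all four answers are numerically correct.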

Recommendations

Users should:

  • Validate model outputs on a case-by-case basis.
  • Consider further fine-tuning for domain-specific applications.
  • Use the model as a supplementary tool rather than the sole resource for critical math reasoning tasks.

How to Get Started with the Model

Below is an example code snippet to load and use the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "emre/Qwen-0.5B-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

# Ask a grade-school math question; the model is trained to reply with
# <reasoning> and <answer> sections.
messages = [{"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
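
To consume the output programmatically, the two sections can be pulled out with a small parser. This is a sketch that assumes the completion contains well-formed <reasoning> and <answer> tags, which the model does not strictly guarantee:

import re

def parse_sections(text):
    # Return (reasoning, answer); either is None when its tag pair is missing.
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, answer = parse_sections(completion)
print("Answer:", answer)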