Qwen-0.5B-GRPO: A Fine-Tuned Math Reasoner
This model is a fine-tuned version of the Qwen 0.5B model (based on Qwen/Qwen2.5-0.5B-Instruct) trained with GRPO (Group Relative Policy Optimization). It has been trained on the GSM8K math dataset to improve its ability to generate step-by-step reasoning for math problems, following a structured output format with explicit `<reasoning>` and `<answer>` sections.
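For illustration, a response in the trained format is expected to look like the following (the problem and numbers here are made up for the example):

```
<reasoning>
The train travels at 60 miles per hour for 3 hours, so it covers 60 * 3 = 180 miles.
</reasoning>
<answer>
180
</answer>
```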
Model Details
Model Description
Qwen-0.5B-GRPO is designed to serve as a lightweight math reasoning assistant. By fine-tuning with reinforcement learning using GRPO, the model learns to produce responses that include both intermediate reasoning and final answers. Key adaptations include:
Base Model: Qwen/Qwen2.5-0.5B-Instruct
Fine-Tuning Method: GRPO (reinforcement learning with custom reward functions; a training sketch follows this list)
Dataset: GSM8K – a collection of challenging grade-school math problems
Generation Engine: Utilizes vLLM for faster inference on a single GPU setup
Precision: BF16 training for efficiency on Colab GPUs
Developed by: Davut Emre Taşar
License: Please refer to the license of the base model on its Hugging Face Hub page
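The original training script is not reproduced in this card. As a rough orientation, the sketch below shows how a comparable GRPO run could be wired up with TRL's `GRPOTrainer`; the reward function, hyperparameters, and output directory are illustrative assumptions rather than the values actually used.

```python
# Minimal sketch of a comparable GRPO training setup with TRL's GRPOTrainer.
# This is NOT the original training script: the reward function, hyperparameters,
# and output directory below are illustrative assumptions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K "main" configuration; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def correctness_reward(completions, answer, **kwargs):
    # Reward completions that contain the reference answer (the number after
    # GSM8K's "####" marker). A fuller sketch of format/correctness rewards
    # appears under Bias, Risks, and Limitations.
    refs = [a.split("####")[-1].strip() for a in answer]
    return [1.0 if r in c else 0.0 for c, r in zip(completions, refs)]

training_args = GRPOConfig(
    output_dir="Qwen-0.5B-GRPO",
    bf16=True,           # BF16 training, as noted above
    use_vllm=True,       # vLLM-backed generation during training
    num_train_epochs=1,  # a single epoch, as noted under limitations
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[correctness_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```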
Model Sources
- Repository (this model): https://huggingface.co./emre/Qwen-0.5B-GRPO
- Base Model Repository: https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct
- Dataset: https://huggingface.co./datasets/openai/gsm8k
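For reference, the GSM8K data used for this kind of fine-tuning can be inspected as follows; each example has a `question` and an `answer` whose final line carries the numeric result after a `####` marker.

```python
from datasets import load_dataset

# GSM8K's "main" configuration: columns are "question" and "answer",
# and the answer ends with a line like "#### 72".
dataset = load_dataset("openai/gsm8k", "main", split="train")

example = dataset[0]
final_answer = example["answer"].split("####")[-1].strip()
print(example["question"])
print(final_answer)
```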
Uses
Intended Use
This model is intended for educational and research purposes, particularly to demonstrate and support math problem solving with clear, step-by-step reasoning. It is well-suited for:
- Generating structured explanations for math problems.
- Serving as a lightweight assistant in educational applications focused on math reasoning.
Out-of-Scope Use
- High-Stakes Decision Making: This model is not designed for critical decision making.
- Non-Math Domains: Its performance is tailored to math problems; performance on other domains may be limited.
- Over-Reliance on Automated Reasoning: The reward functions used during fine-tuning (e.g., exact string matching) may not capture all nuances, so human oversight is recommended.
Bias, Risks, and Limitations
- Model Size: With only 0.5B parameters, it may not perform as robustly as larger models.
- Training Duration: Fine-tuning was performed for a single epoch; further training might be needed for more challenging tasks.
- Reward Function Limitations: The custom reward functions (checking for correct formatting and numerical correctness) are heuristic and may occasionally miss subtleties in reasoning; see the sketch after this list.
- Generalization: The structured format (with `<reasoning>` and `<answer>` tags) is enforced during training and may require adaptation for other use cases.
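To make the reward-function limitation concrete, GRPO rewards of this kind usually reduce to simple pattern checks. The functions below are an illustrative sketch (not the exact functions used to train this model): a format reward that looks for the `<reasoning>`/`<answer>` tags, and a correctness reward that does an exact string match against the GSM8K reference answer.

```python
import re

def _extract_answer(text):
    # Grab the contents of the <answer> ... </answer> block, if any.
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def format_reward(completions, **kwargs):
    # Small bonus for following the <reasoning>/<answer> structure.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    # Exact string match against the number after GSM8K's "####" marker;
    # this heuristic can penalize correct answers written differently
    # (e.g., "72.0" vs "72").
    refs = [a.split("####")[-1].strip() for a in answer]
    preds = [_extract_answer(c) for c in completions]
    return [2.0 if p == r else 0.0 for p, r in zip(preds, refs)]
```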
Recommendations
Users should:
- Validate model outputs on a case-by-case basis.
- Consider further fine-tuning for domain-specific applications.
- Use the model as a supplementary tool rather than the sole resource for critical math reasoning tasks.
How to Get Started with the Model
Below is an example code snippet to load and use the model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "emre/Qwen-0.5B-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).to("cuda")

# Example prompt (illustrative): pose a math question and prime the
# structured <reasoning>/<answer> output format.
prompt = """A train travels 60 miles per hour for 3 hours. How far does it travel?
<reasoning>
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
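The card above lists vLLM as the generation engine used during training; the fine-tuned model can also be served with vLLM directly for faster batch inference. The snippet below is a minimal sketch using vLLM's offline `LLM` API; the sampling parameters and example prompt are illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="emre/Qwen-0.5B-GRPO", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.7, max_tokens=300)

prompts = ["A train travels 60 miles per hour for 3 hours. How far does it travel?"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```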