---
library_name: transformers
tags:
- trl
- grpo
- qwen
- gsm8k
---
# Qwen-0.5B-GRPO: A Fine-Tuned Math Reasoner
This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct) trained with GRPO (Group Relative Policy Optimization). It was trained on the GSM8K math dataset to improve its ability to generate step-by-step reasoning for math problems, following a structured output format with explicit `<reasoning>` and `<answer>` sections.
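For illustration, the structured format can be conveyed to the model through a system prompt along these lines (the exact wording used during training is an assumption, not quoted from the training script):

```python
# Illustrative system prompt describing the trained output layout.
# The exact prompt used during GRPO training is assumed, not verbatim.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""
```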
## Model Details
### Model Description
Qwen-0.5B-GRPO is designed to serve as a lightweight math reasoning assistant. By fine-tuning with reinforcement learning using GRPO, the model learns to produce responses that include both intermediate reasoning and final answers. Key adaptations include:
- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **Fine-Tuning Method:** GRPO (reinforcement learning with custom reward functions; a minimal reward sketch follows this list)
- **Dataset:** GSM8K – a collection of challenging grade-school math problems
- **Generation Engine:** Utilizes vLLM for faster inference on a single GPU setup
- **Precision:** BF16 training for efficiency on Colab GPUs
- **Developed by:** Davut Emre Taşar
- **License:** Please refer to the license of the base model on its Hugging Face Hub page
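The reward functions below are a minimal sketch of the heuristics this card describes (format checking plus numerical correctness). The function names and score values are illustrative assumptions, not the code used for training:

```python
import re

def format_reward(completion: str) -> float:
    """Small bonus when the completion follows the <reasoning>/<answer> layout (illustrative)."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, reference_answer: str) -> float:
    """Larger bonus when the extracted <answer> exactly matches the GSM8K reference (illustrative)."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```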
### Model Sources
- **Repository (this model):** [https://huggingface.co./emre/Qwen-0.5B-GRPO](https://huggingface.co./emre/Qwen-0.5B-GRPO)
- **Base Model Repository:** [https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct)
- **Dataset:** [https://huggingface.co./datasets/openai/gsm8k](https://huggingface.co./datasets/openai/gsm8k)
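For reference, the GSM8K data can be loaded with the `datasets` library; this snippet is only illustrative and is not the training script:

```python
from datasets import load_dataset

# GSM8K grade-school math word problems; "main" is the standard configuration.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])  # reference solution, ending in "#### <final answer>"
```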
## Uses
### Intended Use
This model is intended for educational and research purposes, particularly to demonstrate and support math problem solving with clear, step-by-step reasoning. It is well-suited for:
- Generating structured explanations for math problems.
- Serving as a lightweight assistant in educational applications focused on math reasoning.
### Out-of-Scope Use
- **High-Stakes Decision Making:** This model is not designed for critical decision making.
- **Non-Math Domains:** Its performance is tailored to math problems; performance on other domains may be limited.
- **Over-Reliance on Automated Reasoning:** The reward functions used during fine-tuning (e.g., exact string matching) may not capture all nuances, so human oversight is recommended.
## Bias, Risks, and Limitations
- **Model Size:** With only 0.5B parameters, it may not perform as robustly as larger models.
- **Training Duration:** Fine-tuning was performed for a single epoch; further training might be needed for more challenging tasks.
- **Reward Function Limitations:** The custom reward functions (checking for correct formatting and numerical correctness) are heuristic and may occasionally miss subtleties in reasoning.
- **Generalization:** The structured format (with `<reasoning>` and `<answer>` tags) is enforced during training and may require adaptation for other use cases.
### Recommendations
Users should:
- Validate model outputs on a case-by-case basis (a minimal answer-extraction sketch follows this list).
- Consider further fine-tuning for domain-specific applications.
- Use the model as a supplementary tool rather than the sole resource for critical math reasoning tasks.
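One lightweight way to validate outputs is to pull the final answer out of the `<answer>` tags and compare it against a trusted reference. The helper below is a hypothetical sketch, not part of this repository:

```python
import re
from typing import Optional

def extract_answer(generated_text: str) -> Optional[str]:
    """Return the content of the first <answer>...</answer> block, if any (hypothetical helper)."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", generated_text, re.DOTALL)
    return match.group(1).strip() if match else None

# Compare the extracted answer against a known-good value before trusting the output.
print(extract_answer("<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>"))  # -> "4"
```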
## How to Get Started with the Model
Below is an example code snippet to load and use the model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "emre/Qwen-0.5B-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")
# Ask a math question; the fine-tuned model is expected to reply with
# <reasoning> and <answer> sections.
question = "A baker made 24 muffins and sold 3 boxes of 6. How many muffins are left?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
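Since the card lists vLLM as the generation engine, an equivalent offline-inference path with vLLM would look roughly like the sketch below (standard vLLM API assumed; this is not code shipped with the repository):

```python
from vllm import LLM, SamplingParams

# Load the fine-tuned checkpoint with vLLM for faster batched generation.
llm = LLM(model="emre/Qwen-0.5B-GRPO", dtype="bfloat16")
sampling = SamplingParams(temperature=0.7, max_tokens=300)

outputs = llm.generate(
    ["A train travels 60 km in 1.5 hours. What is its average speed in km/h?"],
    sampling,
)
print(outputs[0].outputs[0].text)
```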