---
library_name: transformers
tags:
- trl
- grpo
- qwen
- gsm8k
---
# Qwen-0.5B-GRPO: A Fine-Tuned Math Reasoner
This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct), trained with GRPO (Group Relative Policy Optimization) on the GSM8K math dataset to improve its ability to generate step-by-step reasoning for math problems. Outputs follow a structured format with explicit `<reasoning>` and `<answer>` sections.
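For illustration, a well-formed completion looks like the following (the arithmetic is a made-up sample, not actual model output):

```
<reasoning>
Natalia sold 48 clips in April and half as many, 48 / 2 = 24, in May, so 48 + 24 = 72 in total.
</reasoning>
<answer>
72
</answer>
```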
## Model Details

### Model Description
Qwen-0.5B-GRPO is designed to serve as a lightweight math reasoning assistant. By fine-tuning with reinforcement learning using GRPO, the model learns to produce responses that include both intermediate reasoning and final answers. Key adaptations include:
- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **Fine-Tuning Method:** GRPO (reinforcement learning with custom reward functions; see the sketch after this list)
- **Dataset:** GSM8K, a collection of challenging grade-school math word problems
- **Generation Engine:** vLLM, used for faster generation on a single-GPU setup
- **Precision:** BF16 training for efficiency on Colab GPUs
- **Developed by:** Davut Emre Taşar
- **License:** Please refer to the license of the base model on its Hugging Face Hub page
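To make the reward setup concrete, here is a minimal sketch of the kind of reward functions used with TRL's `GRPOTrainer` in this recipe. The helper name, the reward weights, the `answer` column name, and the assumption that completions arrive as plain strings are all illustrative, not the exact training code:

```python
import re

def extract_answer(text: str) -> str:
    # Pull the contents of the <answer> tag out of a completion.
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def format_reward(completions, **kwargs):
    # Small reward for following the <reasoning>/<answer> template.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    # Larger reward when the extracted answer exactly matches the gold
    # GSM8K label (the text after '####' in the dataset's answer column).
    return [2.0 if extract_answer(c) == a else 0.0 for c, a in zip(completions, answer)]
```

Splitting the signal into a small formatting reward and a larger exact-match correctness reward mirrors the heuristics described under the limitations section below.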
### Model Sources
- **Repository (this model):** [https://huggingface.co./emre/Qwen-0.5B-GRPO](https://huggingface.co./emre/Qwen-0.5B-GRPO)
- **Base Model Repository:** [https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct)
- **Dataset:** [https://huggingface.co./datasets/openai/gsm8k](https://huggingface.co./datasets/openai/gsm8k) (a loading sketch follows this list)
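For reference, each GSM8K record pairs a question with a worked solution whose gold label follows a `####` marker. A quick way to inspect this, assuming the `datasets` library:

```python
from datasets import load_dataset

# "main" is the standard GSM8K config; each record has "question" and "answer".
ds = load_dataset("openai/gsm8k", "main", split="train")

# The gold numeric label is the text after the '####' marker in "answer".
gold = ds[0]["answer"].split("####")[-1].strip()
print(ds[0]["question"][:80], "->", gold)
```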
## Uses

### Intended Use
This model is intended for educational and research purposes, particularly to demonstrate and support math problem solving with clear, step-by-step reasoning. It is well suited for:

- Generating structured explanations for math problems.
- Serving as a lightweight assistant in educational applications focused on math reasoning.
### Out-of-Scope Use
- **High-Stakes Decision Making:** This model is not designed for critical decision making.
- **Non-Math Domains:** Its fine-tuning is tailored to math problems, so performance on other domains may be limited.
- **Over-Reliance on Automated Reasoning:** The reward functions used during fine-tuning (e.g., exact string matching) may not capture all nuances, so human oversight is recommended.
## Bias, Risks, and Limitations
- **Model Size:** With only 0.5B parameters, it may not perform as robustly as larger models.
- **Training Duration:** Fine-tuning was performed for a single epoch; further training might be needed for more challenging tasks.
- **Reward Function Limitations:** The custom reward functions (checking for correct formatting and numerical correctness) are heuristic and may occasionally miss subtleties in reasoning.
- **Generalization:** The structured format (with `<reasoning>` and `<answer>` tags) is enforced during training and may require adaptation for other use cases.
### Recommendations
Users should:

- Validate model outputs on a case-by-case basis.
- Consider further fine-tuning for domain-specific applications.
- Use the model as a supplementary tool rather than the sole resource for critical math reasoning tasks.
## How to Get Started with the Model
Below is an example snippet that loads the model and prompts it with a GSM8K-style question. The chat-formatted system prompt requesting `<reasoning>` and `<answer>` tags is an assumption about the training setup; adjust it if outputs don't follow the format:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "emre/Qwen-0.5B-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

# Ask for the structured format; this system prompt is a reasonable guess
# at the training setup, not a published artifact of it.
messages = [
    {
        "role": "system",
        "content": "Respond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>",
    },
    {
        "role": "user",
        "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    },
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")

# max_new_tokens bounds only the generated continuation
# (max_length would also count the prompt tokens).
outputs = model.generate(inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
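Because the completion ends with an `<answer>` block, the tag-parsing approach from the reward sketch also recovers the final number here. A convenience snippet, assuming `inputs` and `outputs` from the block above are still in scope:

```python
import re

generated = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
match = re.search(r"<answer>\s*(.*?)\s*</answer>", generated, re.DOTALL)
print(match.group(1).strip() if match else "no <answer> tag found")
```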