---
library_name: transformers
tags:
- trl
- grpo
- qwen
- gsm8k
---
# Qwen-0.5B-GRPO: A Fine-Tuned Math Reasoner
This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct) trained with GRPO (Group Relative Policy Optimization). It was trained on the GSM8K math dataset to improve its ability to generate step-by-step reasoning for math problems, following a structured output format with explicit `<reasoning>` and `<answer>` sections.
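For illustration, the structured format can be conveyed to the model through a system prompt along these lines (the exact wording used during training is an assumption, not quoted from the training script):

```python
# Illustrative system prompt describing the trained output layout.
# The exact prompt used during GRPO training is assumed, not verbatim.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""
```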
## Model Details
### Model Description
Qwen-0.5B-GRPO is designed to serve as a lightweight math reasoning assistant. By fine-tuning with reinforcement learning using GRPO, the model learns to produce responses that include both intermediate reasoning and final answers. Key adaptations include:
- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **Fine-Tuning Method:** GRPO (reinforcement learning with custom reward functions; a minimal reward sketch follows this list)
- **Dataset:** GSM8K – a collection of challenging grade-school math problems
- **Generation Engine:** Utilizes vLLM for faster inference on a single GPU setup
- **Precision:** BF16 training for efficiency on Colab GPUs
- **Developed by:** Davut Emre Taşar
- **License:** Please refer to the license of the base model on its Hugging Face Hub page
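The reward functions below are a minimal sketch of the heuristics this card describes (format checking plus numerical correctness). The function names and score values are illustrative assumptions, not the code used for training:

```python
import re

def format_reward(completion: str) -> float:
    """Small bonus when the completion follows the <reasoning>/<answer> layout (illustrative)."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, reference_answer: str) -> float:
    """Larger bonus when the extracted <answer> exactly matches the GSM8K reference (illustrative)."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```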
### Model Sources
- **Repository (this model):** [https://huggingface.co./emre/Qwen-0.5B-GRPO](https://huggingface.co./emre/Qwen-0.5B-GRPO)
- **Base Model Repository:** [https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct)
- **Dataset:** [https://huggingface.co./datasets/openai/gsm8k](https://huggingface.co./datasets/openai/gsm8k)
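For reference, the GSM8K data can be loaded with the `datasets` library; this snippet is only illustrative and is not the training script:

```python
from datasets import load_dataset

# GSM8K grade-school math word problems; "main" is the standard configuration.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])  # reference solution, ending in "#### <final answer>"
```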
## Uses
### Intended Use
This model is intended for educational and research purposes, particularly to demonstrate and support math problem solving with clear, step-by-step reasoning. It is well-suited for:
- Generating structured explanations for math problems.
- Serving as a lightweight assistant in educational applications focused on math reasoning.
### Out-of-Scope Use
- **High-Stakes Decision Making:** This model is not designed for critical decision making.
- **Non-Math Domains:** Its performance is tailored to math problems; performance on other domains may be limited.
- **Over-Reliance on Automated Reasoning:** The reward functions used during fine-tuning (e.g., exact string matching) may not capture all nuances, so human oversight is recommended.
## Bias, Risks, and Limitations
- **Model Size:** With only 0.5B parameters, it may not perform as robustly as larger models.
- **Training Duration:** Fine-tuning was performed for a single epoch; further training might be needed for more challenging tasks.
- **Reward Function Limitations:** The custom reward functions (checking for correct formatting and numerical correctness) are heuristic and may occasionally miss subtleties in reasoning.
- **Generalization:** The structured format (with `<reasoning>` and `<answer>` tags) is enforced during training and may require adaptation for other use cases.
### Recommendations
Users should:
- Validate model outputs on a case-by-case basis (a minimal answer-extraction sketch follows this list).
- Consider further fine-tuning for domain-specific applications.
- Use the model as a supplementary tool rather than the sole resource for critical math reasoning tasks.
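One lightweight way to validate outputs is to pull the final answer out of the `<answer>` tags and compare it against a trusted reference. The helper below is a hypothetical sketch, not part of this repository:

```python
import re
from typing import Optional

def extract_answer(generated_text: str) -> Optional[str]:
    """Return the content of the first <answer>...</answer> block, if any (hypothetical helper)."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", generated_text, re.DOTALL)
    return match.group(1).strip() if match else None

# Compare the extracted answer against a known-good value before trusting the output.
print(extract_answer("<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>"))  # -> "4"
```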
## How to Get Started with the Model
Below is an example code snippet to load and use the model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "emre/Qwen-0.5B-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")
# Ask a math question; the fine-tuned model is expected to reply with
# <reasoning> and <answer> sections.
question = "A baker made 24 muffins and sold 3 boxes of 6. How many muffins are left?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
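Since the card lists vLLM as the generation engine, an equivalent offline-inference path with vLLM would look roughly like the sketch below (standard vLLM API assumed; this is not code shipped with the repository):

```python
from vllm import LLM, SamplingParams

# Load the fine-tuned checkpoint with vLLM for faster batched generation.
llm = LLM(model="emre/Qwen-0.5B-GRPO", dtype="bfloat16")
sampling = SamplingParams(temperature=0.7, max_tokens=300)

outputs = llm.generate(
    ["A train travels 60 km in 1.5 hours. What is its average speed in km/h?"],
    sampling,
)
print(outputs[0].outputs[0].text)
```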