license: cc-by-nc-4.0
ReFT: Reasoning with REinforced Fine-Tuning
Paper: https://arxiv.org/pdf/2401.08967.pdf
Repo: https://github.com/lqtrung1998/mwp_ReFT (under Apache2.0 License)
Introduction
We introduce REinforced Fine-Tuning (ReFT), a method that improves how well LLMs generalize when they are fine-tuned for reasoning.
This repository contains:
- A Warmup Supervised Fine-tuned model on GSM8k benchmark: lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k
- A Supervised Fine-tuned model on GSM8k benchmark: lqtrung1998/galactica-6.7b-SFT-GSM8k
- A Rerank model that can score the fine-tuned SFT model output: lqtrung1998/galactica-6.7b-SFT-Rerank-GSM8k
- A REinforced Fine-tuned model on GSM8k benchmark: lqtrung1998/galactica-6.7b-ReFT-GSM8k
- A Rerank model that can score the fine-tuned ReFT model output: lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k
Note: Our models are fine-tuned from Galactica, so licenses that apply to Galactica, such as the non-commercial CC BY-NC 4.0 license, also apply to these models.
Training Data
The model is trained on GSM8k data with Python SDP CoT format, which can be found here
Training Procedure
Check out our paper and repo for complete details.
ReFT model
The ReFT model is warmed up via Supervised Fine-Tuning on the GSM8k Python SDP training data for 2 epochs, then REinforced Fine-Tuned for 300 epochs using the questions in the GSM8k training set.
Rerank model
The Rerank model is trained to classify whether an output CoT is correct, using samples drawn from the ReFT model after its 2-epoch warm-up.
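As a rough illustration of the classification data described above, the sketch below labels each sampled CoT as correct (1) or incorrect (0) by comparing its executed answer with the reference answer. The function names, the record layout, and the numeric tolerance are assumptions made for this example; see the paper and repo for the actual data pipeline.

```python
import math


def label_sample(predicted, gold, tol=1e-4):
    """Return 1 if the sampled CoT's answer matches the reference answer, else 0.

    `predicted` is None when the sampled program failed to produce an answer.
    The tolerance `tol` is an assumption of this sketch.
    """
    if predicted is None:
        return 0
    return int(math.isclose(predicted, gold, abs_tol=tol))


def build_rerank_examples(question, samples, gold):
    """Turn (cot_text, predicted_answer) pairs into labeled classifier examples."""
    return [
        {"question": question, "cot": cot, "label": label_sample(pred, gold)}
        for cot, pred in samples
    ]
```

Each resulting record pairs a question and one sampled CoT with a binary label, which is the form a correct/incorrect classifier needs.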
Evaluation Results
See the evaluation results of the models in Table 4 of the research paper.
Updated results:
Model | Top-1 | Voting@100 | Rerank@100 |
---|---|---|---|
galactica-6.7b-SFT-warmup-GSM8k | 48.37 | - | - |
galactica-6.7b-SFT-GSM8k (+galactica-6.7b-SFT-Rerank-GSM8k) | 58.83 | 62.9 | 73.4 |
galactica-6.7b-ReFT-GSM8k (+galactica-6.7b-ReFT-Rerank-GSM8k) | 68.91 | 71.9 | 76.4 |
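
To make the Voting@100 and Rerank@100 columns concrete, here is a minimal sketch of the two selection rules applied to a set of sampled answers for one question. It assumes each sample has already been reduced to a final answer (or None if the program failed) and, for reranking, that a score per sample is available. The function names are hypothetical and this is not the repo's evaluation code.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-None answer, or None if every sample failed."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def rerank_select(answers, scores):
    """Return the answer of the sample with the highest reranker score."""
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    if not answers:
        return None
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]
```

With 100 samples per question, the first rule corresponds to picking the majority answer and the second to picking the sample the Rerank model scores highest.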
Usage
You can use the models through Hugging Face's Transformers library or follow the scripts in our repo.
Prompt format:
Question:
Weng earns $12 an hour for babysitting. Yesterday, she
just did 50 minutes of babysitting. How much did she earn?
Answer reasoning:
Expected response:
def solution():
    """Weng earns $12 an hour for babysitting. Yesterday, she just did
    50 minutes of babysitting. How much did she earn?"""
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
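
The prompt format and response above could be wired together roughly as follows. This is a minimal sketch, not the repo's inference script: the exact whitespace of the prompt template, the helper names, and the generation settings (greedy decoding, 256 new tokens) are assumptions, and the scripts in the repo remain the reference.

```python
# Assumed template: the exact newlines are inferred from the prompt format above.
PROMPT_TEMPLATE = "Question:\n{question}\nAnswer reasoning:\n"


def build_prompt(question: str) -> str:
    """Wrap a question in the prompt format shown above."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def generate_solution(question: str,
                      model_name: str = "lqtrung1998/galactica-6.7b-ReFT-GSM8k",
                      max_new_tokens: int = 256) -> str:
    """Generate a Python solution program for `question` (downloads the model)."""
    # Imported lazily so the helpers below work without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(build_prompt(question), return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens,
                                do_sample=False)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


def run_solution(program: str):
    """Execute a generated program and return the value of its solution().

    exec() runs arbitrary code, so model output should only be executed
    in a sandboxed environment.
    """
    namespace = {}
    exec(program, namespace)
    return namespace["solution"]()
```

For the example above, `run_solution(generate_solution(question))` would be expected to return a value close to 10, since 50 minutes at $12 an hour is $10.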
Citation
Please cite the paper if you use our data, model or code.
@misc{luong2024reft,
title={ReFT: Reasoning with Reinforced Fine-Tuning},
author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
year={2024},
eprint={2401.08967},
archivePrefix={arXiv},
primaryClass={cs.CL}
}