---
license: cc-by-nc-4.0
---
# ReFT: Reasoning with REinforced Fine-Tuning

Paper: https://arxiv.org/pdf/2401.08967.pdf

Repo: https://github.com/lqtrung1998/mwp_ReFT (under the [Apache 2.0 License](https://github.com/lqtrung1998/mwp_ReFT/blob/main/License.txt))

## Introduction
We introduce REinforced Fine-Tuning (ReFT), a method that enhances the generalizability of LLMs trained for reasoning. This repository contains:
- A Warmup Supervised Fine-tuned model on the GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k](https://huggingface.co./lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k)
- A Supervised Fine-tuned model on the GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-GSM8k](https://huggingface.co./lqtrung1998/galactica-6.7b-SFT-GSM8k)
- A Rerank model that can score the outputs of the fine-tuned SFT model: [lqtrung1998/galactica-6.7b-SFT-Rerank-GSM8k](https://huggingface.co./lqtrung1998/galactica-6.7b-SFT-Rerank-GSM8k)
- A REinforced Fine-tuned model on the GSM8k benchmark: [lqtrung1998/galactica-6.7b-ReFT-GSM8k](https://huggingface.co./lqtrung1998/galactica-6.7b-ReFT-GSM8k)
- A Rerank model that can score the outputs of the fine-tuned ReFT model: [lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k](https://huggingface.co./lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k)

Note: Our models are tuned based on Galactica; thus, licenses applicable to Galactica, such as the non-commercial CC BY-NC 4.0 license, also hold on these models.

|                                                                   | Top-1 | Voting@100 | Rerank@100 |
|-------------------------------------------------------------------|:-----:|:----------:|:----------:|
| galactica-6.7b-SFT-warmup-GSM8k                                   | 48.37 |     -      |     -      |
| galactica-6.7b-SFT-GSM8k<br>(+galactica-6.7b-SFT-Rerank-GSM8k)    | 58.83 |    62.9    |    73.4    |
| galactica-6.7b-ReFT-GSM8k<br>(+galactica-6.7b-ReFT-Rerank-GSM8k)  | 68.91 |    71.9    |    76.4    |

## Training Data
The models are trained on GSM8k data in the Python SDP CoT format, which can be found [here](https://github.com/lqtrung1998/mwp_ReFT).

## Training Procedure
Check out our paper and repo for complete details.

#### ReFT model
The ReFT model is first warmed up via Supervised Fine-tuning on the GSM8k Python SDP training data for 2 epochs, then REinforced Fine-tuned for 300 epochs using the questions in the GSM8k training set.

#### Rerank model
The Rerank model is trained to classify whether an output CoT is correct or not, using samples drawn from the ReFT model after its 2-epoch warm-up.
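The Voting@100 and Rerank@100 columns above aggregate 100 sampled solutions per question: voting picks the most common final answer, while reranking picks the candidate that the Rerank model scores highest. Below is a minimal sketch of these two selection rules; the `rerank_score` callable is a hypothetical stand-in for the Rerank model, not the repo's implementation.

```python
from collections import Counter
from typing import Callable, List, Optional

def majority_vote(final_answers: List[Optional[float]]) -> Optional[float]:
    """Voting@N: return the most common final answer among N sampled solutions."""
    valid = [a for a in final_answers if a is not None]
    return Counter(valid).most_common(1)[0][0] if valid else None

def rerank_select(candidate_cots: List[str], rerank_score: Callable[[str], float]) -> str:
    """Rerank@N: return the candidate CoT that the Rerank model scores highest."""
    return max(candidate_cots, key=rerank_score)
```

With the Python SDP format, a candidate's final answer is obtained by executing its generated `solution()` function (see the Usage section below).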
## Evaluation Results
See the evaluation results of the models in Table 4 of the paper.

## Usage
You can use the models through Hugging Face's Transformers library or follow the scripts in our repo.

Prompt format:
```python
Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Answer reasoning:
```
Expected response:
```python
def solution():
    """Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?"""
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
```
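As an illustration, here is a minimal sketch of querying the ReFT model through Transformers with the prompt format above. The decoding settings and the answer-extraction step are assumptions made for this example, not the repo's exact scripts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lqtrung1998/galactica-6.7b-ReFT-GSM8k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = ("Weng earns $12 an hour for babysitting. Yesterday, she just did "
            "50 minutes of babysitting. How much did she earn?")
prompt = f"Question: {question}\nAnswer reasoning:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Keep only the newly generated tokens (the Python CoT).
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)

# The final answer comes from executing the generated `solution()` function;
# only execute model-generated code in a sandboxed environment.
namespace = {}
exec(completion, namespace)
print(namespace["solution"]())
```

For Voting@100- or Rerank@100-style inference, sample many completions per question (e.g., with `do_sample=True` and `num_return_sequences`) and aggregate them as sketched in the Training Procedure section above.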
## Citation
Please cite the paper if you use our data, model or code.
```
@misc{luong2024reft,
      title={ReFT: Reasoning with Reinforced Fine-Tuning},
      author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
      year={2024},
      eprint={2401.08967},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```