metadata

license: cc-by-nc-4.0

ReFT: Reasoning with REinforced Fine-Tuning

Paper: https://arxiv.org/pdf/2401.08967.pdf

Repo: https://github.com/lqtrung1998/mwp_ReFT (under the Apache 2.0 License)

Introduction

We introduce REinforced Fine-tuning (ReFT), a method that enhances the generalizability of learning LLMs for reasoning.

This repository contains:

A Warmup Supervised Fine-tuned model on GSM8k benchmark: lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k
A Supervised Fine-tuned model on GSM8k benchmark: lqtrung1998/galactica-6.7b-SFT-GSM8k
A Rerank model that can score the fine-tuned SFT model output: lqtrung1998/galactica-6.7b-SFT-Rerank-GSM8k
A REinforced Fine-tuned model on GSM8k benchmark: lqtrung1998/galactica-6.7b-ReFT-GSM8k
A Rerank model that can score the fine-tuned ReFT model output: lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k

Note: Our models are fine-tuned from Galactica, so the licenses applicable to Galactica, such as the non-commercial CC BY-NC 4.0 license, also apply to these models.

Model	Top-1	Voting@100	Rerank@100
galactica-6.7b-SFT-warmup-GSM8k	48.37	-	-
galactica-6.7b-SFT-GSM8k (+galactica-6.7b-SFT-Rerank-GSM8k)	58.83	62.9	73.4
galactica-6.7b-ReFT-GSM8k (+galactica-6.7b-ReFT-Rerank-GSM8k)	68.91	71.9	76.4
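
The table does not define the Voting@100 and Rerank@100 columns. Assuming they mean that 100 solutions are sampled per question and the final answer is picked either by majority vote or by the highest Rerank score, the two selection rules can be sketched as follows. The helper names `majority_vote` and `rerank_select` are hypothetical and are not taken from the repo.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the samples (assumed Voting@N rule).

    Samples whose program failed to produce an answer are passed as None
    and ignored.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def rerank_select(answers, scores):
    """Return the answer of the sample with the highest score (assumed Rerank@N rule).

    `scores` stands in for the Rerank model's per-sample correctness score.
    """
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Toy example with 5 samples instead of 100.
sampled_answers = [10.0, 10.0, 12.0, None, 8.0]
rerank_scores = [0.2, 0.3, 0.9, 0.0, 0.1]  # made-up scores for illustration

voted = majority_vote(sampled_answers)
reranked = rerank_select(sampled_answers, rerank_scores)
```

The two rules can disagree, as in this toy example: voting follows the most common answer, while reranking trusts the single sample the scorer rates highest.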

Training Data

The models are trained on GSM8k data in the Python SDP CoT format, which can be found here

Training Procedure

Check out our paper and repo for complete details.

ReFT model

The ReFT model is first warmed up via Supervised Fine-tuning on the GSM8k Python SDP training data for 2 epochs, then REinforced Fine-tuned for 300 epochs using the questions in the GSM8k training set.

Rerank model

The Rerank model is trained to classify whether an output CoT is correct, using data sampled from the ReFT model after its 2-epoch warm-up.
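
As a rough illustration of what "correct or not" could mean for a Python SDP CoT, the sketch below labels sampled programs by executing them and comparing the returned value with the gold answer. This is an assumption about how such labels might be built, not the procedure from the repo; `label_samples` and the tolerance `tol` are hypothetical.

```python
def label_samples(samples, gold_answer, tol=1e-4):
    """Pair each sampled program with a 0/1 correctness label.

    A sample is labeled 1 if its `solution()` runs and returns a number
    within `tol` of `gold_answer`, and 0 otherwise (including crashes).
    NOTE: exec runs arbitrary code; sandbox it for untrusted model output.
    """
    labeled = []
    for program in samples:
        try:
            namespace = {}
            exec(program, namespace)
            prediction = float(namespace["solution"]())
            label = int(abs(prediction - gold_answer) <= tol)
        except Exception:
            label = 0
        labeled.append((program, label))
    return labeled


# Three toy samples: one correct, one wrong, one that does not run.
samples = [
    "def solution():\n    return 12 * 50 / 60\n",
    "def solution():\n    return 12 * 50\n",
    "def solution():\n    return undefined_name\n",
]
labels = [label for _, label in label_samples(samples, gold_answer=10.0)]
```

The resulting (program, label) pairs are the kind of data a binary classifier could be trained on.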

Evaluation Results

See the evaluation results of the models in Table 4 of the research paper.

Usage

You can use the models through Hugging Face's Transformers library or by following the scripts in our repo.

Prompt format:

Question:
Weng earns $12 an hour for babysitting. Yesterday, she
just did 50 minutes of babysitting. How much did she earn?
Answer reasoning:

Expected response:

def solution():
  """Weng earns $12 an hour for babysitting. Yesterday, she just did
  50 minutes of babysitting. How much did she earn?"""
  hourly_rate = 12
  minutes_worked = 50
  hours_worked = minutes_worked / 60
  earnings = hourly_rate * hours_worked
  result = earnings
  return result
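
A minimal sketch of the round trip: build the prompt, generate a program, and execute it to get the numeric answer. The exact whitespace of the prompt, the generation settings, and the helper names (`build_prompt`, `generate_program`, `run_solution`) are assumptions; the scripts in the repo are the reference. `generate_program` is only defined here, not called, since it downloads a 6.7B-parameter model.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the prompt format shown above (whitespace assumed)."""
    return f"Question:\n{question}\nAnswer reasoning:\n"


def generate_program(question: str,
                     model_name: str = "lqtrung1998/galactica-6.7b-ReFT-GSM8k",
                     max_new_tokens: int = 256) -> str:
    """Generate a Python solution with Transformers (greedy decoding assumed)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(build_prompt(question), return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


def run_solution(program: str):
    """Execute a generated program and return the value of solution().

    NOTE: exec runs arbitrary code; sandbox it for untrusted model output.
    """
    namespace = {}
    exec(program, namespace)
    return namespace["solution"]()


# The expected response from above, executed without loading any model.
EXPECTED_RESPONSE = '''def solution():
  """Weng earns $12 an hour for babysitting. Yesterday, she just did
  50 minutes of babysitting. How much did she earn?"""
  hourly_rate = 12
  minutes_worked = 50
  hours_worked = minutes_worked / 60
  earnings = hourly_rate * hours_worked
  result = earnings
  return result
'''

print(run_solution(EXPECTED_RESPONSE))  # 10.0
```

With the model available, `run_solution(generate_program(question))` would chain the two steps.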

Citation

Please cite the paper if you use our data, model or code.

@misc{luong2024reft,
      title={ReFT: Reasoning with Reinforced Fine-Tuning}, 
      author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
      year={2024},
      eprint={2401.08967},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}