---
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
language:
- en
library_name: transformers
---

# Model Card for Model ID
This is a Qwen2.5 0.5B Instruct model fine-tuned on a dataset generated by Monte Carlo Tree Search (MCTS) based sampling. MCTS was rolled out on a small subset of the GSM8K train split, and the resulting traces and value estimates were used to form the dataset. Only the last two transformer blocks and the regression head were unfrozen during fine-tuning.
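As a rough illustration of that setup, the sketch below freezes the backbone, unfreezes the last two transformer blocks, and attaches a scalar regression (value) head to Qwen2.5 0.5B Instruct. The head architecture and the helper names (`value_head`, `value_estimate`) are assumptions for illustration, not this repo's exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: scalar value head on top of the last hidden state.
# The exact head used for this model is an assumption.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModelForCausalLM.from_pretrained(model_name)

value_head = nn.Linear(backbone.config.hidden_size, 1)  # stays trainable by default

# Freeze the whole backbone ...
for p in backbone.parameters():
    p.requires_grad = False
# ... then unfreeze only the last two transformer blocks.
for block in backbone.model.layers[-2:]:
    for p in block.parameters():
        p.requires_grad = True

def value_estimate(partial_solution: str) -> torch.Tensor:
    """Score a partial solution using the hidden state of the last token."""
    inputs = tokenizer(partial_solution, return_tensors="pt")
    hidden = backbone(**inputs, output_hidden_states=True).hidden_states[-1]
    return value_head(hidden[:, -1]).squeeze(-1)
```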
The idea is to use only the value network to guide MCTS sampling, without the need for simulations/rollouts.
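Below is a minimal sketch of that idea: the usual rollout/simulation phase of MCTS is replaced by a single value-network call at the selected leaf. The `Node` layout and `value_fn` signature are illustrative assumptions, not this repo's API.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                      # partial solution so far
    children: list = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0

def mcts_step(root: Node, value_fn, c_uct: float = 1.0) -> float:
    # Selection: descend by UCT until a leaf (no children) is reached.
    path, node = [root], root
    while node.children:
        node = max(
            node.children,
            key=lambda ch: ch.total_value / (ch.visits + 1e-8)
            + c_uct * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-8)),
        )
        path.append(node)

    # Expansion (proposing child steps with the policy) is omitted here.
    # Evaluation: one value-network call replaces the usual rollout.
    value = value_fn(node.text)

    # Backpropagation: update visit counts and value sums along the path.
    for n in path:
        n.visits += 1
        n.total_value += value
    return value
```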
Currently the value network overfits due to the very limited number of samples. I'm going to update this soon, once I've sampled more data.
Scores on the first 65 samples of the GSM8K test split:
- Beam search (3 beams): 40.0%
- MCTS search (3 beams): 50.77%
The final rollout of the MCTS search is also done via beam search. During testing on GSM8K, only the value network was used to guide the search (see the sketch below).
All tests were done with Qwen2.5 0.5B Instruct.
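For reference, here is a hedged sketch of what value-guided beam search over reasoning steps could look like: candidates are ranked by the value network rather than by token log-probabilities alone. `generate_candidates` and `value_estimate` are hypothetical helpers, and the beam/step budgets are placeholders.

```python
def value_guided_beam_search(question, generate_candidates, value_estimate,
                             num_beams=3, max_steps=16):
    """Keep the num_beams partial solutions ranked highest by the value network."""
    beams = [question]
    for _ in range(max_steps):
        candidates = []
        for beam in beams:
            # Propose a few candidate next steps for each beam (e.g. by sampling).
            candidates.extend(generate_candidates(beam, n=num_beams))
        if not candidates:
            break
        # Prune to the top-k candidates according to the value network.
        beams = sorted(candidates, key=value_estimate, reverse=True)[:num_beams]
    return max(beams, key=value_estimate)
```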