Issues faced in reproducing the paper's experiments
Very interesting work! I am currently trying to reproduce the experimental results from your paper. However, I have encountered two issues:
- The generated text tends to have severe repetition.
- The model's accuracy on MATH problems (GSM8K dataset) is significantly lower than the results reported in the paper.
I would like to ask whether this discrepancy might be due to the checkpoint used or specific hyperparameter settings (e.g., temperature). Would it be possible to share the exact hyperparameter configurations used in the paper? Thanks!
Hi, are your issues with MATH or with GSM8k? Some more details on GSM8k can be found here: https://huggingface.co./tomg-group-umd/huginn-0125/discussions/7#67b59e08b24bf87803b701b6
Regarding repetition, this has not been a big problem for me. Are you using the model as a text-completion model, or with the chat template?
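For reference, here is a minimal sketch of chat-templated generation with transformers. Treat it as illustrative: the `mean_recurrence` config field mirrors the name used in the lm-eval `model_args` and is an assumption on my part, so check the model card for the exact argument controlling the recurrence budget.
```python
# Minimal sketch: chat-templated generation with transformers.
# Assumptions: the model loads via the standard Auto* classes with
# trust_remote_code=True, and the recurrence budget is exposed as a config
# field named mean_recurrence (mirroring the lm-eval model_args).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tomg-group-umd/huginn-0125"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()
model.config.mean_recurrence = 32  # assumed config field for the recurrence budget

# Use the chat template rather than raw text completion.
messages = [{"role": "user", "content": "Natalia sold clips to 48 of her friends ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```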
Thank you for your response and reminder! I realized that I was using text completion instead of chat templating, which resulted in a lot of repetition. I will try using the lm-eval harness for evaluation to see if I can reproduce the results successfully. Thanks again!
Sure! Let me know how it goes, or if there are follow-up questions.
Hi! Thank you so much for your help. I've successfully conducted some experiments on the GSM8K-COT dataset, and I truly appreciate your support!
I still have two questions that might seem a little bit naive, but I would be really grateful if you could clarify them for me:
Regarding the prompt input for GSM8K-CoT in https://huggingface.co./tomg-group-umd/huginn-0125/discussions/7#67b45666f73b1e449308ce91, mean_recurrence is set to 64. I assume this means a fixed budget of 64 recurrent iterations is used for all tokens throughout inference. Is my understanding correct?
I’m planning to conduct more experiments, including code generation and reasoning tasks. Could you kindly provide some details on how to use lm-eval to evaluate performance on the (1) MBPP, (2) HumanEval, and (3) HellaSwag datasets discussed in the paper? It would be extremely helpful if you could also share the input prompts used for these tasks. Thanks again for your time and effort!
- Yeah, the simplest way to run the model is to set a fixed budget like that, which is then applied to all tokens during generation.
- HellaSwag is easy to run; the task is classically evaluated without generation, just with:
lm_eval --model hf --model_args pretrained=tomg-group-umd/MODEL,trust_remote_code=True,dtype="bfloat16",mean_recurrence=X --tasks hellaswag --batch_size=auto --num_fewshot=Y --output_path=outputs/path
(A programmatic version of this command is sketched after this list.)
- MBPP and HumanEval, however, are code benchmarks and so require code execution, which is not supported in the lm-eval harness. We run code benchmarks with BigCode (https://github.com/bigcode-project/bigcodebench). @nsjain might have a snippet to show how that works.
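If you prefer to drive the HellaSwag run from Python, the CLI command above can be expressed through the harness's `simple_evaluate` API. This is only a sketch: it assumes lm-eval >= 0.4, and the mean_recurrence and num_fewshot values are placeholders standing in for the X and Y above.
```python
# Sketch: programmatic equivalent of the lm_eval HellaSwag command above.
# Assumes lm-eval >= 0.4 (lm_eval.simple_evaluate) and that mean_recurrence
# is accepted as an extra model_args entry, as in the CLI invocation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=tomg-group-umd/huginn-0125,"
        "trust_remote_code=True,dtype=bfloat16,mean_recurrence=32"
    ),
    tasks=["hellaswag"],
    num_fewshot=10,       # illustrative; match whatever Y you use in the CLI
    batch_size="auto",
)
print(results["results"]["hellaswag"])
```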
Thanks for your rapid response! It really helps! I will try to use BigCode for evaluation and further experiments. Have a good weekend!