Issues faced in reproducing the paper's experiments
Very interesting work! I am currently trying to reproduce the experimental results from your paper. However, I have encountered two issues:
- The generated text tends to have severe repetition.
- The model's accuracy on MATH problems (GSM8K dataset) is significantly lower than the results reported in the paper.
I would like to ask whether this discrepancy might be due to the checkpoint used or specific hyperparameter settings (e.g., temperature). Would it be possible to share the exact hyperparameter configurations used in the paper? Thanks!
Hi, are your issues with MATH or with GSM8k? Some more details on GSM8k can be found here: https://huggingface.co./tomg-group-umd/huginn-0125/discussions/7#67b59e08b24bf87803b701b6
Regarding repetition, this has not been a big problem for me. Are you using the model as a text-completion model, or with the chat template?
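For reference, here is a minimal sketch of chat-templated generation with transformers. Treat it as illustrative: the `mean_recurrence` config field mirrors the name used in the lm-eval `model_args` and is an assumption on my part, so check the model card for the exact argument controlling the recurrence budget.
```python
# Minimal sketch: chat-templated generation with transformers.
# Assumptions: the model loads via the standard Auto* classes with
# trust_remote_code=True, and the recurrence budget is exposed as a config
# field named mean_recurrence (mirroring the lm-eval model_args).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tomg-group-umd/huginn-0125"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()
model.config.mean_recurrence = 32  # assumed config field for the recurrence budget

# Use the chat template rather than raw text completion.
messages = [{"role": "user", "content": "Natalia sold clips to 48 of her friends ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```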
Thank you for your response and reminder! I realized that I was using text completion instead of chat templating, which resulted in a lot of repetition. I will try using the lm-eval harness for evaluation to see if I can reproduce the results successfully. Thanks again!
Sure! Let me know how it goes, or if there are follow-up questions.
Hi! Thank you so much for your help. I've successfully conducted some experiments on the GSM8K-COT dataset, and I truly appreciate your support!
I still have two questions that might seem a little bit naive, but I would be really grateful if you could clarify them for me:
Regarding the prompt input for GSM8K-CoT in https://huggingface.co./tomg-group-umd/huginn-0125/discussions/7#67b45666f73b1e449308ce91, mean_recurrence is set to 64. I assume this means a fixed budget of 64 recurrent iterations is used for all tokens throughout inference. Is my understanding correct?
I’m planning to conduct more experiments, including code generation and reasoning tasks. Could you kindly provide some details on how to use lm-eval to evaluate performance on the (1) MBPP, (2) HumanEval, and (3) HellaSwag datasets discussed in the paper? It would be extremely helpful if you could also share the input prompts used for these tasks. Thanks again for your time and effort!
- Yeah, the simplest way to run the model is to set a fixed budget like that, which is then applied to all tokens during generation.
- HellaSwag is easy to run; the task is classically evaluated without generation, just with:
lm_eval --model hf --model_args pretrained=tomg-group-umd/MODEL,trust_remote_code=True,dtype="bfloat16",mean_recurrence=X --tasks hellaswag --batch_size=auto --num_fewshot=Y --output_path=outputs/path
(A programmatic version of this command is sketched after this list.)
- MBPP and HumanEval, however, are code benchmarks and so require code execution, which is not supported in the lm-eval harness. We run code benchmarks with BigCode (https://github.com/bigcode-project/bigcodebench). @nsjain might have a snippet to show how that works.
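If you prefer to drive the HellaSwag run from Python, the CLI command above can be expressed through the harness's `simple_evaluate` API. This is only a sketch: it assumes lm-eval >= 0.4, and the mean_recurrence and num_fewshot values are placeholders standing in for the X and Y above.
```python
# Sketch: programmatic equivalent of the lm_eval HellaSwag command above.
# Assumes lm-eval >= 0.4 (lm_eval.simple_evaluate) and that mean_recurrence
# is accepted as an extra model_args entry, as in the CLI invocation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=tomg-group-umd/huginn-0125,"
        "trust_remote_code=True,dtype=bfloat16,mean_recurrence=32"
    ),
    tasks=["hellaswag"],
    num_fewshot=10,       # illustrative; match whatever Y you use in the CLI
    batch_size="auto",
)
print(results["results"]["hellaswag"])
```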
Thanks for your rapid response! It really helps! I will try to use BigCode for evaluation and further experiments. Have a good weekend!