Is GSM8K evaluated few-shot (no CoT)?

#365
by imone - opened

Why are the GSM8K scores on the leaderboard lower than the results reported in the paper? For example, Llama 2 70B scores 56.8 in the paper vs. 33.9 on the leaderboard, as reported here. Is the leaderboard evaluating in a few-shot setting without chain-of-thought (CoT), whereas GSM8K is typically run with few-shot or zero-shot CoT?

Open LLM Leaderboard org

Hi! I'm closing this issue because it has already been discussed in the other one you pointed out. Let's centralize discussions :)

clefourrier changed discussion status to closed
