Performance on minor perturbations of a GSM-8K problem

#3
by raishish - opened

The paper claims that LLaDA-8B scores 78.6% on GSM-8K, so I tried perturbing the values in one of the problems from the dataset. The model can't even get the intermediate steps right.

Original problem from GSM-8K

😃: Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?
🤖: Mark has 10 * 1.8 = 18 purple flowers.
Mark has 28 * 25% = 7 green flowers.
In total Mark has 10 + 18 + 7 = 35 flowers.
Con answer: 35

Perturbed Values

😃: Mark has a garden with flowers. He planted plants of three different colors in it. 53 of them are yellow, and there are 67% more of those in purple. There are only 38% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?
🤖: Mark has 53 + 34 = 87 purple flowers and
87 + 37 = 64 green flowers.
In total, Mark has 53 + 87 + 64 = 204 flowers.
Conclusively: 204

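Correct answer for the above: about 195 (53 yellow + 88.51 purple + 53.77 green ≈ 195.28; the perturbed values do not produce whole-number counts)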
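
For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes both versions of the problem. It assumes a literal reading: purple = yellow × (1 + increase) and green = fraction × (yellow + purple). The function name `flower_counts` and its parameters are made up for illustration and are not part of any GSM-8K tooling.

```python
def flower_counts(yellow, purple_increase, green_fraction):
    """Return (purple, green, total) under a literal reading of the problem.

    Assumption: "X% more of those in purple" means purple = yellow * (1 + X),
    and "Y% as many green as yellow and purple" means
    green = Y * (yellow + purple). No rounding is applied.
    """
    purple = yellow * (1 + purple_increase)
    green = green_fraction * (yellow + purple)
    total = yellow + purple + green
    return purple, green, total


# Original GSM-8K values and the perturbed values from this post.
original = flower_counts(10, 0.80, 0.25)
perturbed = flower_counts(53, 0.67, 0.38)

for name, (purple, green, total) in [("original", original), ("perturbed", perturbed)]:
    print(f"{name}: purple={purple:.2f}, green={green:.2f}, total={total:.2f}")
```

Under this reading the original values give whole numbers at every step, while the perturbed values give fractional flower counts. Any integer answer for the perturbed version therefore depends on a rounding convention that the problem does not state.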

I'm not sure what the issue is here, but this is very disappointing. I'm going to load the model in a Colab notebook and try it myself.
