CopyBench-leaderboard / results.csv
Tong Chen
Model,Inference,Literal Copying (%),(Non-literal) Event Copying (%),(Non-literal) Character Copying (%),Fact Recall (F1),(Literal) Fluency,(Non-literal) Fluency
gpt-35-turbo,greedy,2.023,1.356,1.412,36.090,3.496,4.340
gpt-4-turbo,greedy,0.440,2.948,4.509,41.909,3.940,4.667
llama2-13b-chat,greedy,0.000,0.226,0.565,17.187,3.922,4.202
llama2-13b,greedy,0.088,0.339,2.034,20.946,2.515,3.021
llama2-13b,memfree,0.000,0.339,2.034,20.946,2.552,3.050
llama2-13b,system_prompt,0.044,0.452,2.034,19.796,2.576,3.123
llama2-13b-vicuna,greedy,0.088,0.452,1.412,16.181,3.650,4.176
llama2-70b,greedy,2.419,4.011,10.339,30.123,2.804,3.347
llama2-70b,memfree,0.308,3.842,10.904,30.123,2.761,3.341
llama2-70b,system_prompt,2.595,4.746,11.525,29.903,2.751,3.356
llama2-70b-chat,greedy,0.132,0.734,1.130,21.190,3.628,4.156
llama2-7b,greedy,0.132,0.226,1.695,15.342,2.380,2.858
llama3-70b,greedy,10.510,6.893,15.593,39.982,2.708,3.207
llama3-70b,memfree,0.616,7.232,15.537,39.982,2.667,3.196
llama3-70b,system_prompt,11.038,5.932,15.028,39.924,2.736,3.276
llama3-70b-instruct,greedy,0.220,1.243,4.237,30.208,3.238,4.405
llama3-8b,greedy,0.176,2.316,4.463,18.609,2.577,2.737
mistral-7b,greedy,0.132,0.395,1.921,18.713,2.280,2.796
mixtral-8x7b,greedy,0.967,1.299,6.949,23.322,2.982,3.538
mixtral-8x7b-instruct,greedy,0.088,1.977,2.938,21.276,3.421,4.256
tulu2-13b,greedy,0.000,0.621,1.582,17.898,2.932,4.006
tulu2-13b-dpo,greedy,0.088,1.525,1.751,17.326,3.449,4.197
tulu2-70b,greedy,1.011,2.825,4.633,28.322,2.922,4.020
tulu2-70b,memfree,0.088,2.881,4.407,28.322,2.913,4.045
tulu2-70b,system_prompt,0.748,2.034,3.277,28.307,3.044,4.117
tulu2-70b-dpo,greedy,0.352,2.147,3.390,28.836,3.481,4.363