taozi555 committed
Commit 4ea016f, 1 parent: 5b6e92e

Upload results_2024-04-22T00-31-29.434502.json

results_2024-04-22T00-31-29.434502.json ADDED
@@ -0,0 +1,3337 @@
+ {
+ "config_general": {
+ "lighteval_sha": "?",
+ "num_fewshot_seeds": 1,
+ "override_batch_size": -1,
+ "max_samples": null,
+ "job_id": "",
+ "start_time": 6506311.714952203,
+ "end_time": 6527604.5975109,
+ "total_evaluation_time_secondes": "21292.882558696903",
+ "model_name": "taozi555/llama3-Mirage-Walker-8b",
+ "model_sha": "f14b1a5faecce896e7f12c601756ed2aa3680cac",
+ "model_dtype": "torch.bfloat16",
+ "model_size": "15.08 GB",
+ "config": null
+ },
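The `start_time` and `end_time` values in `config_general` are too small to be Unix epoch timestamps, so they look like readings from a monotonic clock; that interpretation is an assumption, not something the file states. Under that assumption, here is a minimal sketch that checks the reported `total_evaluation_time_secondes` (stored as a string) against their difference. The helper name `elapsed_seconds` is hypothetical.

```python
# Values copied from "config_general" in this results file.
config_general = {
    "start_time": 6506311.714952203,
    "end_time": 6527604.5975109,
    "total_evaluation_time_secondes": "21292.882558696903",
}


def elapsed_seconds(start_time: float, end_time: float) -> float:
    """Return the wall-clock duration between two clock readings, in seconds."""
    return end_time - start_time


elapsed = elapsed_seconds(config_general["start_time"], config_general["end_time"])
# The reported duration is a string in the file, so convert before comparing.
reported = float(config_general["total_evaluation_time_secondes"])
hours = elapsed / 3600

print(f"elapsed={elapsed:.3f}s reported={reported:.3f}s hours={hours:.2f}")
```

The two figures agree, and the run took a little under six hours.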
+ "results": {
+ "leaderboard|arc:challenge|25": {
+ "acc": 0.5767918088737202,
+ "acc_stderr": 0.01443803622084802,
+ "acc_norm": 0.5810580204778157,
+ "acc_norm_stderr": 0.014418106953639013
+ },
+ "leaderboard|hellaswag|10": {
+ "acc": 0.6086436964748058,
+ "acc_stderr": 0.004870563921220625,
+ "acc_norm": 0.783608842859988,
+ "acc_norm_stderr": 0.004109423832097878
+ },
+ "leaderboard|mmlu:abstract_algebra|5": {
+ "acc": 0.43,
+ "acc_stderr": 0.049756985195624284
+ },
+ "leaderboard|mmlu:anatomy|5": {
+ "acc": 0.674074074074074,
+ "acc_stderr": 0.040491220417025055
+ },
+ "leaderboard|mmlu:astronomy|5": {
+ "acc": 0.7631578947368421,
+ "acc_stderr": 0.03459777606810536
+ },
+ "leaderboard|mmlu:business_ethics|5": {
+ "acc": 0.69,
+ "acc_stderr": 0.04648231987117316
+ },
+ "leaderboard|mmlu:clinical_knowledge|5": {
+ "acc": 0.7509433962264151,
+ "acc_stderr": 0.026616482980501704
+ },
+ "leaderboard|mmlu:college_biology|5": {
+ "acc": 0.8333333333333334,
+ "acc_stderr": 0.031164899666948617
+ },
+ "leaderboard|mmlu:college_chemistry|5": {
+ "acc": 0.48,
+ "acc_stderr": 0.050211673156867795
+ },
+ "leaderboard|mmlu:college_computer_science|5": {
+ "acc": 0.62,
+ "acc_stderr": 0.048783173121456316
+ },
+ "leaderboard|mmlu:college_mathematics|5": {
+ "acc": 0.4,
+ "acc_stderr": 0.04923659639173309
+ },
+ "leaderboard|mmlu:college_medicine|5": {
+ "acc": 0.6705202312138728,
+ "acc_stderr": 0.03583901754736412
+ },
+ "leaderboard|mmlu:college_physics|5": {
+ "acc": 0.45098039215686275,
+ "acc_stderr": 0.049512182523962625
+ },
+ "leaderboard|mmlu:computer_security|5": {
+ "acc": 0.79,
+ "acc_stderr": 0.04093601807403326
+ },
+ "leaderboard|mmlu:conceptual_physics|5": {
+ "acc": 0.6085106382978723,
+ "acc_stderr": 0.03190701242326812
+ },
+ "leaderboard|mmlu:econometrics|5": {
+ "acc": 0.5,
+ "acc_stderr": 0.047036043419179864
+ },
+ "leaderboard|mmlu:electrical_engineering|5": {
+ "acc": 0.6275862068965518,
+ "acc_stderr": 0.0402873153294756
+ },
+ "leaderboard|mmlu:elementary_mathematics|5": {
+ "acc": 0.47619047619047616,
+ "acc_stderr": 0.02572209706438853
+ },
+ "leaderboard|mmlu:formal_logic|5": {
+ "acc": 0.5476190476190477,
+ "acc_stderr": 0.044518079590553275
+ },
+ "leaderboard|mmlu:global_facts|5": {
+ "acc": 0.46,
+ "acc_stderr": 0.05009082659620332
+ },
+ "leaderboard|mmlu:high_school_biology|5": {
+ "acc": 0.8129032258064516,
+ "acc_stderr": 0.022185710092252252
+ },
+ "leaderboard|mmlu:high_school_chemistry|5": {
+ "acc": 0.5862068965517241,
+ "acc_stderr": 0.03465304488406795
+ },
+ "leaderboard|mmlu:high_school_computer_science|5": {
+ "acc": 0.72,
+ "acc_stderr": 0.04512608598542127
+ },
+ "leaderboard|mmlu:high_school_european_history|5": {
+ "acc": 0.7454545454545455,
+ "acc_stderr": 0.03401506715249039
+ },
+ "leaderboard|mmlu:high_school_geography|5": {
+ "acc": 0.8383838383838383,
+ "acc_stderr": 0.02622591986362928
+ },
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
+ "acc": 0.917098445595855,
+ "acc_stderr": 0.01989934131572178
+ },
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
+ "acc": 0.6717948717948717,
+ "acc_stderr": 0.02380763319865726
+ },
+ "leaderboard|mmlu:high_school_mathematics|5": {
+ "acc": 0.3888888888888889,
+ "acc_stderr": 0.029723278961476664
+ },
+ "leaderboard|mmlu:high_school_microeconomics|5": {
+ "acc": 0.7605042016806722,
+ "acc_stderr": 0.027722065493361252
+ },
+ "leaderboard|mmlu:high_school_physics|5": {
+ "acc": 0.4304635761589404,
+ "acc_stderr": 0.04042809961395634
+ },
+ "leaderboard|mmlu:high_school_psychology|5": {
+ "acc": 0.8642201834862385,
+ "acc_stderr": 0.014686907556340022
+ },
+ "leaderboard|mmlu:high_school_statistics|5": {
+ "acc": 0.5416666666666666,
+ "acc_stderr": 0.033981108902946366
+ },
+ "leaderboard|mmlu:high_school_us_history|5": {
+ "acc": 0.8578431372549019,
+ "acc_stderr": 0.02450980392156861
+ },
+ "leaderboard|mmlu:high_school_world_history|5": {
+ "acc": 0.8396624472573839,
+ "acc_stderr": 0.02388438092596567
+ },
+ "leaderboard|mmlu:human_aging|5": {
+ "acc": 0.695067264573991,
+ "acc_stderr": 0.030898610882477518
+ },
+ "leaderboard|mmlu:human_sexuality|5": {
+ "acc": 0.7786259541984732,
+ "acc_stderr": 0.03641297081313729
+ },
+ "leaderboard|mmlu:international_law|5": {
+ "acc": 0.8264462809917356,
+ "acc_stderr": 0.03457272836917669
+ },
+ "leaderboard|mmlu:jurisprudence|5": {
+ "acc": 0.8148148148148148,
+ "acc_stderr": 0.03755265865037181
+ },
+ "leaderboard|mmlu:logical_fallacies|5": {
+ "acc": 0.7484662576687117,
+ "acc_stderr": 0.034089978868575295
+ },
+ "leaderboard|mmlu:machine_learning|5": {
+ "acc": 0.5089285714285714,
+ "acc_stderr": 0.04745033255489123
+ },
+ "leaderboard|mmlu:management|5": {
+ "acc": 0.8349514563106796,
+ "acc_stderr": 0.036756688322331886
+ },
+ "leaderboard|mmlu:marketing|5": {
+ "acc": 0.8846153846153846,
+ "acc_stderr": 0.020930193185179326
+ },
+ "leaderboard|mmlu:medical_genetics|5": {
+ "acc": 0.81,
+ "acc_stderr": 0.03942772444036623
+ },
+ "leaderboard|mmlu:miscellaneous|5": {
+ "acc": 0.8454661558109834,
+ "acc_stderr": 0.012925773495095974
+ },
+ "leaderboard|mmlu:moral_disputes|5": {
+ "acc": 0.7254335260115607,
+ "acc_stderr": 0.024027745155265012
+ },
+ "leaderboard|mmlu:moral_scenarios|5": {
+ "acc": 0.41787709497206704,
+ "acc_stderr": 0.01649540063582008
+ },
+ "leaderboard|mmlu:nutrition|5": {
+ "acc": 0.7712418300653595,
+ "acc_stderr": 0.024051029739912248
+ },
+ "leaderboard|mmlu:philosophy|5": {
+ "acc": 0.7427652733118971,
+ "acc_stderr": 0.024826171289250888
+ },
+ "leaderboard|mmlu:prehistory|5": {
+ "acc": 0.7438271604938271,
+ "acc_stderr": 0.024288533637726095
+ },
+ "leaderboard|mmlu:professional_accounting|5": {
+ "acc": 0.5070921985815603,
+ "acc_stderr": 0.02982449855912901
+ },
+ "leaderboard|mmlu:professional_law|5": {
+ "acc": 0.4810951760104302,
+ "acc_stderr": 0.012761104871472658
+ },
+ "leaderboard|mmlu:professional_medicine|5": {
+ "acc": 0.75,
+ "acc_stderr": 0.026303648393696036
+ },
+ "leaderboard|mmlu:professional_psychology|5": {
+ "acc": 0.6993464052287581,
+ "acc_stderr": 0.01855063450295296
+ },
+ "leaderboard|mmlu:public_relations|5": {
+ "acc": 0.6545454545454545,
+ "acc_stderr": 0.04554619617541054
+ },
+ "leaderboard|mmlu:security_studies|5": {
+ "acc": 0.7510204081632653,
+ "acc_stderr": 0.027682979522960227
+ },
+ "leaderboard|mmlu:sociology|5": {
+ "acc": 0.8308457711442786,
+ "acc_stderr": 0.026508590656233268
+ },
+ "leaderboard|mmlu:us_foreign_policy|5": {
+ "acc": 0.86,
+ "acc_stderr": 0.03487350880197769
+ },
+ "leaderboard|mmlu:virology|5": {
+ "acc": 0.5120481927710844,
+ "acc_stderr": 0.03891364495835817
+ },
+ "leaderboard|mmlu:world_religions|5": {
+ "acc": 0.8245614035087719,
+ "acc_stderr": 0.029170885500727665
+ },
+ "leaderboard|truthfulqa:mc|0": {
+ "truthfulqa_mc1": 0.34761321909424725,
+ "truthfulqa_mc1_stderr": 0.016670769188897306,
+ "truthfulqa_mc2": 0.5156373144783575,
+ "truthfulqa_mc2_stderr": 0.015703082442877
+ },
+ "leaderboard|winogrande|5": {
+ "acc": 0.7529597474348856,
+ "acc_stderr": 0.012121402942855576
+ },
+ "leaderboard|gsm8k|5": {
+ "qem": 0.6724791508718726,
+ "qem_stderr": 0.012927102210426719
+ },
+ "leaderboard|mmlu:_average|5": {
+ "acc": 0.6801243622973331,
+ "acc_stderr": 0.03296281402260026
+ },
+ "all": {
+ "acc": 0.6784247317288568,
+ "acc_stderr": 0.03183850670621899,
+ "acc_norm": 0.6823334316689018,
+ "acc_norm_stderr": 0.009263765392868446,
+ "truthfulqa_mc1": 0.34761321909424725,
+ "truthfulqa_mc1_stderr": 0.016670769188897306,
+ "truthfulqa_mc2": 0.5156373144783575,
+ "truthfulqa_mc2_stderr": 0.015703082442877,
+ "qem": 0.6724791508718726,
+ "qem_stderr": 0.012927102210426719
+ }
+ },
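One common way to summarize results like these is a single headline average over the six benchmarks. The sketch below assumes the Open LLM Leaderboard convention of that period (ARC `acc_norm`, HellaSwag `acc_norm`, MMLU mean `acc`, TruthfulQA `mc2`, Winogrande `acc`, GSM8K `qem`). The choice of metrics and the helper name `leaderboard_average` are assumptions; this file does not record such an average.

```python
from statistics import mean

# One headline metric per benchmark, copied from the "results" block.
# Which metric represents each benchmark is an assumption (see lead-in).
headline_scores = {
    "arc:challenge (acc_norm)": 0.5810580204778157,
    "hellaswag (acc_norm)": 0.783608842859988,
    "mmlu:_average (acc)": 0.6801243622973331,
    "truthfulqa:mc (mc2)": 0.5156373144783575,
    "winogrande (acc)": 0.7529597474348856,
    "gsm8k (qem)": 0.6724791508718726,
}


def leaderboard_average(scores: dict[str, float]) -> float:
    """Unweighted mean of one headline metric per benchmark."""
    return mean(scores.values())


average = leaderboard_average(headline_scores)
print(f"headline average: {average:.4f} ({100 * average:.2f}%)")
```

Note that this differs from the `acc` value under `"all"`, which averages over every task that reports `acc`, so the 57 MMLU subsets dominate it.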
+ "versions": {
+ "leaderboard|arc:challenge|25": 0,
+ "leaderboard|gsm8k|5": 0,
+ "leaderboard|hellaswag|10": 0,
+ "leaderboard|mmlu:abstract_algebra|5": 0,
+ "leaderboard|mmlu:anatomy|5": 0,
+ "leaderboard|mmlu:astronomy|5": 0,
+ "leaderboard|mmlu:business_ethics|5": 0,
+ "leaderboard|mmlu:clinical_knowledge|5": 0,
+ "leaderboard|mmlu:college_biology|5": 0,
+ "leaderboard|mmlu:college_chemistry|5": 0,
+ "leaderboard|mmlu:college_computer_science|5": 0,
+ "leaderboard|mmlu:college_mathematics|5": 0,
+ "leaderboard|mmlu:college_medicine|5": 0,
+ "leaderboard|mmlu:college_physics|5": 0,
+ "leaderboard|mmlu:computer_security|5": 0,
+ "leaderboard|mmlu:conceptual_physics|5": 0,
+ "leaderboard|mmlu:econometrics|5": 0,
+ "leaderboard|mmlu:electrical_engineering|5": 0,
+ "leaderboard|mmlu:elementary_mathematics|5": 0,
+ "leaderboard|mmlu:formal_logic|5": 0,
+ "leaderboard|mmlu:global_facts|5": 0,
+ "leaderboard|mmlu:high_school_biology|5": 0,
+ "leaderboard|mmlu:high_school_chemistry|5": 0,
+ "leaderboard|mmlu:high_school_computer_science|5": 0,
+ "leaderboard|mmlu:high_school_european_history|5": 0,
+ "leaderboard|mmlu:high_school_geography|5": 0,
+ "leaderboard|mmlu:high_school_government_and_politics|5": 0,
+ "leaderboard|mmlu:high_school_macroeconomics|5": 0,
+ "leaderboard|mmlu:high_school_mathematics|5": 0,
+ "leaderboard|mmlu:high_school_microeconomics|5": 0,
+ "leaderboard|mmlu:high_school_physics|5": 0,
+ "leaderboard|mmlu:high_school_psychology|5": 0,
+ "leaderboard|mmlu:high_school_statistics|5": 0,
+ "leaderboard|mmlu:high_school_us_history|5": 0,
+ "leaderboard|mmlu:high_school_world_history|5": 0,
+ "leaderboard|mmlu:human_aging|5": 0,
+ "leaderboard|mmlu:human_sexuality|5": 0,
+ "leaderboard|mmlu:international_law|5": 0,
+ "leaderboard|mmlu:jurisprudence|5": 0,
+ "leaderboard|mmlu:logical_fallacies|5": 0,
+ "leaderboard|mmlu:machine_learning|5": 0,
+ "leaderboard|mmlu:management|5": 0,
+ "leaderboard|mmlu:marketing|5": 0,
+ "leaderboard|mmlu:medical_genetics|5": 0,
+ "leaderboard|mmlu:miscellaneous|5": 0,
+ "leaderboard|mmlu:moral_disputes|5": 0,
+ "leaderboard|mmlu:moral_scenarios|5": 0,
+ "leaderboard|mmlu:nutrition|5": 0,
+ "leaderboard|mmlu:philosophy|5": 0,
+ "leaderboard|mmlu:prehistory|5": 0,
+ "leaderboard|mmlu:professional_accounting|5": 0,
+ "leaderboard|mmlu:professional_law|5": 0,
+ "leaderboard|mmlu:professional_medicine|5": 0,
+ "leaderboard|mmlu:professional_psychology|5": 0,
+ "leaderboard|mmlu:public_relations|5": 0,
+ "leaderboard|mmlu:security_studies|5": 0,
+ "leaderboard|mmlu:sociology|5": 0,
+ "leaderboard|mmlu:us_foreign_policy|5": 0,
+ "leaderboard|mmlu:virology|5": 0,
+ "leaderboard|mmlu:world_religions|5": 0,
+ "leaderboard|truthfulqa:mc|0": 0,
+ "leaderboard|winogrande|5": 0
+ },
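The keys in `results` and `versions` follow a visible `suite|task|num_fewshot` pattern, where the task may carry a `:subset` suffix (for example `mmlu:anatomy`). A minimal sketch of splitting such a key is shown below; `TaskKey` and `parse_task_key` are hypothetical names for illustration, not part of any library.

```python
from typing import NamedTuple, Optional


class TaskKey(NamedTuple):
    suite: str
    task: str
    subset: Optional[str]
    num_fewshot: int


def parse_task_key(key: str) -> TaskKey:
    """Split a 'suite|task[:subset]|num_fewshot' key into its parts."""
    suite, task, shots = key.split("|")
    # partition() returns an empty subset when the task has no ':' suffix.
    name, _, subset = task.partition(":")
    return TaskKey(suite, name, subset or None, int(shots))


for key in ("leaderboard|arc:challenge|25", "leaderboard|gsm8k|5"):
    print(parse_task_key(key))
```

Keys under `config_tasks` omit the trailing few-shot count, so they would need a two-part split instead.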
+ "config_tasks": {
+ "leaderboard|arc:challenge": {
+ "name": "arc:challenge",
+ "prompt_function": "arc",
+ "hf_repo": "ai2_arc",
+ "hf_subset": "ARC-Challenge",
+ "metric": [
+ "loglikelihood_acc",
+ "loglikelihood_acc_norm_nospace"
+ ],
+ "hf_avail_splits": [
+ "train",
+ "test"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": null,
+ "few_shots_select": "random_sampling_from_train",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "arc"
+ ],
+ "original_num_docs": 1172,
+ "effective_num_docs": 1172,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|gsm8k": {
+ "name": "gsm8k",
+ "prompt_function": "gsm8k",
+ "hf_repo": "gsm8k",
+ "hf_subset": "main",
+ "metric": [
+ "quasi_exact_match_gsm8k"
+ ],
+ "hf_avail_splits": [
+ "train",
+ "test"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": null,
+ "few_shots_select": "random_sampling_from_train",
+ "generation_size": 256,
+ "stop_sequence": [
+ "Question:",
+ "Question",
+ ":"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard"
+ ],
+ "original_num_docs": 1319,
+ "effective_num_docs": 1319,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|hellaswag": {
+ "name": "hellaswag",
+ "prompt_function": "hellaswag_harness",
+ "hf_repo": "hellaswag",
+ "hf_subset": "default",
+ "metric": [
+ "loglikelihood_acc",
+ "loglikelihood_acc_norm"
+ ],
+ "hf_avail_splits": [
+ "train",
+ "test",
+ "validation"
+ ],
+ "evaluation_splits": [
+ "validation"
+ ],
+ "few_shots_split": null,
+ "few_shots_select": "random_sampling_from_train",
+ "generation_size": -1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard"
+ ],
+ "original_num_docs": 10042,
+ "effective_num_docs": 10042,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:abstract_algebra": {
+ "name": "mmlu:abstract_algebra",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "abstract_algebra",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
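The `effective_num_docs` field gives the sample size behind each reported accuracy. The `acc_stderr` values in this file appear consistent with the standard error of the mean of 0/1 scores using the sample (n - 1) variance; that is an inference from the numbers, not a documented statement about how the evaluation tool computes them. A minimal sketch with the hypothetical helper `accuracy_stderr`:

```python
import math


def accuracy_stderr(acc: float, num_docs: int) -> float:
    """Standard error of a mean of 0/1 scores, using the n - 1 sample variance.

    For binary scores the sample variance is n/(n-1) * p * (1 - p), so the
    standard error sqrt(variance / n) simplifies to sqrt(p * (1 - p) / (n - 1)).
    """
    return math.sqrt(acc * (1.0 - acc) / (num_docs - 1))


# mmlu:abstract_algebra: acc = 0.43 over effective_num_docs = 100.
stderr = accuracy_stderr(0.43, 100)
reported = 0.049756985195624284  # "acc_stderr" from the results block
print(f"computed={stderr:.6f} reported={reported:.6f}")
```

With only 100 questions, the standard error is about five percentage points, so small differences between models on this subset are not meaningful on their own.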
+ "leaderboard|mmlu:anatomy": {
+ "name": "mmlu:anatomy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "anatomy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 135,
+ "effective_num_docs": 135,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:astronomy": {
+ "name": "mmlu:astronomy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "astronomy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 152,
+ "effective_num_docs": 152,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:business_ethics": {
+ "name": "mmlu:business_ethics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "business_ethics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:clinical_knowledge": {
+ "name": "mmlu:clinical_knowledge",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "clinical_knowledge",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 265,
+ "effective_num_docs": 265,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_biology": {
+ "name": "mmlu:college_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 144,
+ "effective_num_docs": 144,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_chemistry": {
+ "name": "mmlu:college_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_computer_science": {
+ "name": "mmlu:college_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_mathematics": {
+ "name": "mmlu:college_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_medicine": {
+ "name": "mmlu:college_medicine",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_medicine",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 173,
+ "effective_num_docs": 173,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:college_physics": {
+ "name": "mmlu:college_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 102,
+ "effective_num_docs": 102,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:computer_security": {
+ "name": "mmlu:computer_security",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "computer_security",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 100,
+ "effective_num_docs": 100,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:conceptual_physics": {
+ "name": "mmlu:conceptual_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "conceptual_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 235,
+ "effective_num_docs": 235,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
+ "leaderboard|mmlu:econometrics": {
+ "name": "mmlu:econometrics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "econometrics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "leaderboard",
+ "mmlu"
+ ],
+ "original_num_docs": 114,
+ "effective_num_docs": 114,
+ "trust_dataset": true,
+ "must_remove_duplicate_docs": null
+ },
929
+ "leaderboard|mmlu:electrical_engineering": {
930
+ "name": "mmlu:electrical_engineering",
931
+ "prompt_function": "mmlu_harness",
932
+ "hf_repo": "lighteval/mmlu",
933
+ "hf_subset": "electrical_engineering",
934
+ "metric": [
935
+ "loglikelihood_acc"
936
+ ],
937
+ "hf_avail_splits": [
938
+ "auxiliary_train",
939
+ "test",
940
+ "validation",
941
+ "dev"
942
+ ],
943
+ "evaluation_splits": [
944
+ "test"
945
+ ],
946
+ "few_shots_split": "dev",
947
+ "few_shots_select": "sequential",
948
+ "generation_size": 1,
949
+ "stop_sequence": [
950
+ "\n"
951
+ ],
952
+ "output_regex": null,
953
+ "frozen": false,
954
+ "suite": [
955
+ "leaderboard",
956
+ "mmlu"
957
+ ],
958
+ "original_num_docs": 145,
959
+ "effective_num_docs": 145,
960
+ "trust_dataset": true,
961
+ "must_remove_duplicate_docs": null
962
+ },
963
+ "leaderboard|mmlu:elementary_mathematics": {
964
+ "name": "mmlu:elementary_mathematics",
965
+ "prompt_function": "mmlu_harness",
966
+ "hf_repo": "lighteval/mmlu",
967
+ "hf_subset": "elementary_mathematics",
968
+ "metric": [
969
+ "loglikelihood_acc"
970
+ ],
971
+ "hf_avail_splits": [
972
+ "auxiliary_train",
973
+ "test",
974
+ "validation",
975
+ "dev"
976
+ ],
977
+ "evaluation_splits": [
978
+ "test"
979
+ ],
980
+ "few_shots_split": "dev",
981
+ "few_shots_select": "sequential",
982
+ "generation_size": 1,
983
+ "stop_sequence": [
984
+ "\n"
985
+ ],
986
+ "output_regex": null,
987
+ "frozen": false,
988
+ "suite": [
989
+ "leaderboard",
990
+ "mmlu"
991
+ ],
992
+ "original_num_docs": 378,
993
+ "effective_num_docs": 378,
994
+ "trust_dataset": true,
995
+ "must_remove_duplicate_docs": null
996
+ },
997
+ "leaderboard|mmlu:formal_logic": {
998
+ "name": "mmlu:formal_logic",
999
+ "prompt_function": "mmlu_harness",
1000
+ "hf_repo": "lighteval/mmlu",
1001
+ "hf_subset": "formal_logic",
1002
+ "metric": [
1003
+ "loglikelihood_acc"
1004
+ ],
1005
+ "hf_avail_splits": [
1006
+ "auxiliary_train",
1007
+ "test",
1008
+ "validation",
1009
+ "dev"
1010
+ ],
1011
+ "evaluation_splits": [
1012
+ "test"
1013
+ ],
1014
+ "few_shots_split": "dev",
1015
+ "few_shots_select": "sequential",
1016
+ "generation_size": 1,
1017
+ "stop_sequence": [
1018
+ "\n"
1019
+ ],
1020
+ "output_regex": null,
1021
+ "frozen": false,
1022
+ "suite": [
1023
+ "leaderboard",
1024
+ "mmlu"
1025
+ ],
1026
+ "original_num_docs": 126,
1027
+ "effective_num_docs": 126,
1028
+ "trust_dataset": true,
1029
+ "must_remove_duplicate_docs": null
1030
+ },
1031
+ "leaderboard|mmlu:global_facts": {
1032
+ "name": "mmlu:global_facts",
1033
+ "prompt_function": "mmlu_harness",
1034
+ "hf_repo": "lighteval/mmlu",
1035
+ "hf_subset": "global_facts",
1036
+ "metric": [
1037
+ "loglikelihood_acc"
1038
+ ],
1039
+ "hf_avail_splits": [
1040
+ "auxiliary_train",
1041
+ "test",
1042
+ "validation",
1043
+ "dev"
1044
+ ],
1045
+ "evaluation_splits": [
1046
+ "test"
1047
+ ],
1048
+ "few_shots_split": "dev",
1049
+ "few_shots_select": "sequential",
1050
+ "generation_size": 1,
1051
+ "stop_sequence": [
1052
+ "\n"
1053
+ ],
1054
+ "output_regex": null,
1055
+ "frozen": false,
1056
+ "suite": [
1057
+ "leaderboard",
1058
+ "mmlu"
1059
+ ],
1060
+ "original_num_docs": 100,
1061
+ "effective_num_docs": 100,
1062
+ "trust_dataset": true,
1063
+ "must_remove_duplicate_docs": null
1064
+ },
1065
+ "leaderboard|mmlu:high_school_biology": {
1066
+ "name": "mmlu:high_school_biology",
1067
+ "prompt_function": "mmlu_harness",
1068
+ "hf_repo": "lighteval/mmlu",
1069
+ "hf_subset": "high_school_biology",
1070
+ "metric": [
1071
+ "loglikelihood_acc"
1072
+ ],
1073
+ "hf_avail_splits": [
1074
+ "auxiliary_train",
1075
+ "test",
1076
+ "validation",
1077
+ "dev"
1078
+ ],
1079
+ "evaluation_splits": [
1080
+ "test"
1081
+ ],
1082
+ "few_shots_split": "dev",
1083
+ "few_shots_select": "sequential",
1084
+ "generation_size": 1,
1085
+ "stop_sequence": [
1086
+ "\n"
1087
+ ],
1088
+ "output_regex": null,
1089
+ "frozen": false,
1090
+ "suite": [
1091
+ "leaderboard",
1092
+ "mmlu"
1093
+ ],
1094
+ "original_num_docs": 310,
1095
+ "effective_num_docs": 310,
1096
+ "trust_dataset": true,
1097
+ "must_remove_duplicate_docs": null
1098
+ },
1099
+ "leaderboard|mmlu:high_school_chemistry": {
1100
+ "name": "mmlu:high_school_chemistry",
1101
+ "prompt_function": "mmlu_harness",
1102
+ "hf_repo": "lighteval/mmlu",
1103
+ "hf_subset": "high_school_chemistry",
1104
+ "metric": [
1105
+ "loglikelihood_acc"
1106
+ ],
1107
+ "hf_avail_splits": [
1108
+ "auxiliary_train",
1109
+ "test",
1110
+ "validation",
1111
+ "dev"
1112
+ ],
1113
+ "evaluation_splits": [
1114
+ "test"
1115
+ ],
1116
+ "few_shots_split": "dev",
1117
+ "few_shots_select": "sequential",
1118
+ "generation_size": 1,
1119
+ "stop_sequence": [
1120
+ "\n"
1121
+ ],
1122
+ "output_regex": null,
1123
+ "frozen": false,
1124
+ "suite": [
1125
+ "leaderboard",
1126
+ "mmlu"
1127
+ ],
1128
+ "original_num_docs": 203,
1129
+ "effective_num_docs": 203,
1130
+ "trust_dataset": true,
1131
+ "must_remove_duplicate_docs": null
1132
+ },
1133
+ "leaderboard|mmlu:high_school_computer_science": {
1134
+ "name": "mmlu:high_school_computer_science",
1135
+ "prompt_function": "mmlu_harness",
1136
+ "hf_repo": "lighteval/mmlu",
1137
+ "hf_subset": "high_school_computer_science",
1138
+ "metric": [
1139
+ "loglikelihood_acc"
1140
+ ],
1141
+ "hf_avail_splits": [
1142
+ "auxiliary_train",
1143
+ "test",
1144
+ "validation",
1145
+ "dev"
1146
+ ],
1147
+ "evaluation_splits": [
1148
+ "test"
1149
+ ],
1150
+ "few_shots_split": "dev",
1151
+ "few_shots_select": "sequential",
1152
+ "generation_size": 1,
1153
+ "stop_sequence": [
1154
+ "\n"
1155
+ ],
1156
+ "output_regex": null,
1157
+ "frozen": false,
1158
+ "suite": [
1159
+ "leaderboard",
1160
+ "mmlu"
1161
+ ],
1162
+ "original_num_docs": 100,
1163
+ "effective_num_docs": 100,
1164
+ "trust_dataset": true,
1165
+ "must_remove_duplicate_docs": null
1166
+ },
1167
+ "leaderboard|mmlu:high_school_european_history": {
1168
+ "name": "mmlu:high_school_european_history",
1169
+ "prompt_function": "mmlu_harness",
1170
+ "hf_repo": "lighteval/mmlu",
1171
+ "hf_subset": "high_school_european_history",
1172
+ "metric": [
1173
+ "loglikelihood_acc"
1174
+ ],
1175
+ "hf_avail_splits": [
1176
+ "auxiliary_train",
1177
+ "test",
1178
+ "validation",
1179
+ "dev"
1180
+ ],
1181
+ "evaluation_splits": [
1182
+ "test"
1183
+ ],
1184
+ "few_shots_split": "dev",
1185
+ "few_shots_select": "sequential",
1186
+ "generation_size": 1,
1187
+ "stop_sequence": [
1188
+ "\n"
1189
+ ],
1190
+ "output_regex": null,
1191
+ "frozen": false,
1192
+ "suite": [
1193
+ "leaderboard",
1194
+ "mmlu"
1195
+ ],
1196
+ "original_num_docs": 165,
1197
+ "effective_num_docs": 165,
1198
+ "trust_dataset": true,
1199
+ "must_remove_duplicate_docs": null
1200
+ },
1201
+ "leaderboard|mmlu:high_school_geography": {
1202
+ "name": "mmlu:high_school_geography",
1203
+ "prompt_function": "mmlu_harness",
1204
+ "hf_repo": "lighteval/mmlu",
1205
+ "hf_subset": "high_school_geography",
1206
+ "metric": [
1207
+ "loglikelihood_acc"
1208
+ ],
1209
+ "hf_avail_splits": [
1210
+ "auxiliary_train",
1211
+ "test",
1212
+ "validation",
1213
+ "dev"
1214
+ ],
1215
+ "evaluation_splits": [
1216
+ "test"
1217
+ ],
1218
+ "few_shots_split": "dev",
1219
+ "few_shots_select": "sequential",
1220
+ "generation_size": 1,
1221
+ "stop_sequence": [
1222
+ "\n"
1223
+ ],
1224
+ "output_regex": null,
1225
+ "frozen": false,
1226
+ "suite": [
1227
+ "leaderboard",
1228
+ "mmlu"
1229
+ ],
1230
+ "original_num_docs": 198,
1231
+ "effective_num_docs": 198,
1232
+ "trust_dataset": true,
1233
+ "must_remove_duplicate_docs": null
1234
+ },
1235
+ "leaderboard|mmlu:high_school_government_and_politics": {
1236
+ "name": "mmlu:high_school_government_and_politics",
1237
+ "prompt_function": "mmlu_harness",
1238
+ "hf_repo": "lighteval/mmlu",
1239
+ "hf_subset": "high_school_government_and_politics",
1240
+ "metric": [
1241
+ "loglikelihood_acc"
1242
+ ],
1243
+ "hf_avail_splits": [
1244
+ "auxiliary_train",
1245
+ "test",
1246
+ "validation",
1247
+ "dev"
1248
+ ],
1249
+ "evaluation_splits": [
1250
+ "test"
1251
+ ],
1252
+ "few_shots_split": "dev",
1253
+ "few_shots_select": "sequential",
1254
+ "generation_size": 1,
1255
+ "stop_sequence": [
1256
+ "\n"
1257
+ ],
1258
+ "output_regex": null,
1259
+ "frozen": false,
1260
+ "suite": [
1261
+ "leaderboard",
1262
+ "mmlu"
1263
+ ],
1264
+ "original_num_docs": 193,
1265
+ "effective_num_docs": 193,
1266
+ "trust_dataset": true,
1267
+ "must_remove_duplicate_docs": null
1268
+ },
1269
+ "leaderboard|mmlu:high_school_macroeconomics": {
1270
+ "name": "mmlu:high_school_macroeconomics",
1271
+ "prompt_function": "mmlu_harness",
1272
+ "hf_repo": "lighteval/mmlu",
1273
+ "hf_subset": "high_school_macroeconomics",
1274
+ "metric": [
1275
+ "loglikelihood_acc"
1276
+ ],
1277
+ "hf_avail_splits": [
1278
+ "auxiliary_train",
1279
+ "test",
1280
+ "validation",
1281
+ "dev"
1282
+ ],
1283
+ "evaluation_splits": [
1284
+ "test"
1285
+ ],
1286
+ "few_shots_split": "dev",
1287
+ "few_shots_select": "sequential",
1288
+ "generation_size": 1,
1289
+ "stop_sequence": [
1290
+ "\n"
1291
+ ],
1292
+ "output_regex": null,
1293
+ "frozen": false,
1294
+ "suite": [
1295
+ "leaderboard",
1296
+ "mmlu"
1297
+ ],
1298
+ "original_num_docs": 390,
1299
+ "effective_num_docs": 390,
1300
+ "trust_dataset": true,
1301
+ "must_remove_duplicate_docs": null
1302
+ },
1303
+ "leaderboard|mmlu:high_school_mathematics": {
1304
+ "name": "mmlu:high_school_mathematics",
1305
+ "prompt_function": "mmlu_harness",
1306
+ "hf_repo": "lighteval/mmlu",
1307
+ "hf_subset": "high_school_mathematics",
1308
+ "metric": [
1309
+ "loglikelihood_acc"
1310
+ ],
1311
+ "hf_avail_splits": [
1312
+ "auxiliary_train",
1313
+ "test",
1314
+ "validation",
1315
+ "dev"
1316
+ ],
1317
+ "evaluation_splits": [
1318
+ "test"
1319
+ ],
1320
+ "few_shots_split": "dev",
1321
+ "few_shots_select": "sequential",
1322
+ "generation_size": 1,
1323
+ "stop_sequence": [
1324
+ "\n"
1325
+ ],
1326
+ "output_regex": null,
1327
+ "frozen": false,
1328
+ "suite": [
1329
+ "leaderboard",
1330
+ "mmlu"
1331
+ ],
1332
+ "original_num_docs": 270,
1333
+ "effective_num_docs": 270,
1334
+ "trust_dataset": true,
1335
+ "must_remove_duplicate_docs": null
1336
+ },
1337
+ "leaderboard|mmlu:high_school_microeconomics": {
1338
+ "name": "mmlu:high_school_microeconomics",
1339
+ "prompt_function": "mmlu_harness",
1340
+ "hf_repo": "lighteval/mmlu",
1341
+ "hf_subset": "high_school_microeconomics",
1342
+ "metric": [
1343
+ "loglikelihood_acc"
1344
+ ],
1345
+ "hf_avail_splits": [
1346
+ "auxiliary_train",
1347
+ "test",
1348
+ "validation",
1349
+ "dev"
1350
+ ],
1351
+ "evaluation_splits": [
1352
+ "test"
1353
+ ],
1354
+ "few_shots_split": "dev",
1355
+ "few_shots_select": "sequential",
1356
+ "generation_size": 1,
1357
+ "stop_sequence": [
1358
+ "\n"
1359
+ ],
1360
+ "output_regex": null,
1361
+ "frozen": false,
1362
+ "suite": [
1363
+ "leaderboard",
1364
+ "mmlu"
1365
+ ],
1366
+ "original_num_docs": 238,
1367
+ "effective_num_docs": 238,
1368
+ "trust_dataset": true,
1369
+ "must_remove_duplicate_docs": null
1370
+ },
1371
+ "leaderboard|mmlu:high_school_physics": {
1372
+ "name": "mmlu:high_school_physics",
1373
+ "prompt_function": "mmlu_harness",
1374
+ "hf_repo": "lighteval/mmlu",
1375
+ "hf_subset": "high_school_physics",
1376
+ "metric": [
1377
+ "loglikelihood_acc"
1378
+ ],
1379
+ "hf_avail_splits": [
1380
+ "auxiliary_train",
1381
+ "test",
1382
+ "validation",
1383
+ "dev"
1384
+ ],
1385
+ "evaluation_splits": [
1386
+ "test"
1387
+ ],
1388
+ "few_shots_split": "dev",
1389
+ "few_shots_select": "sequential",
1390
+ "generation_size": 1,
1391
+ "stop_sequence": [
1392
+ "\n"
1393
+ ],
1394
+ "output_regex": null,
1395
+ "frozen": false,
1396
+ "suite": [
1397
+ "leaderboard",
1398
+ "mmlu"
1399
+ ],
1400
+ "original_num_docs": 151,
1401
+ "effective_num_docs": 151,
1402
+ "trust_dataset": true,
1403
+ "must_remove_duplicate_docs": null
1404
+ },
1405
+ "leaderboard|mmlu:high_school_psychology": {
1406
+ "name": "mmlu:high_school_psychology",
1407
+ "prompt_function": "mmlu_harness",
1408
+ "hf_repo": "lighteval/mmlu",
1409
+ "hf_subset": "high_school_psychology",
1410
+ "metric": [
1411
+ "loglikelihood_acc"
1412
+ ],
1413
+ "hf_avail_splits": [
1414
+ "auxiliary_train",
1415
+ "test",
1416
+ "validation",
1417
+ "dev"
1418
+ ],
1419
+ "evaluation_splits": [
1420
+ "test"
1421
+ ],
1422
+ "few_shots_split": "dev",
1423
+ "few_shots_select": "sequential",
1424
+ "generation_size": 1,
1425
+ "stop_sequence": [
1426
+ "\n"
1427
+ ],
1428
+ "output_regex": null,
1429
+ "frozen": false,
1430
+ "suite": [
1431
+ "leaderboard",
1432
+ "mmlu"
1433
+ ],
1434
+ "original_num_docs": 545,
1435
+ "effective_num_docs": 545,
1436
+ "trust_dataset": true,
1437
+ "must_remove_duplicate_docs": null
1438
+ },
1439
+ "leaderboard|mmlu:high_school_statistics": {
1440
+ "name": "mmlu:high_school_statistics",
1441
+ "prompt_function": "mmlu_harness",
1442
+ "hf_repo": "lighteval/mmlu",
1443
+ "hf_subset": "high_school_statistics",
1444
+ "metric": [
1445
+ "loglikelihood_acc"
1446
+ ],
1447
+ "hf_avail_splits": [
1448
+ "auxiliary_train",
1449
+ "test",
1450
+ "validation",
1451
+ "dev"
1452
+ ],
1453
+ "evaluation_splits": [
1454
+ "test"
1455
+ ],
1456
+ "few_shots_split": "dev",
1457
+ "few_shots_select": "sequential",
1458
+ "generation_size": 1,
1459
+ "stop_sequence": [
1460
+ "\n"
1461
+ ],
1462
+ "output_regex": null,
1463
+ "frozen": false,
1464
+ "suite": [
1465
+ "leaderboard",
1466
+ "mmlu"
1467
+ ],
1468
+ "original_num_docs": 216,
1469
+ "effective_num_docs": 216,
1470
+ "trust_dataset": true,
1471
+ "must_remove_duplicate_docs": null
1472
+ },
1473
+ "leaderboard|mmlu:high_school_us_history": {
1474
+ "name": "mmlu:high_school_us_history",
1475
+ "prompt_function": "mmlu_harness",
1476
+ "hf_repo": "lighteval/mmlu",
1477
+ "hf_subset": "high_school_us_history",
1478
+ "metric": [
1479
+ "loglikelihood_acc"
1480
+ ],
1481
+ "hf_avail_splits": [
1482
+ "auxiliary_train",
1483
+ "test",
1484
+ "validation",
1485
+ "dev"
1486
+ ],
1487
+ "evaluation_splits": [
1488
+ "test"
1489
+ ],
1490
+ "few_shots_split": "dev",
1491
+ "few_shots_select": "sequential",
1492
+ "generation_size": 1,
1493
+ "stop_sequence": [
1494
+ "\n"
1495
+ ],
1496
+ "output_regex": null,
1497
+ "frozen": false,
1498
+ "suite": [
1499
+ "leaderboard",
1500
+ "mmlu"
1501
+ ],
1502
+ "original_num_docs": 204,
1503
+ "effective_num_docs": 204,
1504
+ "trust_dataset": true,
1505
+ "must_remove_duplicate_docs": null
1506
+ },
1507
+ "leaderboard|mmlu:high_school_world_history": {
1508
+ "name": "mmlu:high_school_world_history",
1509
+ "prompt_function": "mmlu_harness",
1510
+ "hf_repo": "lighteval/mmlu",
1511
+ "hf_subset": "high_school_world_history",
1512
+ "metric": [
1513
+ "loglikelihood_acc"
1514
+ ],
1515
+ "hf_avail_splits": [
1516
+ "auxiliary_train",
1517
+ "test",
1518
+ "validation",
1519
+ "dev"
1520
+ ],
1521
+ "evaluation_splits": [
1522
+ "test"
1523
+ ],
1524
+ "few_shots_split": "dev",
1525
+ "few_shots_select": "sequential",
1526
+ "generation_size": 1,
1527
+ "stop_sequence": [
1528
+ "\n"
1529
+ ],
1530
+ "output_regex": null,
1531
+ "frozen": false,
1532
+ "suite": [
1533
+ "leaderboard",
1534
+ "mmlu"
1535
+ ],
1536
+ "original_num_docs": 237,
1537
+ "effective_num_docs": 237,
1538
+ "trust_dataset": true,
1539
+ "must_remove_duplicate_docs": null
1540
+ },
1541
+ "leaderboard|mmlu:human_aging": {
1542
+ "name": "mmlu:human_aging",
1543
+ "prompt_function": "mmlu_harness",
1544
+ "hf_repo": "lighteval/mmlu",
1545
+ "hf_subset": "human_aging",
1546
+ "metric": [
1547
+ "loglikelihood_acc"
1548
+ ],
1549
+ "hf_avail_splits": [
1550
+ "auxiliary_train",
1551
+ "test",
1552
+ "validation",
1553
+ "dev"
1554
+ ],
1555
+ "evaluation_splits": [
1556
+ "test"
1557
+ ],
1558
+ "few_shots_split": "dev",
1559
+ "few_shots_select": "sequential",
1560
+ "generation_size": 1,
1561
+ "stop_sequence": [
1562
+ "\n"
1563
+ ],
1564
+ "output_regex": null,
1565
+ "frozen": false,
1566
+ "suite": [
1567
+ "leaderboard",
1568
+ "mmlu"
1569
+ ],
1570
+ "original_num_docs": 223,
1571
+ "effective_num_docs": 223,
1572
+ "trust_dataset": true,
1573
+ "must_remove_duplicate_docs": null
1574
+ },
1575
+ "leaderboard|mmlu:human_sexuality": {
1576
+ "name": "mmlu:human_sexuality",
1577
+ "prompt_function": "mmlu_harness",
1578
+ "hf_repo": "lighteval/mmlu",
1579
+ "hf_subset": "human_sexuality",
1580
+ "metric": [
1581
+ "loglikelihood_acc"
1582
+ ],
1583
+ "hf_avail_splits": [
1584
+ "auxiliary_train",
1585
+ "test",
1586
+ "validation",
1587
+ "dev"
1588
+ ],
1589
+ "evaluation_splits": [
1590
+ "test"
1591
+ ],
1592
+ "few_shots_split": "dev",
1593
+ "few_shots_select": "sequential",
1594
+ "generation_size": 1,
1595
+ "stop_sequence": [
1596
+ "\n"
1597
+ ],
1598
+ "output_regex": null,
1599
+ "frozen": false,
1600
+ "suite": [
1601
+ "leaderboard",
1602
+ "mmlu"
1603
+ ],
1604
+ "original_num_docs": 131,
1605
+ "effective_num_docs": 131,
1606
+ "trust_dataset": true,
1607
+ "must_remove_duplicate_docs": null
1608
+ },
1609
+ "leaderboard|mmlu:international_law": {
1610
+ "name": "mmlu:international_law",
1611
+ "prompt_function": "mmlu_harness",
1612
+ "hf_repo": "lighteval/mmlu",
1613
+ "hf_subset": "international_law",
1614
+ "metric": [
1615
+ "loglikelihood_acc"
1616
+ ],
1617
+ "hf_avail_splits": [
1618
+ "auxiliary_train",
1619
+ "test",
1620
+ "validation",
1621
+ "dev"
1622
+ ],
1623
+ "evaluation_splits": [
1624
+ "test"
1625
+ ],
1626
+ "few_shots_split": "dev",
1627
+ "few_shots_select": "sequential",
1628
+ "generation_size": 1,
1629
+ "stop_sequence": [
1630
+ "\n"
1631
+ ],
1632
+ "output_regex": null,
1633
+ "frozen": false,
1634
+ "suite": [
1635
+ "leaderboard",
1636
+ "mmlu"
1637
+ ],
1638
+ "original_num_docs": 121,
1639
+ "effective_num_docs": 121,
1640
+ "trust_dataset": true,
1641
+ "must_remove_duplicate_docs": null
1642
+ },
1643
+ "leaderboard|mmlu:jurisprudence": {
1644
+ "name": "mmlu:jurisprudence",
1645
+ "prompt_function": "mmlu_harness",
1646
+ "hf_repo": "lighteval/mmlu",
1647
+ "hf_subset": "jurisprudence",
1648
+ "metric": [
1649
+ "loglikelihood_acc"
1650
+ ],
1651
+ "hf_avail_splits": [
1652
+ "auxiliary_train",
1653
+ "test",
1654
+ "validation",
1655
+ "dev"
1656
+ ],
1657
+ "evaluation_splits": [
1658
+ "test"
1659
+ ],
1660
+ "few_shots_split": "dev",
1661
+ "few_shots_select": "sequential",
1662
+ "generation_size": 1,
1663
+ "stop_sequence": [
1664
+ "\n"
1665
+ ],
1666
+ "output_regex": null,
1667
+ "frozen": false,
1668
+ "suite": [
1669
+ "leaderboard",
1670
+ "mmlu"
1671
+ ],
1672
+ "original_num_docs": 108,
1673
+ "effective_num_docs": 108,
1674
+ "trust_dataset": true,
1675
+ "must_remove_duplicate_docs": null
1676
+ },
1677
+ "leaderboard|mmlu:logical_fallacies": {
1678
+ "name": "mmlu:logical_fallacies",
1679
+ "prompt_function": "mmlu_harness",
1680
+ "hf_repo": "lighteval/mmlu",
1681
+ "hf_subset": "logical_fallacies",
1682
+ "metric": [
1683
+ "loglikelihood_acc"
1684
+ ],
1685
+ "hf_avail_splits": [
1686
+ "auxiliary_train",
1687
+ "test",
1688
+ "validation",
1689
+ "dev"
1690
+ ],
1691
+ "evaluation_splits": [
1692
+ "test"
1693
+ ],
1694
+ "few_shots_split": "dev",
1695
+ "few_shots_select": "sequential",
1696
+ "generation_size": 1,
1697
+ "stop_sequence": [
1698
+ "\n"
1699
+ ],
1700
+ "output_regex": null,
1701
+ "frozen": false,
1702
+ "suite": [
1703
+ "leaderboard",
1704
+ "mmlu"
1705
+ ],
1706
+ "original_num_docs": 163,
1707
+ "effective_num_docs": 163,
1708
+ "trust_dataset": true,
1709
+ "must_remove_duplicate_docs": null
1710
+ },
1711
+ "leaderboard|mmlu:machine_learning": {
1712
+ "name": "mmlu:machine_learning",
1713
+ "prompt_function": "mmlu_harness",
1714
+ "hf_repo": "lighteval/mmlu",
1715
+ "hf_subset": "machine_learning",
1716
+ "metric": [
1717
+ "loglikelihood_acc"
1718
+ ],
1719
+ "hf_avail_splits": [
1720
+ "auxiliary_train",
1721
+ "test",
1722
+ "validation",
1723
+ "dev"
1724
+ ],
1725
+ "evaluation_splits": [
1726
+ "test"
1727
+ ],
1728
+ "few_shots_split": "dev",
1729
+ "few_shots_select": "sequential",
1730
+ "generation_size": 1,
1731
+ "stop_sequence": [
1732
+ "\n"
1733
+ ],
1734
+ "output_regex": null,
1735
+ "frozen": false,
1736
+ "suite": [
1737
+ "leaderboard",
1738
+ "mmlu"
1739
+ ],
1740
+ "original_num_docs": 112,
1741
+ "effective_num_docs": 112,
1742
+ "trust_dataset": true,
1743
+ "must_remove_duplicate_docs": null
1744
+ },
1745
+ "leaderboard|mmlu:management": {
1746
+ "name": "mmlu:management",
1747
+ "prompt_function": "mmlu_harness",
1748
+ "hf_repo": "lighteval/mmlu",
1749
+ "hf_subset": "management",
1750
+ "metric": [
1751
+ "loglikelihood_acc"
1752
+ ],
1753
+ "hf_avail_splits": [
1754
+ "auxiliary_train",
1755
+ "test",
1756
+ "validation",
1757
+ "dev"
1758
+ ],
1759
+ "evaluation_splits": [
1760
+ "test"
1761
+ ],
1762
+ "few_shots_split": "dev",
1763
+ "few_shots_select": "sequential",
1764
+ "generation_size": 1,
1765
+ "stop_sequence": [
1766
+ "\n"
1767
+ ],
1768
+ "output_regex": null,
1769
+ "frozen": false,
1770
+ "suite": [
1771
+ "leaderboard",
1772
+ "mmlu"
1773
+ ],
1774
+ "original_num_docs": 103,
1775
+ "effective_num_docs": 103,
1776
+ "trust_dataset": true,
1777
+ "must_remove_duplicate_docs": null
1778
+ },
1779
+ "leaderboard|mmlu:marketing": {
1780
+ "name": "mmlu:marketing",
1781
+ "prompt_function": "mmlu_harness",
1782
+ "hf_repo": "lighteval/mmlu",
1783
+ "hf_subset": "marketing",
1784
+ "metric": [
1785
+ "loglikelihood_acc"
1786
+ ],
1787
+ "hf_avail_splits": [
1788
+ "auxiliary_train",
1789
+ "test",
1790
+ "validation",
1791
+ "dev"
1792
+ ],
1793
+ "evaluation_splits": [
1794
+ "test"
1795
+ ],
1796
+ "few_shots_split": "dev",
1797
+ "few_shots_select": "sequential",
1798
+ "generation_size": 1,
1799
+ "stop_sequence": [
1800
+ "\n"
1801
+ ],
1802
+ "output_regex": null,
1803
+ "frozen": false,
1804
+ "suite": [
1805
+ "leaderboard",
1806
+ "mmlu"
1807
+ ],
1808
+ "original_num_docs": 234,
1809
+ "effective_num_docs": 234,
1810
+ "trust_dataset": true,
1811
+ "must_remove_duplicate_docs": null
1812
+ },
1813
+ "leaderboard|mmlu:medical_genetics": {
1814
+ "name": "mmlu:medical_genetics",
1815
+ "prompt_function": "mmlu_harness",
1816
+ "hf_repo": "lighteval/mmlu",
1817
+ "hf_subset": "medical_genetics",
1818
+ "metric": [
1819
+ "loglikelihood_acc"
1820
+ ],
1821
+ "hf_avail_splits": [
1822
+ "auxiliary_train",
1823
+ "test",
1824
+ "validation",
1825
+ "dev"
1826
+ ],
1827
+ "evaluation_splits": [
1828
+ "test"
1829
+ ],
1830
+ "few_shots_split": "dev",
1831
+ "few_shots_select": "sequential",
1832
+ "generation_size": 1,
1833
+ "stop_sequence": [
1834
+ "\n"
1835
+ ],
1836
+ "output_regex": null,
1837
+ "frozen": false,
1838
+ "suite": [
1839
+ "leaderboard",
1840
+ "mmlu"
1841
+ ],
1842
+ "original_num_docs": 100,
1843
+ "effective_num_docs": 100,
1844
+ "trust_dataset": true,
1845
+ "must_remove_duplicate_docs": null
1846
+ },
1847
+ "leaderboard|mmlu:miscellaneous": {
1848
+ "name": "mmlu:miscellaneous",
1849
+ "prompt_function": "mmlu_harness",
1850
+ "hf_repo": "lighteval/mmlu",
1851
+ "hf_subset": "miscellaneous",
1852
+ "metric": [
1853
+ "loglikelihood_acc"
1854
+ ],
1855
+ "hf_avail_splits": [
1856
+ "auxiliary_train",
1857
+ "test",
1858
+ "validation",
1859
+ "dev"
1860
+ ],
1861
+ "evaluation_splits": [
1862
+ "test"
1863
+ ],
1864
+ "few_shots_split": "dev",
1865
+ "few_shots_select": "sequential",
1866
+ "generation_size": 1,
1867
+ "stop_sequence": [
1868
+ "\n"
1869
+ ],
1870
+ "output_regex": null,
1871
+ "frozen": false,
1872
+ "suite": [
1873
+ "leaderboard",
1874
+ "mmlu"
1875
+ ],
1876
+ "original_num_docs": 783,
1877
+ "effective_num_docs": 783,
1878
+ "trust_dataset": true,
1879
+ "must_remove_duplicate_docs": null
1880
+ },
1881
+ "leaderboard|mmlu:moral_disputes": {
1882
+ "name": "mmlu:moral_disputes",
1883
+ "prompt_function": "mmlu_harness",
1884
+ "hf_repo": "lighteval/mmlu",
1885
+ "hf_subset": "moral_disputes",
1886
+ "metric": [
1887
+ "loglikelihood_acc"
1888
+ ],
1889
+ "hf_avail_splits": [
1890
+ "auxiliary_train",
1891
+ "test",
1892
+ "validation",
1893
+ "dev"
1894
+ ],
1895
+ "evaluation_splits": [
1896
+ "test"
1897
+ ],
1898
+ "few_shots_split": "dev",
1899
+ "few_shots_select": "sequential",
1900
+ "generation_size": 1,
1901
+ "stop_sequence": [
1902
+ "\n"
1903
+ ],
1904
+ "output_regex": null,
1905
+ "frozen": false,
1906
+ "suite": [
1907
+ "leaderboard",
1908
+ "mmlu"
1909
+ ],
1910
+ "original_num_docs": 346,
1911
+ "effective_num_docs": 346,
1912
+ "trust_dataset": true,
1913
+ "must_remove_duplicate_docs": null
1914
+ },
1915
+ "leaderboard|mmlu:moral_scenarios": {
1916
+ "name": "mmlu:moral_scenarios",
1917
+ "prompt_function": "mmlu_harness",
1918
+ "hf_repo": "lighteval/mmlu",
1919
+ "hf_subset": "moral_scenarios",
1920
+ "metric": [
1921
+ "loglikelihood_acc"
1922
+ ],
1923
+ "hf_avail_splits": [
1924
+ "auxiliary_train",
1925
+ "test",
1926
+ "validation",
1927
+ "dev"
1928
+ ],
1929
+ "evaluation_splits": [
1930
+ "test"
1931
+ ],
1932
+ "few_shots_split": "dev",
1933
+ "few_shots_select": "sequential",
1934
+ "generation_size": 1,
1935
+ "stop_sequence": [
1936
+ "\n"
1937
+ ],
1938
+ "output_regex": null,
1939
+ "frozen": false,
1940
+ "suite": [
1941
+ "leaderboard",
1942
+ "mmlu"
1943
+ ],
1944
+ "original_num_docs": 895,
1945
+ "effective_num_docs": 895,
1946
+ "trust_dataset": true,
1947
+ "must_remove_duplicate_docs": null
1948
+ },
1949
+ "leaderboard|mmlu:nutrition": {
1950
+ "name": "mmlu:nutrition",
1951
+ "prompt_function": "mmlu_harness",
1952
+ "hf_repo": "lighteval/mmlu",
1953
+ "hf_subset": "nutrition",
1954
+ "metric": [
1955
+ "loglikelihood_acc"
1956
+ ],
1957
+ "hf_avail_splits": [
1958
+ "auxiliary_train",
1959
+ "test",
1960
+ "validation",
1961
+ "dev"
1962
+ ],
1963
+ "evaluation_splits": [
1964
+ "test"
1965
+ ],
1966
+ "few_shots_split": "dev",
1967
+ "few_shots_select": "sequential",
1968
+ "generation_size": 1,
1969
+ "stop_sequence": [
1970
+ "\n"
1971
+ ],
1972
+ "output_regex": null,
1973
+ "frozen": false,
1974
+ "suite": [
1975
+ "leaderboard",
1976
+ "mmlu"
1977
+ ],
1978
+ "original_num_docs": 306,
1979
+ "effective_num_docs": 306,
1980
+ "trust_dataset": true,
1981
+ "must_remove_duplicate_docs": null
1982
+ },
1983
+ "leaderboard|mmlu:philosophy": {
1984
+ "name": "mmlu:philosophy",
1985
+ "prompt_function": "mmlu_harness",
1986
+ "hf_repo": "lighteval/mmlu",
1987
+ "hf_subset": "philosophy",
1988
+ "metric": [
1989
+ "loglikelihood_acc"
1990
+ ],
1991
+ "hf_avail_splits": [
1992
+ "auxiliary_train",
1993
+ "test",
1994
+ "validation",
1995
+ "dev"
1996
+ ],
1997
+ "evaluation_splits": [
1998
+ "test"
1999
+ ],
2000
+ "few_shots_split": "dev",
2001
+ "few_shots_select": "sequential",
2002
+ "generation_size": 1,
2003
+ "stop_sequence": [
2004
+ "\n"
2005
+ ],
2006
+ "output_regex": null,
2007
+ "frozen": false,
2008
+ "suite": [
2009
+ "leaderboard",
2010
+ "mmlu"
2011
+ ],
2012
+ "original_num_docs": 311,
2013
+ "effective_num_docs": 311,
2014
+ "trust_dataset": true,
2015
+ "must_remove_duplicate_docs": null
2016
+ },
2017
+ "leaderboard|mmlu:prehistory": {
2018
+ "name": "mmlu:prehistory",
2019
+ "prompt_function": "mmlu_harness",
2020
+ "hf_repo": "lighteval/mmlu",
2021
+ "hf_subset": "prehistory",
2022
+ "metric": [
2023
+ "loglikelihood_acc"
2024
+ ],
2025
+ "hf_avail_splits": [
2026
+ "auxiliary_train",
2027
+ "test",
2028
+ "validation",
2029
+ "dev"
2030
+ ],
2031
+ "evaluation_splits": [
2032
+ "test"
2033
+ ],
2034
+ "few_shots_split": "dev",
2035
+ "few_shots_select": "sequential",
2036
+ "generation_size": 1,
2037
+ "stop_sequence": [
2038
+ "\n"
2039
+ ],
2040
+ "output_regex": null,
2041
+ "frozen": false,
2042
+ "suite": [
2043
+ "leaderboard",
2044
+ "mmlu"
2045
+ ],
2046
+ "original_num_docs": 324,
2047
+ "effective_num_docs": 324,
2048
+ "trust_dataset": true,
2049
+ "must_remove_duplicate_docs": null
2050
+ },
2051
+ "leaderboard|mmlu:professional_accounting": {
2052
+ "name": "mmlu:professional_accounting",
2053
+ "prompt_function": "mmlu_harness",
2054
+ "hf_repo": "lighteval/mmlu",
2055
+ "hf_subset": "professional_accounting",
2056
+ "metric": [
2057
+ "loglikelihood_acc"
2058
+ ],
2059
+ "hf_avail_splits": [
2060
+ "auxiliary_train",
2061
+ "test",
2062
+ "validation",
2063
+ "dev"
2064
+ ],
2065
+ "evaluation_splits": [
2066
+ "test"
2067
+ ],
2068
+ "few_shots_split": "dev",
2069
+ "few_shots_select": "sequential",
2070
+ "generation_size": 1,
2071
+ "stop_sequence": [
2072
+ "\n"
2073
+ ],
2074
+ "output_regex": null,
2075
+ "frozen": false,
2076
+ "suite": [
2077
+ "leaderboard",
2078
+ "mmlu"
2079
+ ],
2080
+ "original_num_docs": 282,
2081
+ "effective_num_docs": 282,
2082
+ "trust_dataset": true,
2083
+ "must_remove_duplicate_docs": null
2084
+ },
2085
+ "leaderboard|mmlu:professional_law": {
2086
+ "name": "mmlu:professional_law",
2087
+ "prompt_function": "mmlu_harness",
2088
+ "hf_repo": "lighteval/mmlu",
2089
+ "hf_subset": "professional_law",
2090
+ "metric": [
2091
+ "loglikelihood_acc"
2092
+ ],
2093
+ "hf_avail_splits": [
2094
+ "auxiliary_train",
2095
+ "test",
2096
+ "validation",
2097
+ "dev"
2098
+ ],
2099
+ "evaluation_splits": [
2100
+ "test"
2101
+ ],
2102
+ "few_shots_split": "dev",
2103
+ "few_shots_select": "sequential",
2104
+ "generation_size": 1,
2105
+ "stop_sequence": [
2106
+ "\n"
2107
+ ],
2108
+ "output_regex": null,
2109
+ "frozen": false,
2110
+ "suite": [
2111
+ "leaderboard",
2112
+ "mmlu"
2113
+ ],
2114
+ "original_num_docs": 1534,
2115
+ "effective_num_docs": 1534,
2116
+ "trust_dataset": true,
2117
+ "must_remove_duplicate_docs": null
2118
+ },
2119
+ "leaderboard|mmlu:professional_medicine": {
2120
+ "name": "mmlu:professional_medicine",
2121
+ "prompt_function": "mmlu_harness",
2122
+ "hf_repo": "lighteval/mmlu",
2123
+ "hf_subset": "professional_medicine",
2124
+ "metric": [
2125
+ "loglikelihood_acc"
2126
+ ],
2127
+ "hf_avail_splits": [
2128
+ "auxiliary_train",
2129
+ "test",
2130
+ "validation",
2131
+ "dev"
2132
+ ],
2133
+ "evaluation_splits": [
2134
+ "test"
2135
+ ],
2136
+ "few_shots_split": "dev",
2137
+ "few_shots_select": "sequential",
2138
+ "generation_size": 1,
2139
+ "stop_sequence": [
2140
+ "\n"
2141
+ ],
2142
+ "output_regex": null,
2143
+ "frozen": false,
2144
+ "suite": [
2145
+ "leaderboard",
2146
+ "mmlu"
2147
+ ],
2148
+ "original_num_docs": 272,
2149
+ "effective_num_docs": 272,
2150
+ "trust_dataset": true,
2151
+ "must_remove_duplicate_docs": null
2152
+ },
2153
+ "leaderboard|mmlu:professional_psychology": {
2154
+ "name": "mmlu:professional_psychology",
2155
+ "prompt_function": "mmlu_harness",
2156
+ "hf_repo": "lighteval/mmlu",
2157
+ "hf_subset": "professional_psychology",
2158
+ "metric": [
2159
+ "loglikelihood_acc"
2160
+ ],
2161
+ "hf_avail_splits": [
2162
+ "auxiliary_train",
2163
+ "test",
2164
+ "validation",
2165
+ "dev"
2166
+ ],
2167
+ "evaluation_splits": [
2168
+ "test"
2169
+ ],
2170
+ "few_shots_split": "dev",
2171
+ "few_shots_select": "sequential",
2172
+ "generation_size": 1,
2173
+ "stop_sequence": [
2174
+ "\n"
2175
+ ],
2176
+ "output_regex": null,
2177
+ "frozen": false,
2178
+ "suite": [
2179
+ "leaderboard",
2180
+ "mmlu"
2181
+ ],
2182
+ "original_num_docs": 612,
2183
+ "effective_num_docs": 612,
2184
+ "trust_dataset": true,
2185
+ "must_remove_duplicate_docs": null
2186
+ },
2187
+ "leaderboard|mmlu:public_relations": {
2188
+ "name": "mmlu:public_relations",
2189
+ "prompt_function": "mmlu_harness",
2190
+ "hf_repo": "lighteval/mmlu",
2191
+ "hf_subset": "public_relations",
2192
+ "metric": [
2193
+ "loglikelihood_acc"
2194
+ ],
2195
+ "hf_avail_splits": [
2196
+ "auxiliary_train",
2197
+ "test",
2198
+ "validation",
2199
+ "dev"
2200
+ ],
2201
+ "evaluation_splits": [
2202
+ "test"
2203
+ ],
2204
+ "few_shots_split": "dev",
2205
+ "few_shots_select": "sequential",
2206
+ "generation_size": 1,
2207
+ "stop_sequence": [
2208
+ "\n"
2209
+ ],
2210
+ "output_regex": null,
2211
+ "frozen": false,
2212
+ "suite": [
2213
+ "leaderboard",
2214
+ "mmlu"
2215
+ ],
2216
+ "original_num_docs": 110,
2217
+ "effective_num_docs": 110,
2218
+ "trust_dataset": true,
2219
+ "must_remove_duplicate_docs": null
2220
+ },
2221
+ "leaderboard|mmlu:security_studies": {
2222
+ "name": "mmlu:security_studies",
2223
+ "prompt_function": "mmlu_harness",
2224
+ "hf_repo": "lighteval/mmlu",
2225
+ "hf_subset": "security_studies",
2226
+ "metric": [
2227
+ "loglikelihood_acc"
2228
+ ],
2229
+ "hf_avail_splits": [
2230
+ "auxiliary_train",
2231
+ "test",
2232
+ "validation",
2233
+ "dev"
2234
+ ],
2235
+ "evaluation_splits": [
2236
+ "test"
2237
+ ],
2238
+ "few_shots_split": "dev",
2239
+ "few_shots_select": "sequential",
2240
+ "generation_size": 1,
2241
+ "stop_sequence": [
2242
+ "\n"
2243
+ ],
2244
+ "output_regex": null,
2245
+ "frozen": false,
2246
+ "suite": [
2247
+ "leaderboard",
2248
+ "mmlu"
2249
+ ],
2250
+ "original_num_docs": 245,
2251
+ "effective_num_docs": 245,
2252
+ "trust_dataset": true,
2253
+ "must_remove_duplicate_docs": null
2254
+ },
2255
+ "leaderboard|mmlu:sociology": {
2256
+ "name": "mmlu:sociology",
2257
+ "prompt_function": "mmlu_harness",
2258
+ "hf_repo": "lighteval/mmlu",
2259
+ "hf_subset": "sociology",
2260
+ "metric": [
2261
+ "loglikelihood_acc"
2262
+ ],
2263
+ "hf_avail_splits": [
2264
+ "auxiliary_train",
2265
+ "test",
2266
+ "validation",
2267
+ "dev"
2268
+ ],
2269
+ "evaluation_splits": [
2270
+ "test"
2271
+ ],
2272
+ "few_shots_split": "dev",
2273
+ "few_shots_select": "sequential",
2274
+ "generation_size": 1,
2275
+ "stop_sequence": [
2276
+ "\n"
2277
+ ],
2278
+ "output_regex": null,
2279
+ "frozen": false,
2280
+ "suite": [
2281
+ "leaderboard",
2282
+ "mmlu"
2283
+ ],
2284
+ "original_num_docs": 201,
2285
+ "effective_num_docs": 201,
2286
+ "trust_dataset": true,
2287
+ "must_remove_duplicate_docs": null
2288
+ },
2289
+ "leaderboard|mmlu:us_foreign_policy": {
2290
+ "name": "mmlu:us_foreign_policy",
2291
+ "prompt_function": "mmlu_harness",
2292
+ "hf_repo": "lighteval/mmlu",
2293
+ "hf_subset": "us_foreign_policy",
2294
+ "metric": [
2295
+ "loglikelihood_acc"
2296
+ ],
2297
+ "hf_avail_splits": [
2298
+ "auxiliary_train",
2299
+ "test",
2300
+ "validation",
2301
+ "dev"
2302
+ ],
2303
+ "evaluation_splits": [
2304
+ "test"
2305
+ ],
2306
+ "few_shots_split": "dev",
2307
+ "few_shots_select": "sequential",
2308
+ "generation_size": 1,
2309
+ "stop_sequence": [
2310
+ "\n"
2311
+ ],
2312
+ "output_regex": null,
2313
+ "frozen": false,
2314
+ "suite": [
2315
+ "leaderboard",
2316
+ "mmlu"
2317
+ ],
2318
+ "original_num_docs": 100,
2319
+ "effective_num_docs": 100,
2320
+ "trust_dataset": true,
2321
+ "must_remove_duplicate_docs": null
2322
+ },
2323
+ "leaderboard|mmlu:virology": {
2324
+ "name": "mmlu:virology",
2325
+ "prompt_function": "mmlu_harness",
2326
+ "hf_repo": "lighteval/mmlu",
2327
+ "hf_subset": "virology",
2328
+ "metric": [
2329
+ "loglikelihood_acc"
2330
+ ],
2331
+ "hf_avail_splits": [
2332
+ "auxiliary_train",
2333
+ "test",
2334
+ "validation",
2335
+ "dev"
2336
+ ],
2337
+ "evaluation_splits": [
2338
+ "test"
2339
+ ],
2340
+ "few_shots_split": "dev",
2341
+ "few_shots_select": "sequential",
2342
+ "generation_size": 1,
2343
+ "stop_sequence": [
2344
+ "\n"
2345
+ ],
2346
+ "output_regex": null,
2347
+ "frozen": false,
2348
+ "suite": [
2349
+ "leaderboard",
2350
+ "mmlu"
2351
+ ],
2352
+ "original_num_docs": 166,
2353
+ "effective_num_docs": 166,
2354
+ "trust_dataset": true,
2355
+ "must_remove_duplicate_docs": null
2356
+ },
2357
+ "leaderboard|mmlu:world_religions": {
2358
+ "name": "mmlu:world_religions",
2359
+ "prompt_function": "mmlu_harness",
2360
+ "hf_repo": "lighteval/mmlu",
2361
+ "hf_subset": "world_religions",
2362
+ "metric": [
2363
+ "loglikelihood_acc"
2364
+ ],
2365
+ "hf_avail_splits": [
2366
+ "auxiliary_train",
2367
+ "test",
2368
+ "validation",
2369
+ "dev"
2370
+ ],
2371
+ "evaluation_splits": [
2372
+ "test"
2373
+ ],
2374
+ "few_shots_split": "dev",
2375
+ "few_shots_select": "sequential",
2376
+ "generation_size": 1,
2377
+ "stop_sequence": [
2378
+ "\n"
2379
+ ],
2380
+ "output_regex": null,
2381
+ "frozen": false,
2382
+ "suite": [
2383
+ "leaderboard",
2384
+ "mmlu"
2385
+ ],
2386
+ "original_num_docs": 171,
2387
+ "effective_num_docs": 171,
2388
+ "trust_dataset": true,
2389
+ "must_remove_duplicate_docs": null
2390
+ },
2391
+ "leaderboard|truthfulqa:mc": {
2392
+ "name": "truthfulqa:mc",
2393
+ "prompt_function": "truthful_qa_multiple_choice",
2394
+ "hf_repo": "truthful_qa",
2395
+ "hf_subset": "multiple_choice",
2396
+ "metric": [
2397
+ "truthfulqa_mc_metrics"
2398
+ ],
2399
+ "hf_avail_splits": [
2400
+ "validation"
2401
+ ],
2402
+ "evaluation_splits": [
2403
+ "validation"
2404
+ ],
2405
+ "few_shots_split": null,
2406
+ "few_shots_select": null,
2407
+ "generation_size": -1,
2408
+ "stop_sequence": [
2409
+ "\n"
2410
+ ],
2411
+ "output_regex": null,
2412
+ "frozen": false,
2413
+ "suite": [
2414
+ "leaderboard"
2415
+ ],
2416
+ "original_num_docs": 817,
2417
+ "effective_num_docs": 817,
2418
+ "trust_dataset": true,
2419
+ "must_remove_duplicate_docs": null
2420
+ },
2421
+ "leaderboard|winogrande": {
2422
+ "name": "winogrande",
2423
+ "prompt_function": "winogrande",
2424
+ "hf_repo": "winogrande",
2425
+ "hf_subset": "winogrande_xl",
2426
+ "metric": [
2427
+ "loglikelihood_acc"
2428
+ ],
2429
+ "hf_avail_splits": [
2430
+ "train",
2431
+ "test",
2432
+ "validation"
2433
+ ],
2434
+ "evaluation_splits": [
2435
+ "validation"
2436
+ ],
2437
+ "few_shots_split": null,
2438
+ "few_shots_select": "random_sampling",
2439
+ "generation_size": -1,
2440
+ "stop_sequence": [
2441
+ "\n"
2442
+ ],
2443
+ "output_regex": null,
2444
+ "frozen": false,
2445
+ "suite": [
2446
+ "leaderboard"
2447
+ ],
2448
+ "original_num_docs": 1267,
2449
+ "effective_num_docs": 1267,
2450
+ "trust_dataset": true,
2451
+ "must_remove_duplicate_docs": null
2452
+ }
2453
+ },
2454
+ "summary_tasks": {
2455
+ "leaderboard|arc:challenge|25": {
2456
+ "hashes": {
2457
+ "hash_examples": "17b0cae357c0259e",
2458
+ "hash_full_prompts": "4aeb23a740784b86",
2459
+ "hash_input_tokens": "6327b032f3de83c4",
2460
+ "hash_cont_tokens": "c77636140035b318"
2461
+ },
2462
+ "truncated": 0,
2463
+ "non_truncated": 1172,
2464
+ "padded": 4687,
2465
+ "non_padded": 0,
2466
+ "effective_few_shots": 25.0,
2467
+ "num_truncated_few_shots": 0
2468
+ },
2469
+ "leaderboard|hellaswag|10": {
2470
+ "hashes": {
2471
+ "hash_examples": "31985c805c3a737e",
2472
+ "hash_full_prompts": "3c2d3440e190b07b",
2473
+ "hash_input_tokens": "bb027c2cf1da51d3",
2474
+ "hash_cont_tokens": "2d70b9577ac439d0"
2475
+ },
2476
+ "truncated": 0,
2477
+ "non_truncated": 10042,
2478
+ "padded": 40105,
2479
+ "non_padded": 63,
2480
+ "effective_few_shots": 10.0,
2481
+ "num_truncated_few_shots": 0
2482
+ },
2483
+ "leaderboard|mmlu:abstract_algebra|5": {
2484
+ "hashes": {
2485
+ "hash_examples": "4c76229e00c9c0e9",
2486
+ "hash_full_prompts": "faefa0cccb952fe0",
2487
+ "hash_input_tokens": "c7100cded1fd23c7",
2488
+ "hash_cont_tokens": "a886b3552371a98b"
2489
+ },
2490
+ "truncated": 0,
2491
+ "non_truncated": 100,
2492
+ "padded": 400,
2493
+ "non_padded": 0,
2494
+ "effective_few_shots": 5.0,
2495
+ "num_truncated_few_shots": 0
2496
+ },
2497
+ "leaderboard|mmlu:anatomy|5": {
2498
+ "hashes": {
2499
+ "hash_examples": "6a1f8104dccbd33b",
2500
+ "hash_full_prompts": "eacd03e46972fa59",
2501
+ "hash_input_tokens": "66c3858c5e24e62f",
2502
+ "hash_cont_tokens": "9be31d13c42ead00"
2503
+ },
2504
+ "truncated": 0,
2505
+ "non_truncated": 135,
2506
+ "padded": 540,
2507
+ "non_padded": 0,
2508
+ "effective_few_shots": 5.0,
2509
+ "num_truncated_few_shots": 0
2510
+ },
2511
+ "leaderboard|mmlu:astronomy|5": {
2512
+ "hashes": {
2513
+ "hash_examples": "1302effa3a76ce4c",
2514
+ "hash_full_prompts": "826cacbdf1f6bfd0",
2515
+ "hash_input_tokens": "5c83cc7051903092",
2516
+ "hash_cont_tokens": "5da09bc77752f437"
2517
+ },
2518
+ "truncated": 0,
2519
+ "non_truncated": 152,
2520
+ "padded": 608,
2521
+ "non_padded": 0,
2522
+ "effective_few_shots": 5.0,
2523
+ "num_truncated_few_shots": 0
2524
+ },
2525
+ "leaderboard|mmlu:business_ethics|5": {
2526
+ "hashes": {
2527
+ "hash_examples": "03cb8bce5336419a",
2528
+ "hash_full_prompts": "518511169382ac39",
2529
+ "hash_input_tokens": "7aeea403244c4473",
2530
+ "hash_cont_tokens": "03b2ebbdc5224bb0"
2531
+ },
2532
+ "truncated": 0,
2533
+ "non_truncated": 100,
2534
+ "padded": 400,
2535
+ "non_padded": 0,
2536
+ "effective_few_shots": 5.0,
2537
+ "num_truncated_few_shots": 0
2538
+ },
2539
+ "leaderboard|mmlu:clinical_knowledge|5": {
2540
+ "hashes": {
2541
+ "hash_examples": "ffbb9c7b2be257f9",
2542
+ "hash_full_prompts": "0b07b0bc774fdfd9",
2543
+ "hash_input_tokens": "ec0c6a5f110eb99d",
2544
+ "hash_cont_tokens": "40dd7263ce5af5de"
2545
+ },
2546
+ "truncated": 0,
2547
+ "non_truncated": 265,
2548
+ "padded": 1060,
2549
+ "non_padded": 0,
2550
+ "effective_few_shots": 5.0,
2551
+ "num_truncated_few_shots": 0
2552
+ },
2553
+ "leaderboard|mmlu:college_biology|5": {
2554
+ "hashes": {
2555
+ "hash_examples": "3ee77f176f38eb8e",
2556
+ "hash_full_prompts": "22cbe0e8dabf98b1",
2557
+ "hash_input_tokens": "98495e6d43b43601",
2558
+ "hash_cont_tokens": "78048b26c5552ac3"
2559
+ },
2560
+ "truncated": 0,
2561
+ "non_truncated": 144,
2562
+ "padded": 576,
2563
+ "non_padded": 0,
2564
+ "effective_few_shots": 5.0,
2565
+ "num_truncated_few_shots": 0
2566
+ },
2567
+ "leaderboard|mmlu:college_chemistry|5": {
2568
+ "hashes": {
2569
+ "hash_examples": "ce61a69c46d47aeb",
2570
+ "hash_full_prompts": "9c1288940a4afb59",
2571
+ "hash_input_tokens": "6d15ae51e4fb0734",
2572
+ "hash_cont_tokens": "e27ea803720e4f81"
2573
+ },
2574
+ "truncated": 0,
2575
+ "non_truncated": 100,
2576
+ "padded": 400,
2577
+ "non_padded": 0,
2578
+ "effective_few_shots": 5.0,
2579
+ "num_truncated_few_shots": 0
2580
+ },
2581
+ "leaderboard|mmlu:college_computer_science|5": {
2582
+ "hashes": {
2583
+ "hash_examples": "32805b52d7d5daab",
2584
+ "hash_full_prompts": "9522781d0cdf1a43",
2585
+ "hash_input_tokens": "d067a9964676ea01",
2586
+ "hash_cont_tokens": "00f531b5784e741a"
2587
+ },
2588
+ "truncated": 0,
2589
+ "non_truncated": 100,
2590
+ "padded": 400,
2591
+ "non_padded": 0,
2592
+ "effective_few_shots": 5.0,
2593
+ "num_truncated_few_shots": 0
2594
+ },
2595
+ "leaderboard|mmlu:college_mathematics|5": {
2596
+ "hashes": {
2597
+ "hash_examples": "55da1a0a0bd33722",
2598
+ "hash_full_prompts": "72fe6f46a57e6ca4",
2599
+ "hash_input_tokens": "cd2d6c5695665f54",
2600
+ "hash_cont_tokens": "7a6c30f41cc94aa7"
2601
+ },
2602
+ "truncated": 0,
2603
+ "non_truncated": 100,
2604
+ "padded": 400,
2605
+ "non_padded": 0,
2606
+ "effective_few_shots": 5.0,
2607
+ "num_truncated_few_shots": 0
2608
+ },
2609
+ "leaderboard|mmlu:college_medicine|5": {
2610
+ "hashes": {
2611
+ "hash_examples": "c33e143163049176",
2612
+ "hash_full_prompts": "dee0989b2c8993f4",
2613
+ "hash_input_tokens": "976ce2b55b7907d5",
2614
+ "hash_cont_tokens": "5f84bdb85e243e5d"
2615
+ },
2616
+ "truncated": 0,
2617
+ "non_truncated": 173,
2618
+ "padded": 692,
2619
+ "non_padded": 0,
2620
+ "effective_few_shots": 5.0,
2621
+ "num_truncated_few_shots": 0
2622
+ },
2623
+ "leaderboard|mmlu:college_physics|5": {
2624
+ "hashes": {
2625
+ "hash_examples": "ebdab1cdb7e555df",
2626
+ "hash_full_prompts": "a1be6b64ea1948c3",
2627
+ "hash_input_tokens": "2bf98ac7bc989c60",
2628
+ "hash_cont_tokens": "f32a0cc41acb4bf8"
2629
+ },
2630
+ "truncated": 0,
2631
+ "non_truncated": 102,
2632
+ "padded": 408,
2633
+ "non_padded": 0,
2634
+ "effective_few_shots": 5.0,
2635
+ "num_truncated_few_shots": 0
2636
+ },
2637
+ "leaderboard|mmlu:computer_security|5": {
2638
+ "hashes": {
2639
+ "hash_examples": "a24fd7d08a560921",
2640
+ "hash_full_prompts": "01bc3fdfdefe67a4",
2641
+ "hash_input_tokens": "239fad08f7e25672",
2642
+ "hash_cont_tokens": "a886b3552371a98b"
2643
+ },
2644
+ "truncated": 0,
2645
+ "non_truncated": 100,
2646
+ "padded": 400,
2647
+ "non_padded": 0,
2648
+ "effective_few_shots": 5.0,
2649
+ "num_truncated_few_shots": 0
2650
+ },
2651
+ "leaderboard|mmlu:conceptual_physics|5": {
2652
+ "hashes": {
2653
+ "hash_examples": "8300977a79386993",
2654
+ "hash_full_prompts": "b39315a8ada3ca79",
2655
+ "hash_input_tokens": "8fd1fa091cf77da8",
2656
+ "hash_cont_tokens": "6408f70f3d9ada31"
2657
+ },
2658
+ "truncated": 0,
2659
+ "non_truncated": 235,
2660
+ "padded": 940,
2661
+ "non_padded": 0,
2662
+ "effective_few_shots": 5.0,
2663
+ "num_truncated_few_shots": 0
2664
+ },
2665
+ "leaderboard|mmlu:econometrics|5": {
2666
+ "hashes": {
2667
+ "hash_examples": "ddde36788a04a46f",
2668
+ "hash_full_prompts": "70bab37ca5fcc48f",
2669
+ "hash_input_tokens": "75797ac68b074a88",
2670
+ "hash_cont_tokens": "2fab100ce81d11e3"
2671
+ },
2672
+ "truncated": 0,
2673
+ "non_truncated": 114,
2674
+ "padded": 456,
2675
+ "non_padded": 0,
2676
+ "effective_few_shots": 5.0,
2677
+ "num_truncated_few_shots": 0
2678
+ },
2679
+ "leaderboard|mmlu:electrical_engineering|5": {
2680
+ "hashes": {
2681
+ "hash_examples": "acbc5def98c19b3f",
2682
+ "hash_full_prompts": "86a4747481c11c61",
2683
+ "hash_input_tokens": "d30b3949f1a869bc",
2684
+ "hash_cont_tokens": "e75df8f470aa4973"
2685
+ },
2686
+ "truncated": 0,
2687
+ "non_truncated": 145,
2688
+ "padded": 580,
2689
+ "non_padded": 0,
2690
+ "effective_few_shots": 5.0,
2691
+ "num_truncated_few_shots": 0
2692
+ },
2693
+ "leaderboard|mmlu:elementary_mathematics|5": {
2694
+ "hashes": {
2695
+ "hash_examples": "146e61d07497a9bd",
2696
+ "hash_full_prompts": "1fe56333735325fa",
2697
+ "hash_input_tokens": "b14ababf1fdaf847",
2698
+ "hash_cont_tokens": "4ea4b4978c1fb85a"
2699
+ },
2700
+ "truncated": 0,
2701
+ "non_truncated": 378,
2702
+ "padded": 1512,
2703
+ "non_padded": 0,
2704
+ "effective_few_shots": 5.0,
2705
+ "num_truncated_few_shots": 0
2706
+ },
2707
+ "leaderboard|mmlu:formal_logic|5": {
2708
+ "hashes": {
2709
+ "hash_examples": "8635216e1909a03f",
2710
+ "hash_full_prompts": "cc83c1ede45f974c",
2711
+ "hash_input_tokens": "0dee944c92ba09fd",
2712
+ "hash_cont_tokens": "bd7b90f7fcc6628b"
2713
+ },
2714
+ "truncated": 0,
2715
+ "non_truncated": 126,
2716
+ "padded": 504,
2717
+ "non_padded": 0,
2718
+ "effective_few_shots": 5.0,
2719
+ "num_truncated_few_shots": 0
2720
+ },
2721
+ "leaderboard|mmlu:global_facts|5": {
2722
+ "hashes": {
2723
+ "hash_examples": "30b315aa6353ee47",
2724
+ "hash_full_prompts": "3a2ec1e2785c69a5",
2725
+ "hash_input_tokens": "5ba3e5396bf746e6",
2726
+ "hash_cont_tokens": "a886b3552371a98b"
2727
+ },
2728
+ "truncated": 0,
2729
+ "non_truncated": 100,
2730
+ "padded": 400,
2731
+ "non_padded": 0,
2732
+ "effective_few_shots": 5.0,
2733
+ "num_truncated_few_shots": 0
2734
+ },
2735
+ "leaderboard|mmlu:high_school_biology|5": {
2736
+ "hashes": {
2737
+ "hash_examples": "c9136373af2180de",
2738
+ "hash_full_prompts": "27646a569cf2a6f8",
2739
+ "hash_input_tokens": "4f3e8567ca1086f0",
2740
+ "hash_cont_tokens": "d294ad795a4ba989"
2741
+ },
2742
+ "truncated": 0,
2743
+ "non_truncated": 310,
2744
+ "padded": 1240,
2745
+ "non_padded": 0,
2746
+ "effective_few_shots": 5.0,
2747
+ "num_truncated_few_shots": 0
2748
+ },
2749
+ "leaderboard|mmlu:high_school_chemistry|5": {
2750
+ "hashes": {
2751
+ "hash_examples": "b0661bfa1add6404",
2752
+ "hash_full_prompts": "6905c6ca76f7b2b7",
2753
+ "hash_input_tokens": "d06720f4af19fcde",
2754
+ "hash_cont_tokens": "208aff39cfca671a"
2755
+ },
2756
+ "truncated": 0,
2757
+ "non_truncated": 203,
2758
+ "padded": 812,
2759
+ "non_padded": 0,
2760
+ "effective_few_shots": 5.0,
2761
+ "num_truncated_few_shots": 0
2762
+ },
2763
+ "leaderboard|mmlu:high_school_computer_science|5": {
2764
+ "hashes": {
2765
+ "hash_examples": "80fc1d623a3d665f",
2766
+ "hash_full_prompts": "b80092241e8b6c06",
2767
+ "hash_input_tokens": "4b42a8ce6184222f",
2768
+ "hash_cont_tokens": "3b482b98e18c249b"
2769
+ },
2770
+ "truncated": 0,
2771
+ "non_truncated": 100,
2772
+ "padded": 400,
2773
+ "non_padded": 0,
2774
+ "effective_few_shots": 5.0,
2775
+ "num_truncated_few_shots": 0
2776
+ },
2777
+ "leaderboard|mmlu:high_school_european_history|5": {
2778
+ "hashes": {
2779
+ "hash_examples": "854da6e5af0fe1a1",
2780
+ "hash_full_prompts": "a3bc32a5dc022ce7",
2781
+ "hash_input_tokens": "9829b92f11e38c39",
2782
+ "hash_cont_tokens": "7b6f4c22b304c3cc"
2783
+ },
2784
+ "truncated": 0,
2785
+ "non_truncated": 165,
2786
+ "padded": 656,
2787
+ "non_padded": 4,
2788
+ "effective_few_shots": 5.0,
2789
+ "num_truncated_few_shots": 0
2790
+ },
2791
+ "leaderboard|mmlu:high_school_geography|5": {
2792
+ "hashes": {
2793
+ "hash_examples": "7dc963c7acd19ad8",
2794
+ "hash_full_prompts": "53f91beae305905d",
2795
+ "hash_input_tokens": "a6e83c8e9a37451f",
2796
+ "hash_cont_tokens": "1a85c9e696d91a66"
2797
+ },
2798
+ "truncated": 0,
2799
+ "non_truncated": 198,
2800
+ "padded": 792,
2801
+ "non_padded": 0,
2802
+ "effective_few_shots": 5.0,
2803
+ "num_truncated_few_shots": 0
2804
+ },
2805
+ "leaderboard|mmlu:high_school_government_and_politics|5": {
2806
+ "hashes": {
2807
+ "hash_examples": "1f675dcdebc9758f",
2808
+ "hash_full_prompts": "623fd7e3495f243f",
2809
+ "hash_input_tokens": "70d3312474815a5e",
2810
+ "hash_cont_tokens": "a47a4530b8790081"
2811
+ },
2812
+ "truncated": 0,
2813
+ "non_truncated": 193,
2814
+ "padded": 772,
2815
+ "non_padded": 0,
2816
+ "effective_few_shots": 5.0,
2817
+ "num_truncated_few_shots": 0
2818
+ },
2819
+ "leaderboard|mmlu:high_school_macroeconomics|5": {
2820
+ "hashes": {
2821
+ "hash_examples": "2fb32cf2d80f0b35",
2822
+ "hash_full_prompts": "378ac13c8abb6c5f",
2823
+ "hash_input_tokens": "f580d17a3214af15",
2824
+ "hash_cont_tokens": "e71e7c6acf44c3e5"
2825
+ },
2826
+ "truncated": 0,
2827
+ "non_truncated": 390,
2828
+ "padded": 1560,
2829
+ "non_padded": 0,
2830
+ "effective_few_shots": 5.0,
2831
+ "num_truncated_few_shots": 0
2832
+ },
2833
+ "leaderboard|mmlu:high_school_mathematics|5": {
2834
+ "hashes": {
2835
+ "hash_examples": "fd6646fdb5d58a1f",
2836
+ "hash_full_prompts": "14d34e0b34750627",
2837
+ "hash_input_tokens": "361a779f3e9723b0",
2838
+ "hash_cont_tokens": "0a886cdd21b224a6"
2839
+ },
2840
+ "truncated": 0,
2841
+ "non_truncated": 270,
2842
+ "padded": 1080,
2843
+ "non_padded": 0,
2844
+ "effective_few_shots": 5.0,
2845
+ "num_truncated_few_shots": 0
2846
+ },
2847
+ "leaderboard|mmlu:high_school_microeconomics|5": {
2848
+ "hashes": {
2849
+ "hash_examples": "2118f21f71d87d84",
2850
+ "hash_full_prompts": "9ac09e5d4da991c9",
2851
+ "hash_input_tokens": "b5bcfd3df743cee0",
2852
+ "hash_cont_tokens": "a5f61d5beba13cc2"
2853
+ },
2854
+ "truncated": 0,
2855
+ "non_truncated": 238,
2856
+ "padded": 952,
2857
+ "non_padded": 0,
2858
+ "effective_few_shots": 5.0,
2859
+ "num_truncated_few_shots": 0
2860
+ },
2861
+ "leaderboard|mmlu:high_school_physics|5": {
2862
+ "hashes": {
2863
+ "hash_examples": "dc3ce06378548565",
2864
+ "hash_full_prompts": "b4832a554d47d224",
2865
+ "hash_input_tokens": "4caf36cb75ba8552",
2866
+ "hash_cont_tokens": "c4135c191e57e8e6"
2867
+ },
2868
+ "truncated": 0,
2869
+ "non_truncated": 151,
2870
+ "padded": 604,
2871
+ "non_padded": 0,
2872
+ "effective_few_shots": 5.0,
2873
+ "num_truncated_few_shots": 0
2874
+ },
2875
+ "leaderboard|mmlu:high_school_psychology|5": {
2876
+ "hashes": {
2877
+ "hash_examples": "c8d1d98a40e11f2f",
2878
+ "hash_full_prompts": "1e8cd27064546274",
2879
+ "hash_input_tokens": "9f7a7525450c0b5b",
2880
+ "hash_cont_tokens": "287bec936450f9c6"
2881
+ },
2882
+ "truncated": 0,
2883
+ "non_truncated": 545,
2884
+ "padded": 2180,
2885
+ "non_padded": 0,
2886
+ "effective_few_shots": 5.0,
2887
+ "num_truncated_few_shots": 0
2888
+ },
2889
+ "leaderboard|mmlu:high_school_statistics|5": {
2890
+ "hashes": {
2891
+ "hash_examples": "666c8759b98ee4ff",
2892
+ "hash_full_prompts": "e05ab41077ec0afa",
2893
+ "hash_input_tokens": "dbb29057733d0628",
2894
+ "hash_cont_tokens": "7e446857c7d6d869"
2895
+ },
2896
+ "truncated": 0,
2897
+ "non_truncated": 216,
2898
+ "padded": 864,
2899
+ "non_padded": 0,
2900
+ "effective_few_shots": 5.0,
2901
+ "num_truncated_few_shots": 0
2902
+ },
2903
+ "leaderboard|mmlu:high_school_us_history|5": {
2904
+ "hashes": {
2905
+ "hash_examples": "95fef1c4b7d3f81e",
2906
+ "hash_full_prompts": "a4b275996a416b4a",
2907
+ "hash_input_tokens": "d2c8de257e0f76fa",
2908
+ "hash_cont_tokens": "8b827fc7dfd3c1c5"
2909
+ },
2910
+ "truncated": 0,
2911
+ "non_truncated": 204,
2912
+ "padded": 816,
2913
+ "non_padded": 0,
2914
+ "effective_few_shots": 5.0,
2915
+ "num_truncated_few_shots": 0
2916
+ },
2917
+ "leaderboard|mmlu:high_school_world_history|5": {
2918
+ "hashes": {
2919
+ "hash_examples": "7e5085b6184b0322",
2920
+ "hash_full_prompts": "8adf16361f0f320a",
2921
+ "hash_input_tokens": "c5e010d66997c529",
2922
+ "hash_cont_tokens": "74875ba92d6648af"
2923
+ },
2924
+ "truncated": 0,
2925
+ "non_truncated": 237,
2926
+ "padded": 948,
2927
+ "non_padded": 0,
2928
+ "effective_few_shots": 5.0,
2929
+ "num_truncated_few_shots": 0
2930
+ },
2931
+ "leaderboard|mmlu:human_aging|5": {
2932
+ "hashes": {
2933
+ "hash_examples": "c17333e7c7c10797",
2934
+ "hash_full_prompts": "918d91a3141aac4d",
2935
+ "hash_input_tokens": "05e6f5df9e81a997",
2936
+ "hash_cont_tokens": "ca87074f1dc39668"
2937
+ },
2938
+ "truncated": 0,
2939
+ "non_truncated": 223,
2940
+ "padded": 892,
2941
+ "non_padded": 0,
2942
+ "effective_few_shots": 5.0,
2943
+ "num_truncated_few_shots": 0
2944
+ },
2945
+ "leaderboard|mmlu:human_sexuality|5": {
2946
+ "hashes": {
2947
+ "hash_examples": "4edd1e9045df5e3d",
2948
+ "hash_full_prompts": "bcee39ecea32fcc8",
2949
+ "hash_input_tokens": "9604ec0f5616cd26",
2950
+ "hash_cont_tokens": "491a0ab53f54aeb9"
2951
+ },
2952
+ "truncated": 0,
2953
+ "non_truncated": 131,
2954
+ "padded": 524,
2955
+ "non_padded": 0,
2956
+ "effective_few_shots": 5.0,
2957
+ "num_truncated_few_shots": 0
2958
+ },
2959
+ "leaderboard|mmlu:international_law|5": {
2960
+ "hashes": {
2961
+ "hash_examples": "db2fa00d771a062a",
2962
+ "hash_full_prompts": "ffe12a3b5bf350c2",
2963
+ "hash_input_tokens": "727bb86160a250d9",
2964
+ "hash_cont_tokens": "8c75cab59d57904d"
2965
+ },
2966
+ "truncated": 0,
2967
+ "non_truncated": 121,
2968
+ "padded": 484,
2969
+ "non_padded": 0,
2970
+ "effective_few_shots": 5.0,
2971
+ "num_truncated_few_shots": 0
2972
+ },
2973
+ "leaderboard|mmlu:jurisprudence|5": {
2974
+ "hashes": {
2975
+ "hash_examples": "e956f86b124076fe",
2976
+ "hash_full_prompts": "b4293c3c08bebaf7",
2977
+ "hash_input_tokens": "013c7941768fda49",
2978
+ "hash_cont_tokens": "4c69d7671fa1ab1c"
2979
+ },
2980
+ "truncated": 0,
2981
+ "non_truncated": 108,
2982
+ "padded": 432,
2983
+ "non_padded": 0,
2984
+ "effective_few_shots": 5.0,
2985
+ "num_truncated_few_shots": 0
2986
+ },
2987
+ "leaderboard|mmlu:logical_fallacies|5": {
2988
+ "hashes": {
2989
+ "hash_examples": "956e0e6365ab79f1",
2990
+ "hash_full_prompts": "8c1b7733e98cbe81",
2991
+ "hash_input_tokens": "8e4f39d6d98efdc5",
2992
+ "hash_cont_tokens": "57e78d3d09b7db81"
2993
+ },
2994
+ "truncated": 0,
2995
+ "non_truncated": 163,
2996
+ "padded": 652,
2997
+ "non_padded": 0,
2998
+ "effective_few_shots": 5.0,
2999
+ "num_truncated_few_shots": 0
3000
+ },
3001
+ "leaderboard|mmlu:machine_learning|5": {
3002
+ "hashes": {
3003
+ "hash_examples": "397997cc6f4d581e",
3004
+ "hash_full_prompts": "24a206a1c639ab8d",
3005
+ "hash_input_tokens": "202eb581c240b8f3",
3006
+ "hash_cont_tokens": "8669a529b8d281b3"
3007
+ },
3008
+ "truncated": 0,
3009
+ "non_truncated": 112,
3010
+ "padded": 448,
3011
+ "non_padded": 0,
3012
+ "effective_few_shots": 5.0,
3013
+ "num_truncated_few_shots": 0
3014
+ },
3015
+ "leaderboard|mmlu:management|5": {
3016
+ "hashes": {
3017
+ "hash_examples": "2bcbe6f6ca63d740",
3018
+ "hash_full_prompts": "77e1c79d988beecc",
3019
+ "hash_input_tokens": "5349fe24ec6c3315",
3020
+ "hash_cont_tokens": "79499fecb18f1cb1"
3021
+ },
3022
+ "truncated": 0,
3023
+ "non_truncated": 103,
3024
+ "padded": 412,
3025
+ "non_padded": 0,
3026
+ "effective_few_shots": 5.0,
3027
+ "num_truncated_few_shots": 0
3028
+ },
3029
+ "leaderboard|mmlu:marketing|5": {
3030
+ "hashes": {
3031
+ "hash_examples": "8ddb20d964a1b065",
3032
+ "hash_full_prompts": "83cec2fa6b681d9d",
3033
+ "hash_input_tokens": "2d35adb4e63840cc",
3034
+ "hash_cont_tokens": "c5e9cd86b1a58fac"
3035
+ },
3036
+ "truncated": 0,
3037
+ "non_truncated": 234,
3038
+ "padded": 936,
3039
+ "non_padded": 0,
3040
+ "effective_few_shots": 5.0,
3041
+ "num_truncated_few_shots": 0
3042
+ },
3043
+ "leaderboard|mmlu:medical_genetics|5": {
3044
+ "hashes": {
3045
+ "hash_examples": "182a71f4763d2cea",
3046
+ "hash_full_prompts": "195eb7ff99749730",
3047
+ "hash_input_tokens": "012f4687f48a688b",
3048
+ "hash_cont_tokens": "a886b3552371a98b"
3049
+ },
3050
+ "truncated": 0,
3051
+ "non_truncated": 100,
3052
+ "padded": 400,
3053
+ "non_padded": 0,
3054
+ "effective_few_shots": 5.0,
3055
+ "num_truncated_few_shots": 0
3056
+ },
3057
+ "leaderboard|mmlu:miscellaneous|5": {
3058
+ "hashes": {
3059
+ "hash_examples": "4c404fdbb4ca57fc",
3060
+ "hash_full_prompts": "33539955c9a96851",
3061
+ "hash_input_tokens": "4089d35aa35d7c39",
3062
+ "hash_cont_tokens": "8578b82c42cc7026"
3063
+ },
3064
+ "truncated": 0,
3065
+ "non_truncated": 783,
3066
+ "padded": 3132,
3067
+ "non_padded": 0,
3068
+ "effective_few_shots": 5.0,
3069
+ "num_truncated_few_shots": 0
3070
+ },
3071
+ "leaderboard|mmlu:moral_disputes|5": {
3072
+ "hashes": {
3073
+ "hash_examples": "60cbd2baa3fea5c9",
3074
+ "hash_full_prompts": "009b7d0e7f819eff",
3075
+ "hash_input_tokens": "92852a9aaaa68ac1",
3076
+ "hash_cont_tokens": "26b0f808ec46464d"
3077
+ },
3078
+ "truncated": 0,
3079
+ "non_truncated": 346,
3080
+ "padded": 1384,
3081
+ "non_padded": 0,
3082
+ "effective_few_shots": 5.0,
3083
+ "num_truncated_few_shots": 0
3084
+ },
3085
+ "leaderboard|mmlu:moral_scenarios|5": {
3086
+ "hashes": {
3087
+ "hash_examples": "fd8b0431fbdd75ef",
3088
+ "hash_full_prompts": "f6e63c9fb9d3bff0",
3089
+ "hash_input_tokens": "05add168b9a55fbc",
3090
+ "hash_cont_tokens": "24ce197370bb5b07"
3091
+ },
3092
+ "truncated": 0,
3093
+ "non_truncated": 895,
3094
+ "padded": 3580,
3095
+ "non_padded": 0,
3096
+ "effective_few_shots": 5.0,
3097
+ "num_truncated_few_shots": 0
3098
+ },
3099
+ "leaderboard|mmlu:nutrition|5": {
3100
+ "hashes": {
3101
+ "hash_examples": "71e55e2b829b6528",
3102
+ "hash_full_prompts": "8294d5e3ad435377",
3103
+ "hash_input_tokens": "742231f73012b1e2",
3104
+ "hash_cont_tokens": "4745352f3c85c108"
3105
+ },
3106
+ "truncated": 0,
3107
+ "non_truncated": 306,
3108
+ "padded": 1224,
3109
+ "non_padded": 0,
3110
+ "effective_few_shots": 5.0,
3111
+ "num_truncated_few_shots": 0
3112
+ },
3113
+ "leaderboard|mmlu:philosophy|5": {
3114
+ "hashes": {
3115
+ "hash_examples": "a6d489a8d208fa4b",
3116
+ "hash_full_prompts": "db68c0f4503e4793",
3117
+ "hash_input_tokens": "cad5ce61a647bc46",
3118
+ "hash_cont_tokens": "8c34ab2fa65c3b6e"
3119
+ },
3120
+ "truncated": 0,
3121
+ "non_truncated": 311,
3122
+ "padded": 1244,
3123
+ "non_padded": 0,
3124
+ "effective_few_shots": 5.0,
3125
+ "num_truncated_few_shots": 0
3126
+ },
3127
+ "leaderboard|mmlu:prehistory|5": {
3128
+ "hashes": {
3129
+ "hash_examples": "6cc50f032a19acaa",
3130
+ "hash_full_prompts": "3972bcfa8c80e964",
3131
+ "hash_input_tokens": "32a29cc657790558",
3132
+ "hash_cont_tokens": "ab44396c679556f3"
3133
+ },
3134
+ "truncated": 0,
3135
+ "non_truncated": 324,
3136
+ "padded": 1296,
3137
+ "non_padded": 0,
3138
+ "effective_few_shots": 5.0,
3139
+ "num_truncated_few_shots": 0
3140
+ },
3141
+ "leaderboard|mmlu:professional_accounting|5": {
3142
+ "hashes": {
3143
+ "hash_examples": "50f57ab32f5f6cea",
3144
+ "hash_full_prompts": "25f0becc2483bd32",
3145
+ "hash_input_tokens": "cacacb04b2a59c5a",
3146
+ "hash_cont_tokens": "e3eb8866fd5dce77"
3147
+ },
3148
+ "truncated": 0,
3149
+ "non_truncated": 282,
3150
+ "padded": 1120,
3151
+ "non_padded": 8,
3152
+ "effective_few_shots": 5.0,
3153
+ "num_truncated_few_shots": 0
3154
+ },
3155
+ "leaderboard|mmlu:professional_law|5": {
3156
+ "hashes": {
3157
+ "hash_examples": "a8fdc85c64f4b215",
3158
+ "hash_full_prompts": "7a6f6c5706f00c7d",
3159
+ "hash_input_tokens": "4b463ba71a1b650f",
3160
+ "hash_cont_tokens": "2ae4ea5b043b942a"
3161
+ },
3162
+ "truncated": 0,
3163
+ "non_truncated": 1534,
3164
+ "padded": 6136,
3165
+ "non_padded": 0,
3166
+ "effective_few_shots": 5.0,
3167
+ "num_truncated_few_shots": 0
3168
+ },
3169
+ "leaderboard|mmlu:professional_medicine|5": {
3170
+ "hashes": {
3171
+ "hash_examples": "c373a28a3050a73a",
3172
+ "hash_full_prompts": "a74b6ac7c5c545d2",
3173
+ "hash_input_tokens": "b2744b569a6a32fc",
3174
+ "hash_cont_tokens": "fc82ad9eca8a7b98"
3175
+ },
3176
+ "truncated": 0,
3177
+ "non_truncated": 272,
3178
+ "padded": 1088,
3179
+ "non_padded": 0,
3180
+ "effective_few_shots": 5.0,
3181
+ "num_truncated_few_shots": 0
3182
+ },
3183
+ "leaderboard|mmlu:professional_psychology|5": {
3184
+ "hashes": {
3185
+ "hash_examples": "bf5254fe818356af",
3186
+ "hash_full_prompts": "c53fa139ec25f502",
3187
+ "hash_input_tokens": "3775c049ee940ea3",
3188
+ "hash_cont_tokens": "0cc4c9bd9df094ef"
3189
+ },
3190
+ "truncated": 0,
3191
+ "non_truncated": 612,
3192
+ "padded": 2448,
3193
+ "non_padded": 0,
3194
+ "effective_few_shots": 5.0,
3195
+ "num_truncated_few_shots": 0
3196
+ },
3197
+ "leaderboard|mmlu:public_relations|5": {
3198
+ "hashes": {
3199
+ "hash_examples": "b66d52e28e7d14e0",
3200
+ "hash_full_prompts": "55b5eff05aa6bf13",
3201
+ "hash_input_tokens": "be078a9672a35a48",
3202
+ "hash_cont_tokens": "680235f5ede0b353"
3203
+ },
3204
+ "truncated": 0,
3205
+ "non_truncated": 110,
3206
+ "padded": 440,
3207
+ "non_padded": 0,
3208
+ "effective_few_shots": 5.0,
3209
+ "num_truncated_few_shots": 0
3210
+ },
3211
+ "leaderboard|mmlu:security_studies|5": {
3212
+ "hashes": {
3213
+ "hash_examples": "514c14feaf000ad9",
3214
+ "hash_full_prompts": "6690ecdc054f7b0c",
3215
+ "hash_input_tokens": "3022dd1ffded02a9",
3216
+ "hash_cont_tokens": "2119792a6103cc24"
3217
+ },
3218
+ "truncated": 0,
3219
+ "non_truncated": 245,
3220
+ "padded": 980,
3221
+ "non_padded": 0,
3222
+ "effective_few_shots": 5.0,
3223
+ "num_truncated_few_shots": 0
3224
+ },
3225
+ "leaderboard|mmlu:sociology|5": {
3226
+ "hashes": {
3227
+ "hash_examples": "f6c9bc9d18c80870",
3228
+ "hash_full_prompts": "945fbdd091c72d64",
3229
+ "hash_input_tokens": "4762d7cdcc303fe1",
3230
+ "hash_cont_tokens": "2178ff937c0c1a29"
3231
+ },
3232
+ "truncated": 0,
3233
+ "non_truncated": 201,
3234
+ "padded": 804,
3235
+ "non_padded": 0,
3236
+ "effective_few_shots": 5.0,
3237
+ "num_truncated_few_shots": 0
3238
+ },
+ "leaderboard|mmlu:us_foreign_policy|5": {
+ "hashes": {
+ "hash_examples": "ed7b78629db6678f",
+ "hash_full_prompts": "ebba6ea6eca4ae53",
+ "hash_input_tokens": "880355a94d9fe5b1",
+ "hash_cont_tokens": "a886b3552371a98b"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 392,
+ "non_padded": 8,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:virology|5": {
+ "hashes": {
+ "hash_examples": "bc52ffdc3f9b994a",
+ "hash_full_prompts": "a2ee4984d6877fe3",
+ "hash_input_tokens": "65c8ea545351aa14",
+ "hash_cont_tokens": "ec5c187546c7c842"
+ },
+ "truncated": 0,
+ "non_truncated": 166,
+ "padded": 660,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|mmlu:world_religions|5": {
+ "hashes": {
+ "hash_examples": "ecdb4a4f94f62930",
+ "hash_full_prompts": "a89c8dddd1d8ced0",
+ "hash_input_tokens": "0d36fd4bf3b571e1",
+ "hash_cont_tokens": "65bc44ac97c3227a"
+ },
+ "truncated": 0,
+ "non_truncated": 171,
+ "padded": 684,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|truthfulqa:mc|0": {
+ "hashes": {
+ "hash_examples": "36a6d90e75d92d4a",
+ "hash_full_prompts": "8d9ca0a8bd458a1c",
+ "hash_input_tokens": "89f619d8a8d594e0",
+ "hash_cont_tokens": "8eaf3b80e9854172"
+ },
+ "truncated": 0,
+ "non_truncated": 817,
+ "padded": 9996,
+ "non_padded": 0,
+ "effective_few_shots": 0.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|winogrande|5": {
+ "hashes": {
+ "hash_examples": "087d5d1a1afd4c7b",
+ "hash_full_prompts": "35da55e47222e0e1",
+ "hash_input_tokens": "25973bc571721c55",
+ "hash_cont_tokens": "39be0da00f68561c"
+ },
+ "truncated": 0,
+ "non_truncated": 1267,
+ "padded": 2534,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "leaderboard|gsm8k|5": {
+ "hashes": {
+ "hash_examples": "0ed016e24e7512fd",
+ "hash_full_prompts": "f7ab209f6467841e",
+ "hash_input_tokens": "650eb62258948f16",
+ "hash_cont_tokens": "bd3608724a4cf68d"
+ },
+ "truncated": 1319,
+ "non_truncated": 0,
+ "padded": 487,
+ "non_padded": 832,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ }
+ },
+ "summary_general": {
+ "hashes": {
+ "hash_examples": "670666fa3a90ce5d",
+ "hash_full_prompts": "56c005e427046302",
+ "hash_input_tokens": "3d48c4bd6b9d4a57",
+ "hash_cont_tokens": "9c01009736bb767d"
+ },
+ "truncated": 1319,
+ "non_truncated": 27340,
+ "padded": 113953,
+ "non_padded": 919,
+ "num_truncated_few_shots": 0
+ }
+ }