Update README.md
README.md
CHANGED
@@ -37,6 +37,64 @@ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-le
|MuSR (0-shot)     |11.15|
|MMLU-PRO (5-shot) |30.34|

+| Model |AGIEval|TruthfulQA|Bigbench|
+|--------------------------------------------------------------------------------|------:|---------:|-------:|
+|[llama3.1-8b-spaetzle-v90](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90)| 42.05| 57.2| 44.75|
+
+### AGIEval
+| Task |Version| Metric |Value| |Stderr|
+|------------------------------|------:|--------|----:|---|-----:|
+|agieval_aqua_rat | 0|acc |24.02|± | 2.69|
+| | |acc_norm|23.62|± | 2.67|
+|agieval_logiqa_en | 0|acc |40.09|± | 1.92|
+| | |acc_norm|39.78|± | 1.92|
+|agieval_lsat_ar | 0|acc |22.17|± | 2.75|
+| | |acc_norm|21.74|± | 2.73|
+|agieval_lsat_lr | 0|acc |50.39|± | 2.22|
+| | |acc_norm|45.29|± | 2.21|
+|agieval_lsat_rc | 0|acc |64.31|± | 2.93|
+| | |acc_norm|58.36|± | 3.01|
+|agieval_sat_en | 0|acc |81.07|± | 2.74|
+| | |acc_norm|73.79|± | 3.07|
+|agieval_sat_en_without_passage| 0|acc |45.15|± | 3.48|
+| | |acc_norm|38.83|± | 3.40|
+|agieval_sat_math | 0|acc |40.91|± | 3.32|
+| | |acc_norm|35.00|± | 3.22|
+
+Average: 42.05%
+
+### TruthfulQA
+| Task |Version|Metric|Value| |Stderr|
+|-------------|------:|------|----:|---|-----:|
+|truthfulqa_mc| 1|mc1 |39.66|± | 1.71|
+| | |mc2 |57.20|± | 1.51|
+
+Average: 57.2%
+
+### Bigbench
+| Task |Version| Metric |Value| |Stderr|
+|------------------------------------------------|------:|---------------------|----:|---|-----:|
+|bigbench_causal_judgement | 0|multiple_choice_grade|58.42|± | 3.59|
+|bigbench_date_understanding | 0|multiple_choice_grade|70.46|± | 2.38|
+|bigbench_disambiguation_qa | 0|multiple_choice_grade|31.40|± | 2.89|
+|bigbench_geometric_shapes | 0|multiple_choice_grade|33.43|± | 2.49|
+| | |exact_str_match | 0.00|± | 0.00|
+|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|30.00|± | 2.05|
+|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|24.29|± | 1.62|
+|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|56.00|± | 2.87|
+|bigbench_movie_recommendation | 0|multiple_choice_grade|38.20|± | 2.18|
+|bigbench_navigate | 0|multiple_choice_grade|50.20|± | 1.58|
+|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|69.50|± | 1.03|
+|bigbench_ruin_names | 0|multiple_choice_grade|54.46|± | 2.36|
+|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|32.77|± | 1.49|
+|bigbench_snarks | 0|multiple_choice_grade|65.19|± | 3.55|
+|bigbench_sports_understanding | 0|multiple_choice_grade|50.30|± | 1.59|
+|bigbench_temporal_sequences | 0|multiple_choice_grade|45.70|± | 1.58|
+|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.08|± | 1.17|
+|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.03|± | 0.90|
+|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|56.00|± | 2.87|
+
+Average: 44.75%

# merge tree
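For reference, the per-suite averages in the added tables appear to be simple means of each subtask's headline metric: acc_norm for AGIEval, mc2 for TruthfulQA, and multiple_choice_grade for Bigbench (the exact_str_match row not contributing). Below is a minimal sketch of that check, assuming exactly that aggregation; the script is illustrative and not part of the model card itself.

```python
# Illustrative check of the per-suite averages reported above.
# Assumed aggregation (not stated in the card): mean of acc_norm for AGIEval,
# mc2 for TruthfulQA, mean of multiple_choice_grade for Bigbench.
# Values are copied from the tables in the diff.

agieval_acc_norm = [23.62, 39.78, 21.74, 45.29, 58.36, 73.79, 38.83, 35.00]
truthfulqa_mc2 = 57.20
bigbench_mcg = [
    58.42, 70.46, 31.40, 33.43, 30.00, 24.29, 56.00, 38.20, 50.20,
    69.50, 54.46, 32.77, 65.19, 50.30, 45.70, 22.08, 17.03, 56.00,
]

def mean(values):
    # Plain arithmetic mean over the per-task scores.
    return sum(values) / len(values)

print(f"AGIEval:    {mean(agieval_acc_norm):.2f}")  # ~42.05
print(f"TruthfulQA: {truthfulqa_mc2:.2f}")          # 57.20
print(f"Bigbench:   {mean(bigbench_mcg):.2f}")      # ~44.75
```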