Update README.md
steampunque committed

README.md CHANGED
@@ -8,7 +8,54 @@ pinned: false
 license: apache-2.0
 short_description: llm benchmarks
 ---
+TESTS:
+KNOWLEDGE:
+TQA - TruthfulQA
+JEOPARDY - 100-question JEOPARDY quiz
+LANGUAGE:
+LAMBADA - Language Modeling Broadened to Account for Discourse Aspects
+UNDERSTANDING:
+WG - Winogrande
+BOOLQ - Boolean questions
+STORYCLOZE - Story questions
+OBQA - Open Book Question Answering
+SIQA - Social IQA
+RACE - Reading comprehension dataset from examinations
+MMLU - Massive Multitask Language Understanding
+MEDQA - Medical QA
+REASONING:
+CSQA - Commonsense Question Answering
+COPA - Choice of Plausible Alternatives
+HELLASWAG - Hella Situations With Adversarial Generations
+PIQA - Physical Interaction: Question Answering
+ARC - AI2 Reasoning Challenge
+AGIEVAL - AGIEval logiqa, lsat, sat
+AGIEVALC - AGIEval Gaokao SAT, logiqa, jec (Chinese)
+MUSR - Multistep Soft Reasoning
+COT:
+GSM8K - Grade School Math, CoT
+BBH - Beyond the Imitation Game Benchmark Hard, CoT
+MMLUPRO - Massive Multitask Language Understanding Pro, CoT
+AGIEVAL - AGIEval satmath, aquarat
+AGIEVALC - AGIEval mathcloze, mathqa (Chinese)
+MUSR - Multistep Soft Reasoning
+APPLE - 100 custom Apple questions
+CODE:
+HUMANEVAL - Python
+HUMANEVALP - Python, extended tests
+HUMANEVALX - Python, Java, JavaScript, C++
+MBPP - Python
+MBPPP - Python, extended tests
+CRUXEVAL - Python
+Use {TEST}FIM for the FIM (fill-in-the-middle) variant of a code test, e.g. HUMANEVAL -> HUMANEVALFIM.
+
+METHODOLOGY: All CoT tests are zero-shot.
+All multiple-choice (MC) tests issue two queries: the first with the answers in test order and the second with the answers circularly shifted by one position.
+An MC item is scored correct only if both queries are answered correctly.
+Winogrande is scored with logprob completion (evaluating the probability of a common completion under each of the two possible cases).
+Tests are run with a modified llama.cpp server (supporting a logprob completion mode) and/or a textsynth server where noted.
+
 TEST | MODELA | MODELB
 -----------|-----------|--------
-WINOGRANDE | 1.0 |
+WINOGRANDE | 1.0 | 0.5
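
The two-query multiple-choice scheme described in the METHODOLOGY lines above amounts to a small scoring loop. Here is a minimal sketch, assuming a hypothetical `ask(question, choices)` helper that returns the index of the model's chosen answer; the actual harness and its API are not part of this commit.

```python
def shift(choices, k=1):
    """Circularly shift the answer list right by k positions."""
    return choices[-k:] + choices[:-k]

def score_mc_item(ask, question, choices, correct_idx):
    """An item counts as correct only if both orderings are answered correctly."""
    # Query 1: answers in test order.
    ok1 = ask(question, choices) == correct_idx
    # Query 2: answers circularly shifted by one; after the shift,
    # the element at index i sits at index (i + 1) % len(choices).
    ok2 = ask(question, shift(choices, 1)) == (correct_idx + 1) % len(choices)
    return ok1 and ok2
```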
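
The Winogrande logprob-completion scoring (evaluating a common completion under each of the two possible sentence fillings) could look like the following sketch. `completion_logprob(prefix, completion)` is an assumed helper returning the summed token log-probabilities of `completion` given `prefix`; the commit mentions a modified llama.cpp server exposing such a mode, but its exact endpoint is not shown here.

```python
def score_winogrande(completion_logprob, sentence, option1, option2, answer):
    """Pick the option under which the shared continuation is more probable."""
    # Winogrande sentences contain a "_" placeholder; everything after it is
    # the common completion whose probability is compared for the two cases.
    prefix, common = sentence.split("_", 1)
    lp1 = completion_logprob(prefix + option1, common)
    lp2 = completion_logprob(prefix + option2, common)
    pick = option1 if lp1 > lp2 else option2
    return pick == answer
```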
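
For the {TEST}FIM variants, one plausible construction (an assumption; the commit does not specify the harness's FIM format) is to mask a middle span of the reference solution and ask the model to infill it. The `<PRE>`/`<SUF>`/`<MID>` sentinel tokens below are generic placeholders; real models use their own sentinels.

```python
def make_fim_prompt(code: str, start: int, end: int) -> tuple[str, str]:
    """Split code into prefix/middle/suffix and build a fill-in-the-middle prompt."""
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"
    return prompt, middle  # the model's output is checked against `middle`
```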