steampunque committed
Commit cccf104 · verified · 1 Parent(s): eaf030f

Update README.md

Files changed (1):
  1. README.md +48 -1

README.md CHANGED
@@ -8,7 +8,54 @@ pinned: false
 license: apache-2.0
 short_description: llm benchmarks
 ---
+ TESTS:
+ KNOWLEDGE:
+ TQA - Truthful QA
+ JEOPARDY - 100 Question JEOPARDY quiz
+ LANGUAGE:
+ LAMBADA - Language Modeling Broadened to Account for Discourse Aspects
+ UNDERSTANDING:
+ WG - Winogrande
+ BOOLQ - Boolean questions
+ STORYCLOZE - Story questions
+ OBQA - Open Book Question / Answer
+ SIQA - Social IQ
+ RACE - Reading comprehension dataset from examinations
+ MMLU - Massive Multitask Language Understanding
+ MEDQA - Medical QA
+ REASONING:
+ CSQA - Common Sense Question Answer
+ COPA - Choice of Plausible Alternatives
+ HELLASWAG - Hella Situations with Adversarial Generations
+ PIQA - Physical Interaction: Question Answering
+ ARC - AI2 Reasoning Challenge
+ AGIEVAL - AGIEval logiqa, lsat, sat
+ AGIEVALC - Gaokao SAT, logiqa, jec (Chinese)
+ MUSR - Multistep Soft Reasoning
+ COT:
+ GSM8K - Grade School Math CoT
+ BBH - Beyond the Imitation Game Benchmark Hard CoT
+ MMLUPRO - Massive Multitask Language Understanding Pro CoT
+ AGIEVAL - satmath, aquarat
+ AGIEVALC - mathcloze, mathqa (Chinese)
+ MUSR - Multistep Soft Reasoning
+ APPLE - 100 custom Apple Questions
+ CODE:
+ HUMANEVAL - Python
+ HUMANEVALP - Python, extended test
+ HUMANEVALX - Python, Java, Javascript, C++
+ MBPP - Python
+ MBPPP - Python, extended test
+ CRUXEVAL - Python
+ Use {TEST}FIM for the FIM (fill-in-the-middle) variant of a code test, e.g. HUMANEVAL->HUMANEVALFIM (a sketch follows the diff).
+
+ METHODOLOGY: All CoT tests are zero-shot.
+ All MC tests issue two queries: one with the answer choices in test order and a second with the choices circularly shifted by one position (see the scoring sketch after the diff).
+ To score an MC item as correct, both queries must be answered correctly.
+ Winogrande uses logprob completion: it evaluates the probability of a common completion under each of the two possible contexts (also sketched after the diff).
+ Tests are run using a modified llama.cpp server (supporting a logprob completion mode) and/or the textsynth server where noted.
+
 TEST | MODELA | MODELB
 -----------|-----------|--------
- WINOGRANDE | 1.0 | 2.0
+ WINOGRANDE | 1.0 | 0.5
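
The FIM note above references a sketch; here is a minimal illustration of how a fill-in-the-middle task can be derived from a code test by masking a middle span of the source. This is an assumption-laden sketch, not the harness's actual construction: the `<PRE>`/`<SUF>`/`<MID>` sentinels follow one common convention but real sentinel tokens are model-specific, and `make_fim_prompt` is a hypothetical helper.

```python
# Illustrative sketch: derive a FIM (fill-in-the-middle) task by masking a
# middle span of the source. Sentinel tokens are model-specific; the
# <PRE>/<SUF>/<MID> names used here are an assumed convention.

def make_fim_prompt(source, hole_start, hole_end,
                    pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Split source around [hole_start, hole_end) and build a FIM prompt."""
    prefix = source[:hole_start]          # code before the hole
    suffix = source[hole_end:]            # code after the hole
    target = source[hole_start:hole_end]  # span the model must reconstruct
    return f"{pre}{prefix}{suf}{suffix}{mid}", target

# Example: mask the body of a tiny function.
prompt, target = make_fim_prompt("def add(a, b):\n    return a + b\n", 19, 31)
# target == "return a + b"; the model is scored on reconstructing it.
```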
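
The MC scoring rule from METHODOLOGY, as a minimal sketch. `ask_model(question, choices)` is a hypothetical stand-in that returns the index of the model's chosen answer; it is not the harness's actual API.

```python
# Two-query MC scoring: an item counts as correct only if the model picks
# the right answer both in the original choice order and after the choices
# are circularly shifted by one position.

def shift_one(choices):
    """Circular shift by one: the last option moves to the front."""
    return choices[-1:] + choices[:-1]

def score_mc(question, choices, correct_idx, ask_model):
    """Return 1 if both orderings are answered correctly, else 0."""
    ok1 = ask_model(question, choices) == correct_idx
    # After the shift, the correct answer's index moves down one slot
    # (with wraparound).
    ok2 = ask_model(question, shift_one(choices)) == (correct_idx + 1) % len(choices)
    return int(ok1 and ok2)
```

Requiring both orderings penalizes position bias: a model that always answers the first choice scores zero under this rule rather than chance.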
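
And the Winogrande logprob-completion rule as a sketch. Each item yields two candidate contexts (the sentence with each option substituted) that share a common completion; the option whose context makes that completion more probable wins. `completion_logprob(context, completion)` is a hypothetical stand-in for the modified llama.cpp server's logprob completion mode.

```python
# Logprob-completion scoring for Winogrande: compare the log-probability of
# the shared completion under the two candidate contexts.

def score_winogrande(ctx1, ctx2, completion, answer, completion_logprob):
    """answer is 1 or 2; return 1 if the higher-probability context matches."""
    lp1 = completion_logprob(ctx1, completion)  # context with option 1
    lp2 = completion_logprob(ctx2, completion)  # context with option 2
    pick = 1 if lp1 > lp2 else 2
    return int(pick == answer)
```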