stellaathena committed
Commit a7c07bf
1 Parent(s): a6259a0

Update README.md

Files changed (1):
  1. README.md +17 -11
README.md CHANGED
@@ -48,25 +48,31 @@ GPT-Neo was trained as an autoregressive language model. This means that its cor
 GPT-Neo was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending on your usecase GPT-Neo may produce socially unacceptable text. See Sections 5 and 6 of the Pile paper for a more detailed analysis of the biases in the Pile.
 
 As with all language models, it is hard to predict in advance how GPT-Neo will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
+
 ## Eval results
 
-### Language Modeling Baselines
+### Linguistic Reasoning
 
-EleutherAI is currently in the process of carrying out further evaluations of GPT-Neo. The following table should be considered a work-in-progress. If you would like to contribute evaluations you have done, please reach out on our Discord.
-
-| Model and Size | Pile BPB | Pile PPL | Wikitext PPL. |
-| ---------------- | ------------- | ------------- | -------------- |
-| **GPT-Neo 1.3B** | **0.7527** | **6.159** | **13.10** |
-| GPT-3 1.3B | ------ | ----- | ----- |
-| GPT-2 1.5B | 1.0468 | ----- | 17.48 |
-| GPT-Neo 2.7B | 0.7165 | 5.646 | 11.39 |
-| GPT-3 2.7B | 0.9631 | ----- | ----- |
-| GPT-3 175B | 0.7177 | ----- | ----- |
+| Model and Size | Pile BPB | Pile PPL | Wikitext PPL | Lambada PPL | Lambada Acc | Winogrande | Hellaswag |
+| ---------------- | ---------- | ---------- | ------------- | ----------- | ----------- | ---------- | ----------- |
+| **GPT-Neo 1.3B** | **0.7527** | **6.159** | **13.10** | **7.498** | **57.23%** | **55.01%** | **38.66%** |
+| GPT-2 1.5B | 1.0468 | ----- | 17.48 | 10.634 | 51.21% | 59.40% | 40.03% |
+| GPT-Neo 2.7B | 0.7165 | 5.646 | 11.39 | 5.626 | 62.22% | 56.50% | 42.73% |
+| GPT-3 Ada | 0.9631 | ----- | ----- | 9.954 | 51.60% | 52.90% | 35.93% |
 
-All GPT-2 and GPT-3 scores are from their respective papers, except for the Pile test results which are from the Pile paper.
+### Physical and Scientific Reasoning
+
+| Model and Size | MathQA | PubMedQA | Piqa |
+| ---------------- | ---------- | ---------- | ----------- |
+| **GPT-Neo 1.3B** | **24.05%** | **54.40%** | **71.11%** |
+| GPT-2 1.5B | 23.64% | 58.33% | 70.78% |
+| GPT-Neo 2.7B | 24.72% | 57.54% | 72.14% |
+| GPT-3 Ada | 24.29% | 52.80% | 68.88% |
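The evaluation tables report both Pile bits per byte (BPB) and perplexity (PPL). As a minimal sketch of how these metrics relate, assuming a hypothetical tokens-per-byte ratio (the true ratio depends on the tokenizer and corpus, and is not given here):

```python
import math

def perplexity(nll_nats_per_token: float) -> float:
    # Perplexity is exp of the mean per-token negative log-likelihood (in nats).
    return math.exp(nll_nats_per_token)

def bits_per_byte(nll_nats_per_token: float, tokens_per_byte: float) -> float:
    # Convert mean per-token nats to bits per byte: nats -> bits by dividing
    # by ln 2, and per-token -> per-byte via the tokens/byte ratio, which is
    # tokenizer- and corpus-dependent (0.25 below is an illustrative guess).
    return nll_nats_per_token * tokens_per_byte / math.log(2)

# A model with mean loss ln(6.159) nats/token has perplexity 6.159; its BPB
# additionally depends on the assumed tokens-per-byte ratio.
loss = math.log(6.159)
print(round(perplexity(loss), 3))           # 6.159
print(round(bits_per_byte(loss, 0.25), 4))  # with an assumed 0.25 tokens/byte
```

Because BPB normalizes by raw bytes rather than tokens, it allows comparison between models that use different tokenizers, whereas per-token perplexities are only directly comparable under the same tokenization.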
 
 ### Down-Stream Applications
 
+TBD
+
 ### BibTeX entry and citation info
 
 ```bibtex