---
license: apache-2.0
tags:
- alignment-handbook
- generated_from_trainer
- juanako
- mistral
- UNA
datasets:
- HuggingFaceH4/ultrafeedback_binarized
model-index:
- name: juanako-7b-UNA
  results:
  - task:
      type: text-generation
      name: TruthfulQA (MC2)
    dataset:
      type: truthful_qa
      name: truthful_qa
      config: multiple_choice
      split: validation
    metrics:
    - type: accuracy
      value: 65.49
  - task:
      type: text-generation
      name: ARC-Challenge
    dataset:
      type: ai2_arc
      name: ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - type: accuracy
      value: 68.09
  - task:
      type: text-generation
      name: HellaSwag
    dataset:
      type: Rowan/hellaswag
      name: Rowan/hellaswag
      split: test
    metrics:
    - type: accuracy
      value: 85.20
  - task:
      type: text-generation
      name: GSM8k
    dataset:
      type: gsm8k
      name: gsm8k
      config: main
      split: test
    metrics:
    - type: accuracy
      value: 48.98
  - task:
      type: text-generation
      name: Winogrande
    dataset:
      type: winogrande
      name: winogrande
      config: winogrande_debiased
      split: test
    metrics:
    - type: accuracy
      value: 76.8
  - task:
      type: text-generation
      name: MMLU
    dataset:
      type: cais/mmlu
      name: cais/mmlu
      config: all
      split: test
    metrics:
    - type: accuracy
      value: 61.37
  - task:
      type: text-generation
      name: PiQA
    dataset:
      type: piqa
      name: piqa
      split: test
    metrics:
    - type: accuracy
      value: 83.57
  - task:
      type: text-generation
      name: DROP
    dataset:
      type: drop
      name: drop
      split: validation
    metrics:
    - type: accuracy
      value: 49.8
  - task:
      type: text-generation
      name: PubMedQA
    dataset:
      type: bigbio/pubmed_qa
      name: bigbio/pubmed_qa
      config: pubmed_qa_artificial_bigbio_qa
      split: validation
    metrics:
    - type: accuracy
      value: 76.0
---

# juanako-7b-UNA-v2

This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.
It outperforms most current Mistral-based models in many respects.

## Scoring and records (26 November 2023)
Here are some results:
* Scores #1 among 7B models
* Scores #4 in GSM8k
* Scores #2 in TruthfulQA
* Scores #6 in COPA
* Scores #2 in PiQA
* Scores #9 in BoolQ

Many evaluations were performed, and the model behaves in a well-balanced way across multiple fields. Feel free to submit more evaluation results.

It scores **65.1** according to the HuggingFace Open LLM Leaderboard.

## Model description

juanako uses UNA (Uniform Neural Alignment), a yet-to-be-published training technique that eases alignment between transformer layers.

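Until UNA is published, the model is used like any other `transformers` causal LM. Below is a minimal inference sketch; the repo id and the chat-template call are assumptions, not stated in the card:

```
# Minimal inference sketch; assumes transformers >= 4.35 and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fblgit/juanako-7b-UNA"  # assumed repo id for this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain what UNA stands for."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
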
## TruthfulQA 0-Shot
```
| Tasks |Version|Filter|Metric|Value | |Stderr|
|--------------|-------|------|------|-----:|---|-----:|
|truthfulqa_mc2|Yaml |none |acc |0.6549|± |0.0153|
```
## ARC 25-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|-------------|-------|------|--------|-----:|---|-----:|
|arc_challenge|Yaml |none |acc |0.6476|± |0.0140|
| | |none |acc_norm|0.6809|± |0.0136|
```
## HellaSwag 10-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|---------|-------|------|--------|-----:|---|-----:|
|hellaswag|Yaml |none |acc |0.6703|± |0.0047|
| | |none |acc_norm|0.8520|± |0.0035|
```
## GSM8k 5-Shot
```
|Tasks|Version| Filter | Metric |Value | |Stderr|
|-----|-------|----------|-----------|-----:|---|-----:|
|gsm8k|Yaml |get-answer|exact_match|0.4898|± |0.0138|
```
## GPT Evaluations 0-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|--------------|-------|------|----------|-----:|---|-----:|
|boolq |Yaml |none |acc |0.8703|± |0.0059|
|lambada_openai|Yaml |none |perplexity|3.2598|± |0.0705|
| | |none |acc |0.7336|± |0.0062|
|piqa |Yaml |none |acc |0.8254|± |0.0089|
| | |none |acc_norm |0.8292|± |0.0088|
|sciq |Yaml |none |acc |0.9580|± |0.0063|
| | |none |acc_norm |0.9130|± |0.0089|
```
## MathQA 0-Shot
```
|Tasks |Version|Filter| Metric |Value | |Stderr|
|------|-------|------|--------|-----:|---|-----:|
|mathqa|Yaml |none |acc |0.3752|± |0.0089|
| | |none |acc_norm|0.3772|± |0.0089|
```
## PiQA 1-Shot
```
|Tasks|Version|Filter| Metric |Value | |Stderr|
|-----|-------|------|--------|-----:|---|-----:|
|piqa |Yaml |none |acc |0.8308|± |0.0087|
| | |none |acc_norm|0.8357|± |0.0086|
```
## Winogrande 5-Shot
```
| Tasks |Version|Filter|Metric|Value| |Stderr|
|----------|-------|------|------|----:|---|-----:|
|winogrande|Yaml |none |acc |0.768|± |0.0119|
```
## PubMedQA 0-Shot
```
| Tasks |Version|Filter|Metric|Value| |Stderr|
|--------|-------|------|------|----:|---|-----:|
|pubmedqa|Yaml |none |acc | 0.76|± |0.0191|
```
## RACE 1-Shot
```
|Tasks|Version|Filter|Metric|Value | |Stderr|
|-----|-------|------|------|-----:|---|-----:|
|race |Yaml |none |acc |0.5282|± |0.0154|
```
## MMLU 5-Shot (8-Bit)
```
| Groups |Version|Filter|Metric|Value | |Stderr|
|------------------|-------|------|------|-----:|---|-----:|
|mmlu |N/A |none |acc |0.6137|± |0.1243|
| - humanities |N/A |none |acc |0.5671|± |0.1101|
| - other |N/A |none |acc |0.6859|± |0.1164|
| - social_sciences|N/A |none |acc |0.7195|± |0.0713|
| - stem |N/A |none |acc |0.5087|± |0.1297|
```
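
The tables above are EleutherAI lm-evaluation-harness output. A minimal sketch for reproducing one of the scores via the harness's Python API, assuming `lm-eval` >= 0.4 (exact numbers can drift across harness versions and hardware):

```
# Sketch: re-run TruthfulQA MC2 with lm-evaluation-harness (pip install lm-eval).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=fblgit/juanako-7b-UNA,dtype=bfloat16",
    tasks=["truthfulqa_mc2"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["truthfulqa_mc2"])  # the card reports acc 0.6549
```
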
## DROP 3-Shot (8-Bit) (Instruct-Eval)
```
{'score': 0.49801113762927607}
{'drop': 49.8}
drop: 49.8
```

## CRASS 0-Shot (Instruct-Eval)
```
{'score': 0.8357664233576643}
{'crass': 83.58}
crass: 83.58
```
### Training hyperparameters

The following hyperparameters were used during training (mirrored in the sketch after this list):
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 14
- gradient_accumulation_steps: 16
- total_train_batch_size: 224
- total_eval_batch_size: 14
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 1

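These settings map directly onto standard `transformers` `TrainingArguments`. This is a sketch only (the UNA trainer itself is unpublished); the output path is hypothetical and `bf16` is an assumption, since the card does not state the training dtype:

```
# Hypothetical mapping of the listed hyperparameters onto TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="juanako-7b-UNA",    # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=1,  # x 14 devices x 16 accumulation = 224 total
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,
    seed=42,
    bf16=True,                      # assumption; dtype not stated in the card
)
# Adam betas (0.9, 0.999) and epsilon 1e-08 are the transformers defaults.
```
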
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.4795 | 0.2 | 56 | 0.4958 | -1.3684 | -2.6385 | 0.7552 | 1.2701 | -265.3887 | -241.2612 | -2.2572 | -2.4922 |
| 0.4642 | 0.4 | 112 | 0.4859 | -1.0380 | -1.9769 | 0.7273 | 0.9389 | -258.7718 | -237.9569 | -2.2414 | -2.4751 |
| 0.4758 | 0.61 | 168 | 0.4808 | -1.2594 | -2.3704 | 0.7343 | 1.1110 | -262.7074 | -240.1708 | -2.2305 | -2.4633 |
| 0.4549 | 0.81 | 224 | 0.4768 | -1.1906 | -2.3201 | 0.7552 | 1.1295 | -262.2044 | -239.4827 | -2.2284 | -2.4610 |

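The reward and log-probability columns above are the metrics that trl's `DPOTrainer` logs during preference tuning. Purely as a hypothetical sketch under that assumption (UNA is unpublished and may use a different objective), a run over the binarized UltraFeedback preferences could look like this; `beta` and the dataset preprocessing are assumptions:

```
# Hypothetical preference-tuning sketch with trl's DPOTrainer; UNA may differ.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

base = "fblgit/juanako-7b-UNA-v2-phase-1"  # the stated starting checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
# NOTE: this split stores chosen/rejected as chat messages; a real run would
# first flatten them into plain `prompt`/`chosen`/`rejected` strings.

trainer = DPOTrainer(
    model,
    ref_model=None,      # trl derives a frozen reference copy when None
    beta=0.1,            # assumed; not stated in the card
    args=training_args,  # the TrainingArguments sketched above
    train_dataset=ds,
    tokenizer=tokenizer,
)
trainer.train()
```
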
### Framework versions

- Transformers 4.35.0-UNA
- Pytorch 2.1.0
- Datasets 2.14.6
- Tokenizers 0.14.1

## Citations

```
@misc{lin2021truthfulqa,
      title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
      author={Stephanie Lin and Jacob Hilton and Owain Evans},
      year={2021},
      eprint={2109.07958},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Author: [Xavier M.](mailto:[email protected]) @fblgit