---
license: apache-2.0
tags:
- alignment-handbook
- generated_from_trainer
- juanako
- mistral
- UNA
datasets:
- HuggingFaceH4/ultrafeedback_binarized
model-index:
- name: juanako-7b-UNA
  results:
  - task:
      type: text-generation
      name: TruthfulQA (MC2)
    dataset:
      type: truthful_qa
      name: truthful_qa
      config: multiple_choice
      split: validation
    metrics:
    - type: accuracy
      value: 65.49
  - task:
      type: text-generation
      name: ARC-Challenge
    dataset:
      type: ai2_arc
      name: ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - type: accuracy
      value: 68.09
  - task:
      type: text-generation
      name: HellaSwag
    dataset:
      type: Rowan/hellaswag
      name: Rowan/hellaswag
      split: test
    metrics:
    - type: accuracy
      value: 85.20
  - task:
      type: text-generation
      name: GSM8k
    dataset:
      type: gsm8k
      name: gsm8k
      config: main
      split: test
    metrics:
    - type: accuracy
      value: 48.98
  - task:
      type: text-generation
      name: Winogrande
    dataset:
      type: winogrande
      name: winogrande
      config: winogrande_debiased
      split: test
    metrics:
    - type: accuracy
      value: 76.8
  - task:
      type: text-generation
      name: MMLU
    dataset:
      type: cais/mmlu
      name: cais/mmlu
      config: all
      split: test
    metrics:
    - type: accuracy
      value: 61.37
  - task:
      type: text-generation
      name: PiQA
    dataset:
      type: piqa
      name: piqa
      split: test
    metrics:
    - type: accuracy
      value: 83.57
  - task:
      type: text-generation
      name: DROP
    dataset:
      type: drop
      name: drop
      split: validation
    metrics:
    - type: accuracy
      value: 49.8
  - task:
      type: text-generation
      name: PubMedQA
    dataset:
      type: bigbio/pubmed_qa
      name: bigbio/pubmed_qa
      config: pubmed_qa_artificial_bigbio_qa
      split: validation
    metrics:
    - type: accuracy
      value: 76.0
---

# juanako-7b-UNA-v2

This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.
In many respects it outperforms most current Mistral-based models.

## Scoring and records (26-November-2023)
Here are some results:
* Scores #1 among 7B models
* Scores #4 in GSM8k
* Scores #2 in TruthfulQA
* Scores #6 in CoPa
* Scores #2 in PiQA
* Scores #9 in BoolQ

Many evaluations were performed, and the model behaves in a balanced way across multiple fields. Feel free to submit more evaluation results.

It scores **65.1** according to the HuggingFace Open LLM Leaderboard.

## Model description

juanako uses UNA (Uniform Neural Alignment), a training technique that eases alignment between transformer layers and is yet to be published.
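
As a quick start, here is a minimal inference sketch using the standard `transformers` text-generation API. The repo id, dtype, and generation settings below are illustrative assumptions, not documented defaults:

```python
# Minimal sketch, assuming the standard transformers text-generation API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fblgit/juanako-7b-UNA"  # assumed repo id for this card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumed dtype; fp16 fits a 7B model on a 24 GB GPU
    device_map="auto",
)

prompt = "Explain the difference between acc and acc_norm in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```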

## TruthfulQA 0-Shot
```
| Tasks |Version|Filter|Metric|Value | |Stderr|
|--------------|-------|------|------|-----:|---|-----:|
|truthfulqa_mc2|Yaml |none |acc |0.6549|± |0.0153|
```
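
The tables in this and the following sections are in the output format of EleutherAI's lm-evaluation-harness. Assuming that harness produced them, a sketch like the following should reproduce the number above (harness version and arguments are assumptions):

```python
# Sketch of reproducing the TruthfulQA MC2 score with lm-evaluation-harness.
# Assumes lm-eval >= 0.4; task names and few-shot counts vary per section.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=fblgit/juanako-7b-UNA,dtype=float16",  # assumed repo id
    tasks=["truthfulqa_mc2"],
    num_fewshot=0,
)
print(results["results"]["truthfulqa_mc2"])
```

The other sections would swap `tasks` and `num_fewshot` accordingly, e.g. `arc_challenge` with 25 shots or `gsm8k` with 5.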

## ARC 25-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|-------------|-------|------|--------|-----:|---|-----:|
|arc_challenge|Yaml |none |acc |0.6476|± |0.0140|
| | |none |acc_norm|0.6809|± |0.0136|
```
## HellaSwag 10-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|---------|-------|------|--------|-----:|---|-----:|
|hellaswag|Yaml |none |acc |0.6703|± |0.0047|
| | |none |acc_norm|0.8520|± |0.0035|
```
## GSM8k 5-Shot
```
|Tasks|Version| Filter | Metric |Value | |Stderr|
|-----|-------|----------|-----------|-----:|---|-----:|
|gsm8k|Yaml |get-answer|exact_match|0.4898|± |0.0138|
```
## GPT Evaluations 0-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|--------------|-------|------|----------|-----:|---|-----:|
|boolq |Yaml |none |acc |0.8703|± |0.0059|
|lambada_openai|Yaml |none |perplexity|3.2598|± |0.0705|
| | |none |acc |0.7336|± |0.0062|
|piqa |Yaml |none |acc |0.8254|± |0.0089|
| | |none |acc_norm |0.8292|± |0.0088|
|sciq |Yaml |none |acc |0.9580|± |0.0063|
| | |none |acc_norm |0.9130|± |0.0089|
```
## MathQA 0-Shot
```
|Tasks |Version|Filter| Metric |Value | |Stderr|
|------|-------|------|--------|-----:|---|-----:|
|mathqa|Yaml |none |acc |0.3752|± |0.0089|
| | |none |acc_norm|0.3772|± |0.0089|
```
## PiQa 1-Shot
```
|Tasks|Version|Filter| Metric |Value | |Stderr|
|-----|-------|------|--------|-----:|---|-----:|
|piqa |Yaml |none |acc |0.8308|± |0.0087|
| | |none |acc_norm|0.8357|± |0.0086|
```
## Winogrande 5-Shot
```
| Tasks |Version|Filter|Metric|Value| |Stderr|
|----------|-------|------|------|----:|---|-----:|
|winogrande|Yaml |none |acc |0.768|± |0.0119|
```
## PubMedQA 0-Shot
```
| Tasks |Version|Filter|Metric|Value| |Stderr|
|--------|-------|------|------|----:|---|-----:|
|pubmedqa|Yaml |none |acc | 0.76|± |0.0191|
```
## RACE 1-Shot
```
|Tasks|Version|Filter|Metric|Value | |Stderr|
|-----|-------|------|------|-----:|---|-----:|
|race |Yaml |none |acc |0.5282|± |0.0154|
```
## MMLU 5-Shot (8-Bit)
```
| Groups |Version|Filter|Metric|Value | |Stderr|
|------------------|-------|------|------|-----:|---|-----:|
|mmlu |N/A |none |acc |0.6137|± |0.1243|
| - humanities |N/A |none |acc |0.5671|± |0.1101|
| - other |N/A |none |acc |0.6859|± |0.1164|
| - social_sciences|N/A |none |acc |0.7195|± |0.0713|
| - stem |N/A |none |acc |0.5087|± |0.1297|
```
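
The MMLU run above (and the DROP run below) are labeled 8-Bit. Here is a minimal sketch of 8-bit loading with `transformers` and `bitsandbytes`; the quantization setup is an assumption inferred from the section titles:

```python
# Sketch of 8-bit inference loading; assumes bitsandbytes and accelerate are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "fblgit/juanako-7b-UNA"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # lets accelerate place the quantized weights
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```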

## DROP 3-Shot (8-Bit) (Instruct-Eval)
```
{'score': 0.49801113762927607}
{'drop': 49.8}
drop: 49.8
```

## CRASS 0-Shot (Instruct-Eval)
```
{'score': 0.8357664233576643}
{'crass': 83.58}
crass: 83.58
```
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 14
- gradient_accumulation_steps: 16
- total_train_batch_size: 224
- total_eval_batch_size: 14
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 1
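
For reference, total_train_batch_size follows from the other settings: train_batch_size × gradient_accumulation_steps × num_devices = 1 × 16 × 14 = 224, and total_eval_batch_size = eval_batch_size × num_devices = 1 × 14 = 14.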

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.4795 | 0.2 | 56 | 0.4958 | -1.3684 | -2.6385 | 0.7552 | 1.2701 | -265.3887 | -241.2612 | -2.2572 | -2.4922 |
| 0.4642 | 0.4 | 112 | 0.4859 | -1.0380 | -1.9769 | 0.7273 | 0.9389 | -258.7718 | -237.9569 | -2.2414 | -2.4751 |
| 0.4758 | 0.61 | 168 | 0.4808 | -1.2594 | -2.3704 | 0.7343 | 1.1110 | -262.7074 | -240.1708 | -2.2305 | -2.4633 |
| 0.4549 | 0.81 | 224 | 0.4768 | -1.1906 | -2.3201 | 0.7552 | 1.1295 | -262.2044 | -239.4827 | -2.2284 | -2.4610 |
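
As a sanity check on these columns, Rewards/margins is the gap between the chosen and rejected rewards: at step 56, −1.3684 − (−2.6385) = 1.2701.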

### Framework versions

- Transformers 4.35.0-UNA
- Pytorch 2.1.0
- Datasets 2.14.6
- Tokenizers 0.14.1

## Citations
```
@misc{lin2021truthfulqa,
      title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
      author={Stephanie Lin and Jacob Hilton and Owain Evans},
      year={2021},
      eprint={2109.07958},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Author: [Xavier M.](mailto:[email protected]) @fblgit