---
license: apache-2.0
tags:
- alignment-handbook
- generated_from_trainer
- juanako
- mistral
- UNA
datasets:
- HuggingFaceH4/ultrafeedback_binarized
model-index:
- name: juanako-7b-UNA
  results:
  - task:
      type: text-generation
      name: TruthfulQA (MC2)
    dataset:
      type: truthful_qa
      name: truthful_qa
      config: multiple_choice
      split: validation
    metrics:
    - type: accuracy
      value: 65.49
  - task:
      type: text-generation
      name: ARC-Challenge
    dataset:
      type: ai2_arc
      name: ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - type: accuracy
      value: 68.09
  - task:
      type: text-generation
      name: HellaSwag
    dataset:
      type: Rowan/hellaswag
      name: Rowan/hellaswag
      split: test
    metrics:
    - type: accuracy
      value: 85.20
  - task:
      type: text-generation
      name: GSM8k
    dataset:
      type: gsm8k
      name: gsm8k
      config: main
      split: test
    metrics:
    - type: accuracy
      value: 48.98
  - task:
      type: text-generation
      name: Winogrande
    dataset:
      type: winogrande
      name: winogrande
      config: winogrande_debiased
      split: test
    metrics:
    - type: accuracy
      value: 76.8
  - task:
      type: text-generation
      name: MMLU
    dataset:
      type: cais/mmlu
      name: cais/mmlu
      config: all
      split: test
    metrics:
    - type: accuracy
      value: 61.37
  - task:
      type: text-generation
      name: PiQA
    dataset:
      type: piqa
      name: piqa
      split: test
    metrics:
    - type: accuracy
      value: 83.57
  - task:
      type: text-generation
      name: DROP
    dataset:
      type: drop
      name: drop
      split: validation
    metrics:
    - type: accuracy
      value: 49.8
  - task:
      type: text-generation
      name: PubMedQA
    dataset:
      type: bigbio/pubmed_qa
      name: bigbio/pubmed_qa
      config: pubmed_qa_artificial_bigbio_qa
      split: validation
    metrics:
    - type: accuracy
      value: 76.0
---

# juanako-7b-UNA-v2

This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.
In many respects it outperforms most current Mistral-based models.

## Scoring and records (26-November-2023)
Here are some results:
* Scores #1 among 7B models
* Scores #4 in GSM8k
* Scores #2 in TruthfulQA
* Scores #6 in CoPa
* Scores #2 in PiQA
* Scores #9 in BoolQ

Many evaluations were performed, and the model behaves in a balanced way across multiple fields. Feel free to submit more evaluation results.

It scores **65.1** according to the HuggingFace Open LLM Leaderboard.

## Model description

juanako uses UNA (Uniform Neural Alignment), a training technique that eases alignment between transformer layers and is yet to be published.
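
As a quick start, here is a minimal inference sketch using the standard `transformers` text-generation API. The repo id, dtype, and generation settings below are illustrative assumptions, not documented defaults:

```python
# Minimal sketch, assuming the standard transformers text-generation API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fblgit/juanako-7b-UNA"  # assumed repo id for this card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumed dtype; fp16 fits a 7B model on a 24 GB GPU
    device_map="auto",
)

prompt = "Explain the difference between acc and acc_norm in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```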

## TruthfulQA 0-Shot
```
| Tasks |Version|Filter|Metric|Value | |Stderr|
|--------------|-------|------|------|-----:|---|-----:|
|truthfulqa_mc2|Yaml |none |acc |0.6549|± |0.0153|
```
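
The tables in this and the following sections are in the output format of EleutherAI's lm-evaluation-harness. Assuming that harness produced them, a sketch like the following should reproduce the number above (harness version and arguments are assumptions):

```python
# Sketch of reproducing the TruthfulQA MC2 score with lm-evaluation-harness.
# Assumes lm-eval >= 0.4; task names and few-shot counts vary per section.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=fblgit/juanako-7b-UNA,dtype=float16",  # assumed repo id
    tasks=["truthfulqa_mc2"],
    num_fewshot=0,
)
print(results["results"]["truthfulqa_mc2"])
```

The other sections would swap `tasks` and `num_fewshot` accordingly, e.g. `arc_challenge` with 25 shots or `gsm8k` with 5.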

## ARC 25-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|-------------|-------|------|--------|-----:|---|-----:|
|arc_challenge|Yaml |none |acc |0.6476|± |0.0140|
| | |none |acc_norm|0.6809|± |0.0136|
```
## HellaSwag 10-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|---------|-------|------|--------|-----:|---|-----:|
|hellaswag|Yaml |none |acc |0.6703|± |0.0047|
| | |none |acc_norm|0.8520|± |0.0035|
```
## GSM8k 5-Shot
```
|Tasks|Version| Filter | Metric |Value | |Stderr|
|-----|-------|----------|-----------|-----:|---|-----:|
|gsm8k|Yaml |get-answer|exact_match|0.4898|± |0.0138|
```
## GPT Evaluations 0-Shot
```
| Tasks |Version|Filter| Metric |Value | |Stderr|
|--------------|-------|------|----------|-----:|---|-----:|
|boolq |Yaml |none |acc |0.8703|± |0.0059|
|lambada_openai|Yaml |none |perplexity|3.2598|± |0.0705|
| | |none |acc |0.7336|± |0.0062|
|piqa |Yaml |none |acc |0.8254|± |0.0089|
| | |none |acc_norm |0.8292|± |0.0088|
|sciq |Yaml |none |acc |0.9580|± |0.0063|
| | |none |acc_norm |0.9130|± |0.0089|
```
## MathQA 0-Shot
```
|Tasks |Version|Filter| Metric |Value | |Stderr|
|------|-------|------|--------|-----:|---|-----:|
|mathqa|Yaml |none |acc |0.3752|± |0.0089|
| | |none |acc_norm|0.3772|± |0.0089|
```
## PiQa 1-Shot
```
|Tasks|Version|Filter| Metric |Value | |Stderr|
|-----|-------|------|--------|-----:|---|-----:|
|piqa |Yaml |none |acc |0.8308|± |0.0087|
| | |none |acc_norm|0.8357|± |0.0086|
```
## Winogrande 5-Shot
```
| Tasks |Version|Filter|Metric|Value| |Stderr|
|----------|-------|------|------|----:|---|-----:|
|winogrande|Yaml |none |acc |0.768|± |0.0119|
```
## PubMedQA 0-Shot
```
| Tasks |Version|Filter|Metric|Value| |Stderr|
|--------|-------|------|------|----:|---|-----:|
|pubmedqa|Yaml |none |acc | 0.76|± |0.0191|
```
## RACE 1-Shot
```
|Tasks|Version|Filter|Metric|Value | |Stderr|
|-----|-------|------|------|-----:|---|-----:|
|race |Yaml |none |acc |0.5282|± |0.0154|
```
## MMLU 5-Shot (8-Bit)
```
| Groups |Version|Filter|Metric|Value | |Stderr|
|------------------|-------|------|------|-----:|---|-----:|
|mmlu |N/A |none |acc |0.6137|± |0.1243|
| - humanities |N/A |none |acc |0.5671|± |0.1101|
| - other |N/A |none |acc |0.6859|± |0.1164|
| - social_sciences|N/A |none |acc |0.7195|± |0.0713|
| - stem |N/A |none |acc |0.5087|± |0.1297|
```
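
The MMLU run above (and the DROP run below) are labeled 8-Bit. Here is a minimal sketch of 8-bit loading with `transformers` and `bitsandbytes`; the quantization setup is an assumption inferred from the section titles:

```python
# Sketch of 8-bit inference loading; assumes bitsandbytes and accelerate are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "fblgit/juanako-7b-UNA"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # lets accelerate place the quantized weights
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```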

## DROP 3-Shot (8-Bit) (Instruct-Eval)
```
{'score': 0.49801113762927607}
{'drop': 49.8}
drop: 49.8
```

## CRASS 0-Shot (Instruct-Eval)
```
{'score': 0.8357664233576643}
{'crass': 83.58}
crass: 83.58
```
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 14
- gradient_accumulation_steps: 16
- total_train_batch_size: 224
- total_eval_batch_size: 14
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 1
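
For reference, total_train_batch_size follows from the other settings: train_batch_size × gradient_accumulation_steps × num_devices = 1 × 16 × 14 = 224, and total_eval_batch_size = eval_batch_size × num_devices = 1 × 14 = 14.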

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.4795 | 0.2 | 56 | 0.4958 | -1.3684 | -2.6385 | 0.7552 | 1.2701 | -265.3887 | -241.2612 | -2.2572 | -2.4922 |
| 0.4642 | 0.4 | 112 | 0.4859 | -1.0380 | -1.9769 | 0.7273 | 0.9389 | -258.7718 | -237.9569 | -2.2414 | -2.4751 |
| 0.4758 | 0.61 | 168 | 0.4808 | -1.2594 | -2.3704 | 0.7343 | 1.1110 | -262.7074 | -240.1708 | -2.2305 | -2.4633 |
| 0.4549 | 0.81 | 224 | 0.4768 | -1.1906 | -2.3201 | 0.7552 | 1.1295 | -262.2044 | -239.4827 | -2.2284 | -2.4610 |
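
As a sanity check on these columns, Rewards/margins is the gap between the chosen and rejected rewards: at step 56, −1.3684 − (−2.6385) = 1.2701.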

### Framework versions

- Transformers 4.35.0-UNA
- Pytorch 2.1.0
- Datasets 2.14.6
- Tokenizers 0.14.1

## Citations
```
@misc{lin2021truthfulqa,
      title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
      author={Stephanie Lin and Jacob Hilton and Owain Evans},
      year={2021},
      eprint={2109.07958},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Author: [Xavier M.](mailto:[email protected]) @fblgit