---
extra_gated_heading: Access beomi/Yi-Ko-34B on Hugging Face
extra_gated_button_content: Submit
extra_gated_fields:
  I agree to share my name, email address and username: checkbox
  I confirm that I understand this project is for research purposes only, and confirm that I agree to follow the LICENSE of this model: checkbox
language:
- en
- ko
pipeline_tag: text-generation
inference: false
tags:
- pytorch
- Yi-Ko
- 01-ai
- Yi
library_name: transformers
license: apache-2.0
---
**Update**
@2024.07.08: LICENSE updated to Apache 2.0!🎉
# **beomi/Yi-Ko-34B**
Yi-Ko series models are advanced iterations of the 01-ai/Yi models,
with an expanded vocabulary and further pretraining on a Korean/English corpus.
Like their predecessors, Yi-Ko series models are generative text models ranging from 6 billion to 34 billion parameters.
This repository contains the **34B** pretrained version,
converted to the Hugging Face Transformers format.
For access to the other models, feel free to consult the index provided below.
## Model Details
**Model Developers** Junbum Lee (Beomi)
**Variations** The Yi-Ko series comes in two parameter sizes, 6B and 34B, both trained on Ko (Korean + English) data.
**Input** The models take text input only.
**Output** The models generate text only.
**Model Architecture**
Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.
<small>*The Yi model architecture is based on Llama 2, so the model can be loaded with the `LlamaForCausalLM` class in Hugging Face Transformers.</small>
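
As a minimal sketch of the note above, the model could be loaded through `LlamaForCausalLM` roughly as follows. The helper name `load_yi_ko`, the `bfloat16` dtype, and `device_map="auto"` (which needs the `accelerate` package) are assumptions made for illustration, not settings stated in this card. Because the repository is gated, you would first need to accept the access terms and authenticate, for example with `huggingface-cli login`.

```python
MODEL_ID = "beomi/Yi-Ko-34B"  # repository name taken from this model card


def load_yi_ko(model_id: str = MODEL_ID):
    """Return (tokenizer, model). Hypothetical helper, not part of any library."""
    # Heavy dependencies are imported lazily so this file can be imported cheaply.
    import torch
    from transformers import AutoTokenizer, LlamaForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 weights to reduce memory use
        device_map="auto",           # assumption: spread layers over available devices
    )
    return tokenizer, model


def complete(tokenizer, model, prompt: str, max_new_tokens: int = 32) -> str:
    """Generate a plain continuation; this is a base model, so no chat template is applied."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `tokenizer, model = load_yi_ko()` followed by `complete(tokenizer, model, "안녕하세요, 오늘은")` would download the full 34B checkpoint, so it is only practical on hardware with enough accelerator memory.
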
|Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Train tokens (per batch)|
|---|---|---|---|---|---|---|---|
|Yi-Ko-34B|*A mix of Korean + English online data*|34B|4k|Yes|40B+|5e-5|4M|
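
The figures in this table imply a rough training schedule. The sketch below is back-of-the-envelope arithmetic only: it assumes "40B+" is a lower bound of 40 billion tokens, "4M" means 4 × 10^6 tokens per optimizer step, "4k" means 4096 tokens, and every sequence is packed to the full context length. None of these readings is confirmed by the card.

```python
# Values read from the table; the exact interpretations are assumptions.
TRAINED_TOKENS = 40_000_000_000  # "40B+", treated as a lower bound
TOKENS_PER_BATCH = 4_000_000     # "4M", assumed to mean 4 * 10**6
CONTEXT_LENGTH = 4096            # "4k", assumed to mean 4096 tokens

# One optimizer step is assumed to consume exactly one batch.
steps = TRAINED_TOKENS // TOKENS_PER_BATCH

# Sequences per batch if each sequence fills the whole context window.
sequences_per_batch = TOKENS_PER_BATCH / CONTEXT_LENGTH

print(f"optimizer steps (lower bound): {steps:,}")
print(f"sequences per batch (approx.): {sequences_per_batch:.0f}")
```

Under these assumptions the further pretraining corresponds to at least 10,000 optimizer steps of roughly a thousand full-length sequences each.
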
**Vocab Expansion**
| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Yi-Series | 64000 | SentencePiece BPE |
| **Expanded Yi-Ko Series** | 78464 | SentencePiece BPE, with added Korean vocabulary and merges |
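
To put the expansion in numbers, the following short calculation uses only the two vocabulary sizes from the table above.

```python
ORIGINAL_VOCAB = 64_000  # Original Yi series
EXPANDED_VOCAB = 78_464  # Expanded Yi-Ko series

added_tokens = EXPANDED_VOCAB - ORIGINAL_VOCAB
relative_growth = added_tokens / ORIGINAL_VOCAB

print(f"added tokens: {added_tokens:,}")
print(f"relative growth: {relative_growth:.1%}")
```

Each added token also needs its own row in the input embedding matrix (and in the output projection if the weights are not tied), so the expanded model is slightly larger than the base model.
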
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"**
| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
| **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` |
| <small>*Uses the same Korean vocabulary as the Llama-2-Ko series</small> | | |
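
The `<0x..>` entries in the original Yi row are byte-fallback tokens: each stands for one UTF-8 byte of a character missing from the vocabulary. The sketch below, using only the Python standard library and the token lists copied from the table, reassembles both token sequences into the same sentence. The helper name `detokenize` is hypothetical, and the function only approximates what a real SentencePiece decoder does.

```python
import re

# Token lists copied verbatim from the table above.
ORIGINAL_YI_TOKENS = [
    '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하',
    '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁',
    '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁',
    '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁',
    '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>',
    '<0xEC>', '<0x9A>', '<0x94>', '.',
    '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>',
]
YI_KO_TOKENS = ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']

BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def detokenize(tokens):
    """Rebuild text from SentencePiece-style pieces, resolving byte-fallback tokens."""
    raw = bytearray()
    for token in tokens:
        match = BYTE_TOKEN.fullmatch(token)
        if match:
            # A byte-fallback token contributes exactly one raw byte.
            raw.append(int(match.group(1), 16))
        else:
            # '▁' marks a word boundary and becomes a space.
            raw.extend(token.replace("▁", " ").encode("utf-8"))
    # Drop the space produced by a leading '▁', if any.
    return raw.decode("utf-8").lstrip(" ")


print(len(ORIGINAL_YI_TOKENS), detokenize(ORIGINAL_YI_TOKENS))
print(len(YI_KO_TOKENS), detokenize(YI_KO_TOKENS))
```

Both lists decode to the same sentence, but the expanded tokenizer needs 10 tokens where the original needs 47, so the same Korean text uses far less of the 4k context window.
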
**Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."**
| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
| **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
| <small>*Uses the same Korean vocabulary as the Llama-2-Ko series</small> | | <small>*Because the **Expanded Yi-Ko Series** tokenizer prepends `▁` to the beginning of the text (to ensure consistent tokenization of Korean sentences), the first token differs negligibly in English tokenization.</small> |
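
The footnote's claim can be checked directly against the two token lists in the table. This small comparison uses only the lists as printed and reports the positions at which they differ.

```python
# Token lists copied verbatim from the table above.
ORIGINAL_YI = ['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language',
               '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers',
               '▁at', '▁', '0', '1', '.', 'AI', '.']
YI_KO = ['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language',
         '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers',
         '▁at', '▁', '0', '1', '.', 'AI', '.']

# Collect every position where the two tokenizations disagree.
differing = [(i, a, b) for i, (a, b) in enumerate(zip(ORIGINAL_YI, YI_KO)) if a != b]

print(len(ORIGINAL_YI), len(YI_KO))
print(differing)
```

Only position 0 differs (`'The'` versus `'▁The'`), so English text costs the same number of tokens with either tokenizer.
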
# **Model Benchmark**
## LM Eval Harness - Korean Benchmarks
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|----------------|------:|------|-----:|--------|-----:|---|------|
|**kmmlu_direct**|N/A |none | 5|exact_match|**0.5027**|± |0.1019|
|kobest_boolq | 1|none | 5|acc |0.9202|± |0.0072|
| | |none | 5|f1 |0.9202|± |N/A |
|kobest_copa | 1|none | 5|acc |0.8480|± |0.0114|
| | |none | 5|f1 |0.8479|± |N/A |
|kobest_hellaswag| 1|none | 5|acc |0.5320|± |0.0223|
| | |none | 5|f1 |0.5281|± |N/A |
| | |none | 5|acc_norm|0.6340|± |0.0216|
|kobest_sentineg | 1|none | 5|acc |0.9874|± |0.0056|
| | |none | 5|f1 |0.9874|± |N/A |
|haerae |N/A |none | 5|acc |0.7965|± |0.0116|
| | |none | 5|acc_norm|0.7965|± |0.0116|
| - haerae_general_knowledge | 1|none | 5|acc |0.5114|± |0.0378|
| | |none | 5|acc_norm|0.5114|± |0.0378|
| - haerae_history | 1|none | 5|acc |0.8511|± |0.0260|
| | |none | 5|acc_norm|0.8511|± |0.0260|
| - haerae_loan_word | 1|none | 5|acc |0.8402|± |0.0283|
| | |none | 5|acc_norm|0.8402|± |0.0283|
| - haerae_rare_word | 1|none | 5|acc |0.8642|± |0.0170|
| | |none | 5|acc_norm|0.8642|± |0.0170|
| - haerae_standard_nomenclature| 1|none | 5|acc |0.8301|± |0.0305|
| | |none | 5|acc_norm|0.8301|± |0.0305|
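
The Stderr column can be turned into approximate confidence intervals. The sketch below assumes a normal approximation (value ± 1.96 × stderr for a two-sided 95% interval), which is a common convention but is not stated in the table; the helper name `confidence_interval` is hypothetical. The accuracy rows are copied from the table above.

```python
Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def confidence_interval(value, stderr, z=Z_95):
    """Return (low, high) under a normal approximation."""
    half_width = z * stderr
    return value - half_width, value + half_width


# (accuracy, standard error) pairs copied from the table above.
reported = {
    "kobest_boolq": (0.9202, 0.0072),
    "kobest_copa": (0.8480, 0.0114),
    "kobest_hellaswag": (0.5320, 0.0223),
    "kobest_sentineg": (0.9874, 0.0056),
    "haerae": (0.7965, 0.0116),
}

for task, (acc, stderr) in reported.items():
    low, high = confidence_interval(acc, stderr)
    print(f"{task:<18} acc={acc:.4f}  95% CI=[{low:.4f}, {high:.4f}]")
```

Intervals this wide on the smaller tasks suggest that differences of a point or two between models should be read with caution.
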
## LICENSE
Apache 2.0 (for research)
> For commercial use,
> email [email protected] to acquire a Yi-Ko series commercial license.
## Acknowledgement
Training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.