Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

open-llama-2-ko-7b - bnb 8bits
- Model creator: https://huggingface.co/beomi/
- Original model: https://huggingface.co/beomi/open-llama-2-ko-7b/

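The snippet below is a minimal, illustrative sketch of loading the model in 8-bit with `bitsandbytes` through `transformers`; it quantizes the original `beomi/open-llama-2-ko-7b` checkpoint on the fly, so it may not correspond exactly to how the files in this repository were produced.

```python
# Minimal sketch: load the checkpoint in 8-bit using bitsandbytes via transformers.
# Assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and a GPU is available.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "beomi/open-llama-2-ko-7b"  # original weights; point at the 8-bit repo to reuse pre-quantized files

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit quantization at load time
    device_map="auto",
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
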
Original model description:
---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---

**Update Log**

- 2023.12.14: Initial Release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko is an advanced iteration of the Llama 2 model, featuring an expanded vocabulary and additional pretraining on a Korean corpus. Like its predecessor, Llama-2-Ko, it belongs to a family of generative text models with parameter counts ranging from 7 billion to 70 billion. This repository focuses on the 7B pretrained version, designed to work seamlessly with the Hugging Face Transformers format.

The primary distinction between the Llama-2-Ko series and Open-Llama-2-Ko lies in the dataset: Open-Llama-2-Ko exclusively uses publicly accessible Korean corpora, including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).

Because training was conducted solely with publicly available corpora, this model is open for unrestricted use by everyone, under the MIT License*.

*MIT License under LLAMA 2 COMMUNITY LICENSE AGREEMENT

## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Open-Llama-2-Ko will be available in different parameter sizes (7B and 13B), along with various pretrained options.

**Input:** The model accepts text input only.

**Output:** The model produces text output only.

**Model Architecture:**

Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture derived from Llama-2.

| |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Llama 2|*A curated mix of publicly accessible Korean corpora*|7B|2k|✘|>15B*|5e-5|

**Training Corpus**

The model was trained on selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
- Only the `Training` segment of the data was used.
- The `Validation` and `Test` segments were deliberately excluded.
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)

The final JSONL dataset used to train this model is approximately 61GB in size.

Total token count: approximately 15 billion tokens (*counted with the expanded tokenizer; with the original Llama tokenizer, >60 billion tokens).

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | SentencePiece BPE |
| **Expanded Llama-2-Ko** | 46336 | SentencePiece BPE, with added Korean vocabulary and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |

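As an illustration of the tables above, the following minimal sketch compares the two tokenizers directly; it assumes `transformers` and `sentencepiece` are installed, and it uses the gated `meta-llama/Llama-2-7b-hf` tokenizer purely as a baseline reference (any Llama-2 tokenizer repo would do).

```python
# Minimal sketch: reproduce the tokenization comparison shown above.
# Assumes access to the gated meta-llama repo (accept Meta's license and log in first).
from transformers import AutoTokenizer

ko_tok = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b")
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Llama 2: Open Foundation and Fine-Tuned Chat Models",
]:
    print("Llama-2   :", base_tok.tokenize(text))
    print("Llama-2-Ko:", ko_tok.tokenize(text))

# The vocabulary expansion is also visible from the tokenizer sizes:
print(len(base_tok), len(ko_tok))  # expected roughly 32000 vs. 46336
```
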
# LICENSE

[MIT License under LLAMA 2 COMMUNITY LICENSE AGREEMENT](./LICENSE)

# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) (polyglot branch)

TBD

## Citation

TBD

## Acknowledgements

- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus includes data from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).