JeongwonChoi committed on
Commit 4824cb8 • 1 Parent(s): 5ddf8ac

Update README.md

Files changed (1)
  1. README.md +81 -35
README.md CHANGED
@@ -1,77 +1,123 @@
  ---
  tags:
- - text-generation
  license: cc-by-nc-sa-4.0
  language:
- - ko
  base_model: yanolja/KoSOLAR-10.7B-v0.1
  pipeline_tag: text-generation
  ---

  # **DataVortexS-10.7B-v0.4**
- <img src="./DataVortex.png" alt="DataVortex" style="height: 8em;">
-
- ## **License**

- [cc-by-nc-sa-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

  ## **Model Details**

  ### **Base Model**
  [yanolja/KoSOLAR-10.7B-v0.1](https://huggingface.co/yanolja/KoSOLAR-10.7B-v0.1) _(Tokenizer Issue Fixed Version)_

  ### **Trained On**
- H100 80GB 1ea

  ### **Instruction format**

- It follows **(No Input) Alpaca** format.
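For reference, the classic no-input Alpaca layout this line refers to is sketched below; the exact system line and template this model uses are documented in the updated **Instruction format** section further down.

```python
# The classic no-input Alpaca prompt layout (from the original Stanford
# Alpaca project), shown for reference only; this model's own system line
# and template are in the updated section further down.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
print(prompt.format(instruction="대한민국의 수도는 어디야?"))  # "What is the capital of South Korea?"
```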

  ## **Model Benchmark**

- ### **Ko-LLM-Leaderboard**

- On Benchmarking...

- # **Implementation Code**

- Since the chat_template already contains the instruction format above,
  you can use the code below.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

- device = "cuda"

- model = AutoModelForCausalLM.from_pretrained("Edentns/DataVortexS-10.7B-v0.4", device_map=device)
  tokenizer = AutoTokenizer.from_pretrained("Edentns/DataVortexS-10.7B-v0.4")

  messages = [
-     { "role": "user", "content": "대한민국의 수도는 어디야?" }
  ]

- encoded = tokenizer.apply_chat_template(
-     messages,
-     add_generation_prompt=True,
-     return_tensors="pt",
-     return_token_type_ids=False
- ).to(device)
-
- decoded = model.generate(
-     input_ids=encoded,
-     temperature=0.2,
-     top_p=0.9,
-     repetition_penalty=1.2,
-     do_sample=True,
-     max_length=4096,
-     eos_token_id=tokenizer.eos_token_id,
-     pad_token_id=tokenizer.eos_token_id
- )
- decoded = decoded[0][encoded.shape[1]:decoded[0].shape[-1]]
- decoded_text = tokenizer.decode(decoded, skip_special_tokens=True)
- print(decoded_text)
  ```

  <div align="center">
    <a href="https://edentns.com/">
      <img src="./Logo.png" alt="Logo" style="height: 3em;">

  ---
  tags:
+ - text-generation
  license: cc-by-nc-sa-4.0
  language:
+ - ko
  base_model: yanolja/KoSOLAR-10.7B-v0.1
  pipeline_tag: text-generation
+ datasets:
+ - Edentns/data_go_kr-PublicDoc
+ - Edentns/aihub-TL_unanswerable_output
+ - Edentns/aihub-TL_span_extraction_how_output
+ - Edentns/aihub-TL_multiple_choice_output
+ - Edentns/aihub-TL_text_entailment_output
+ - jojo0217/korean_rlhf_dataset
+ - kyujinpy/KOR-OpenOrca-Platypus-v3
+ - beomi/KoAlpaca-v1.1a
+ - HumanF-MarkrAI/WIKI_QA_Near_dedup
  ---

  # **DataVortexS-10.7B-v0.4**

+ <img src="./DataVortex.png" alt="DataVortex" style="height: 8em;">

  ## **Model Details**

  ### **Base Model**
+
  [yanolja/KoSOLAR-10.7B-v0.1](https://huggingface.co/yanolja/KoSOLAR-10.7B-v0.1) _(Tokenizer Issue Fixed Version)_

  ### **Trained On**
+
+ - **OS**: Ubuntu 20.04
+ - **GPU**: H100 80GB × 1
+ - **transformers**: v4.36.2
+
+ ### **Dataset**
+
+ - Edentns/data_go_kr-PublicDoc - private
+ - Edentns/aihub-TL_unanswerable_output - private
+ - Edentns/aihub-TL_span_extraction_how_output - private
+ - Edentns/aihub-TL_multiple_choice_output - private
+ - Edentns/aihub-TL_text_entailment_output - private
+ - [jojo0217/korean_rlhf_dataset](https://huggingface.co/datasets/jojo0217/korean_rlhf_dataset)
+ - [kyujinpy/KOR-OpenOrca-Platypus-v3](https://huggingface.co/datasets/kyujinpy/KOR-OpenOrca-Platypus-v3)
+ - [beomi/KoAlpaca-v1.1a](https://huggingface.co/datasets/beomi/KoAlpaca-v1.1a)
+ - [HumanF-MarkrAI/WIKI_QA_Near_dedup](https://huggingface.co/datasets/HumanF-MarkrAI/WIKI_QA_Near_dedup)

  ### **Instruction format**

+ It follows the **Alpaca** format.
+
+ E.g.
+
+ ```python
+ text = """\
+ 당신은 사람들이 정보를 찾을 수 있도록 도와주는 인공지능 비서입니다.
+
+ ### Instruction:
+ 대한민국의 수도는 어디야?
+
+ ### Response:
+ 대한민국의 수도는 서울입니다.
+
+ ### Instruction:
+ 서울 인구는 총 몇 명이야?
+ """
+ ```
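In English, the example reads: a system line ("You are an AI assistant that helps people find information."), the instruction "What is the capital of South Korea?", the response "The capital of South Korea is Seoul.", and a follow-up instruction "What is the total population of Seoul?". As a rough sketch — assuming the bundled chat_template renders exactly the blocks shown above, which is not verified here — a message list maps onto this layout like so:

```python
# Hypothetical helper for illustration only: the authoritative template is
# bundled with the tokenizer and applied via apply_chat_template (see the
# Implementation Code section below).
def to_alpaca(messages):
    parts = []
    for m in messages:
        if m["role"] == "system":
            parts.append(m["content"] + "\n")
        elif m["role"] == "user":
            parts.append("### Instruction:\n" + m["content"] + "\n")
        elif m["role"] == "assistant":
            parts.append("### Response:\n" + m["content"] + "\n")
    parts.append("### Response:\n")  # open a response block for generation
    return "\n".join(parts)
```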
 
  ## **Model Benchmark**

+ ### **[Ko-LLM-Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard)**
+
+ Benchmarking is in progress; the zero entries below are placeholders.

+ | Model | Average | Ko-ARC | Ko-HellaSwag | Ko-MMLU | Ko-TruthfulQA | Ko-CommonGen V2 |
+ | ---------------------------- | ------- | ------ | ------------ | ------- | ------------- | --------------- |
+ | DataVortexM-7B-Instruct-v0.1 | 39.81 | 34.13 | 42.35 | 38.73 | 45.46 | 38.37 |
+ | DataVortexS-10.7B-v0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | DataVortexS-10.7B-v0.2 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | DataVortexS-10.7B-v0.3 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | **DataVortexS-10.7B-v0.4** | **0** | **0** | **0** | **0** | **0** | **0** |
+ | DataVortexS-10.7B-v0.5 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | DataVortexTL-1.1B-v0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | DataVortexS-10.7B-dpo-v0.1 | 0 | 0 | 0 | 0 | 0 | 0 |

+ ## **Implementation Code**

+ The tokenizer's chat_template already applies the instruction format above.
  You can use the code below.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

+ device = "cuda"  # the device to load the model onto

+ model = AutoModelForCausalLM.from_pretrained("Edentns/DataVortexS-10.7B-v0.4")
  tokenizer = AutoTokenizer.from_pretrained("Edentns/DataVortexS-10.7B-v0.4")

  messages = [
+     {"role": "system", "content": "당신은 사람들이 정보를 찾을 수 있도록 도와주는 인공지능 비서입니다."},  # "You are an AI assistant that helps people find information."
+     {"role": "user", "content": "대한민국의 수도는 어디야?"},  # "What is the capital of South Korea?"
+     {"role": "assistant", "content": "대한민국의 수도는 서울입니다."},  # "The capital of South Korea is Seoul."
+     {"role": "user", "content": "서울 인구는 총 몇 명이야?"}  # "What is the total population of Seoul?"
  ]

+ encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
+
+ model_inputs = encodeds.to(device)
+ model.to(device)
+
+ generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
+ decoded = tokenizer.batch_decode(generated_ids)
+ print(decoded[0])
  ```
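The call above samples with the library defaults. For tighter control, the decoding settings from the previous revision of this card (temperature 0.2, top_p 0.9, repetition_penalty 1.2) can be passed explicitly, and the prompt can be stripped so only the new reply is printed — a sketch continuing from the variables defined in the block above:

```python
# Continues from the block above (model, tokenizer, model_inputs).
# Decoding settings carried over from the previous revision of this card;
# tune them for your own use case.
generated_ids = model.generate(
    model_inputs,
    temperature=0.2,
    top_p=0.9,
    repetition_penalty=1.2,
    do_sample=True,
    max_new_tokens=1000,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)
# Slice off the prompt tokens so only the newly generated reply is decoded.
new_tokens = generated_ids[0][model_inputs.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```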

+ ## **License**
+
+ The model is licensed under the [cc-by-nc-sa-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license, which allows others to copy, modify, and share the work non-commercially, as long as they give appropriate credit and distribute any derivative works under the same license.
+
  <div align="center">
    <a href="https://edentns.com/">
      <img src="./Logo.png" alt="Logo" style="height: 3em;">