Asuncom committed 8a60638 (verified) · Parent(s): 0b05c34

Update README.md

Files changed (1):
  1. README.md +333 -0

README.md CHANGED
@@ -20,3 +20,336 @@ base_model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
```python
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```

```python
!pip install --upgrade pip
```

```python
!pip install --no-deps "xformers<0.0.26" "trl<0.9.0" peft accelerate bitsandbytes
```
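
Before loading the model, it can help to confirm which versions of the pinned packages were actually installed; a quick optional check (assuming a Colab/Linux shell, as the `!pip` cells above do):

```python
# Optional: list the installed versions of the key packages (Colab/Linux shell assumed).
!pip list | grep -E "unsloth|xformers|trl|peft|accelerate|bitsandbytes"
```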

```python
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
```

```python
# ========================================================
# Test before training
# ========================================================
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "请把现代汉语翻译成古文", # instruction: "Translate the modern Chinese into Classical Chinese"
        "其品行廉正,所以至死也不放松对自己的要求。", # input: "His conduct was upright, so even unto death he never relaxed the demands he placed on himself."
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
```
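
Optionally, you can check how small the trainable LoRA footprint is relative to the full model. This assumes the object returned by `get_peft_model` exposes PEFT's `print_trainable_parameters` helper, which is not shown in the original notebook:

```python
# Optional sanity check: report trainable vs. total parameters after adding the LoRA adapters.
# Assumes the returned model behaves like a PEFT PeftModel.
model.print_trainable_parameters()
```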

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("Asuncom/shiji-qishiliezhuan", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
```
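
A quick look at one mapped example (optional) confirms that the Alpaca template and the EOS token were applied before training starts:

```python
# Optional: inspect the first formatted training example.
print(dataset[0]["text"])
```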

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
```
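
With `per_device_train_batch_size = 2` and `gradient_accumulation_steps = 4`, each optimizer step effectively sees 8 examples on a single GPU; a small sketch of that arithmetic:

```python
# Effective batch size per optimizer step (assuming a single GPU, as in this notebook).
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 8
```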

```python
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```

```python
import wandb

# Initialize a W&B run in offline mode
wandb.init(mode="offline", project="asuncom", entity="asuncom")
```
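
Because the run is started in offline mode, metrics are only written locally during training; they can be pushed to wandb.ai afterwards with the standard `wandb sync` CLI. The run directory below is W&B's default location and is an assumption, not output from this notebook:

```python
# Optional: after training finishes, sync the offline run to the W&B servers.
# Adjust the path if your offline run was written somewhere else.
!wandb sync wandb/offline-run-*
```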

```python
trainer_stats = trainer.train()
```

```python
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
```

```python
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "请把现代汉语翻译成古文", # instruction: "Translate the modern Chinese into Classical Chinese"
        "其品行廉正,所以至死也不放松对自己的要求。", # input: "His conduct was upright, so even unto death he never relaxed the demands he placed on himself."
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```

```python
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", token = "hf_...")     # Online saving; use your Hugging Face token
tokenizer.push_to_hub("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", token = "hf_...") # Online saving
```

```python
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```

```python
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, save_method = "merged_16bit", token = "hf_...")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, save_method = "merged_4bit", token = "hf_...")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, save_method = "lora", token = "hf_...")
```

```python
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, quantization_method = "q4_k_m", token = "hf_xxxxx")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "Asuncom/Llama-3.1-8B-bnb-4bit-shiji", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "hf_...", # Get a token at https://huggingface.co/settings/tokens
    )
```

```python
model.push_to_hub_gguf(
    "Asuncom/Llama-3.1-8B-bnb-4bit-shiji", # Change hf to your username!
    tokenizer,
    quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
    token = "hf_...", # Get a token at https://huggingface.co/settings/tokens
)
```

```
[ 279/ 292] blk.30.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 280/ 292] blk.30.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 281/ 292] blk.30.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 282/ 292] blk.31.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q5_K .. size = 112.00 MiB -> 38.50 MiB
[ 283/ 292] blk.31.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q5_K .. size = 112.00 MiB -> 38.50 MiB
[ 284/ 292] blk.31.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q5_K .. size = 8.00 MiB -> 2.75 MiB
[ 285/ 292] blk.31.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 286/ 292] blk.31.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 287/ 292] blk.31.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 288/ 292] output.weight - [ 4096, 128256, 1, 1], type = f16, converting to q6_K .. size = 1002.00 MiB -> 410.98 MiB
[ 289/ 292] blk.31.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 290/ 292] blk.31.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 291/ 292] blk.31.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 292/ 292] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
llama_model_quantize_internal: model size = 15317.02 MB
llama_model_quantize_internal: quant size = 5459.93 MB

main: quantize time = 147401.53 ms
main: total time = 147401.53 ms
Unsloth: Conversion completed! Output location: ./Asuncom/Llama-3.1-8B-bnb-4bit-shiji/unsloth.Q5_K_M.gguf
Unsloth: Uploading GGUF to Huggingface Hub...

unsloth.F16.gguf: 100%|██████████| 16.1G/16.1G [26:20<00:00, 10.2MB/s]

Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
Unsloth: Uploading GGUF to Huggingface Hub...

unsloth.Q4_K_M.gguf: 100%|██████████| 4.92G/4.92G [08:05<00:00, 10.1MB/s]

Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
Unsloth: Uploading GGUF to Huggingface Hub...

unsloth.Q8_0.gguf: 100%|██████████| 8.54G/8.54G [13:48<00:00, 10.3MB/s]

Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
Unsloth: Uploading GGUF to Huggingface Hub...

unsloth.Q5_K_M.gguf: 100%|██████████| 5.73G/5.73G [09:24<00:00, 10.2MB/s]

Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
```
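
Once the GGUF files are uploaded, the q4_k_m artifact can be tried locally with any llama.cpp-compatible runtime. A minimal sketch using the `llama-cpp-python` bindings; the package, the local file path, and the prompt wrapper are assumptions for illustration and are not part of the original notebook:

```python
# Minimal local smoke test of the exported GGUF (pip install llama-cpp-python first).
from llama_cpp import Llama

# Path assumed from the conversion log above; adjust to wherever the file lives locally.
llm = Llama(model_path="Asuncom/Llama-3.1-8B-bnb-4bit-shiji/unsloth.Q4_K_M.gguf", n_ctx=2048)

# Reuse the same Alpaca-style prompt the model was fine-tuned on.
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
请把现代汉语翻译成古文

### Input:
其品行廉正,所以至死也不放松对自己的要求。

### Response:
"""

out = llm(prompt, max_tokens=128)
print(out["choices"][0]["text"])
```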