ArunKr committed on
Commit
93ef792
1 Parent(s): 014df37

Delete copy_of_alpaca_+_llama_3_8b_full_example.py

copy_of_alpaca_+_llama_3_8b_full_example.py DELETED
# -*- coding: utf-8 -*-
"""Copy of Alpaca + Llama-3 8b full example.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/12GTGPtaZvutZlE2GUHmeXVrvq2dgJdu6

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a></i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for Llama.cpp).

**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was 2 trillion.**
"""

# Commented out IPython magic to ensure Python compatibility.
# %%capture
# # Installs Unsloth, Xformers (Flash Attention) and all other packages!
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes

"""* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
"""

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+.
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
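
"""A note on `dtype = None` above: Unsloth auto-detects a suitable dtype for the GPU. If you prefer to set it explicitly, a minimal hedged sketch (not part of the original notebook) is below - bf16 on Ampere or newer cards, fp16 on older ones like the free Tesla T4."""

# Hedged sketch: pick the compute dtype by hand instead of relying on auto detection.
# This mirrors the fp16/bf16 switch used later in TrainingArguments.
explicit_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"Would pass dtype = {explicit_dtype} to FastLanguageModel.from_pretrained")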

"""We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"""

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
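
"""To sanity-check the "1 to 10% of all parameters" claim above, you can count the trainable LoRA weights. This is a hedged sketch, not part of the original notebook; it only assumes the PEFT-wrapped model exposes standard `parameters()`."""

# Hedged sketch: report how many parameters LoRA actually trains.
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} "
      f"({100 * trainable_params / total_params:.2f}%)")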

"""<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of 52K of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input), read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only). A hedged sketch follows the data prep code below.

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
"""

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
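
"""As mentioned in the note above, you can mask out the prompt so the loss is only computed on the response. A hedged sketch using TRL's `DataCollatorForCompletionOnlyLM` follows - it assumes the installed TRL version exposes that collator and that `packing = False` stays off."""

# Hedged sketch: compute the loss only on text after "### Response:".
# Pass `data_collator = collator` to the SFTTrainer defined in the next section.
# If the tokenizer splits the template differently in context, TRL's docs show
# passing the template's token ids instead of the raw string.
from trl import DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(
    response_template = "### Response:", # matches alpaca_prompt above
    tokenizer = tokenizer,
)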

"""<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run set `num_train_epochs = 1` and remove `max_steps = 60` (a sketch of those arguments follows the trainer cell below). We also support TRL's `DPOTrainer`!
"""

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
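
"""The effective batch size above is `per_device_train_batch_size * gradient_accumulation_steps = 2 * 4 = 8` sequences per optimizer step, so 60 steps only touch about 480 examples. A hedged sketch of the arguments for a full pass over the dataset (not part of the original notebook run):"""

# Hedged sketch: train for one full epoch instead of 60 steps.
# Pass `args = full_run_args` to a fresh SFTTrainer for the full run.
full_run_args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_steps = 5,
    num_train_epochs = 1, # full pass over the 52K examples
    learning_rate = 2e-4,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 1,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs",
)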

#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

trainer_stats = trainer.train()

#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

"""<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!
"""

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
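
"""`tokenizer.batch_decode(outputs)` returns the prompt plus the completion. If you only want the generated text, a hedged sketch (not from the original notebook) is to slice off the prompt tokens first:"""

# Hedged sketch: decode only the newly generated tokens, dropping the prompt.
prompt_length = inputs["input_ids"].shape[1]
generated_only = tokenizer.batch_decode(
    outputs[:, prompt_length:], skip_special_tokens = True
)
print(generated_only[0])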

"""You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!"""

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

"""<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
"""

# model.save_pretrained("lora_model") # Local saving
# tokenizer.save_pretrained("lora_model")
model.push_to_hub("Arun1982/LLama3-LoRA", token = "") # Online saving - supply your own HF token
tokenizer.push_to_hub("Arun1982/LLama3-LoRA", token = "") # Online saving
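
"""Rather than pasting a write token into every `push_to_hub` call, a hedged alternative (not part of the original notebook) is to log in once via `huggingface_hub` and let the calls pick the token up automatically:"""

# Hedged sketch: authenticate once instead of hardcoding `token = ...` everywhere.
from huggingface_hub import login

login() # prompts for a token; or pass login(token = "hf_...") / set HF_TOKEN in the environment
# model.push_to_hub("Arun1982/LLama3-LoRA")     # token no longer needed explicitly
# tokenizer.push_to_hub("Arun1982/LLama3-LoRA")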
217
-
218
- """Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:"""
219
-
220
- if False:
221
- from unsloth import FastLanguageModel
222
- model, tokenizer = FastLanguageModel.from_pretrained(
223
- model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
224
- max_seq_length = max_seq_length,
225
- dtype = dtype,
226
- load_in_4bit = load_in_4bit,
227
- )
228
- FastLanguageModel.for_inference(model) # Enable native 2x faster inference
229
-
230
- # alpaca_prompt = You MUST copy from above!
231
-
232
- inputs = tokenizer(
233
- [
234
- alpaca_prompt.format(
235
- "What is a famous tall tower in Paris?", # instruction
236
- "", # input
237
- "", # output - leave this blank for generation!
238
- )
239
- ], return_tensors = "pt").to("cuda")
240
-
241
- outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
242
- tokenizer.batch_decode(outputs)
243
-
244
- """You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**."""
245
-
246
- if False:
247
- # I highly do NOT suggest - use Unsloth if possible
248
- from peft import AutoPeftModelForCausalLM
249
- from transformers import AutoTokenizer
250
- model = AutoPeftModelForCausalLM.from_pretrained(
251
- "lora_model", # YOUR MODEL YOU USED FOR TRAINING
252
- load_in_4bit = load_in_4bit,
253
- )
254
- tokenizer = AutoTokenizer.from_pretrained("lora_model")

"""### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
"""

# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged("Arun1982/LLama3-LoRA", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if True: model.push_to_hub_merged("Arun1982/LLama3-LoRA", tokenizer, save_method = "merged_4bit_forced", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if True: model.push_to_hub_merged("Arun1982/LLama3-LoRA", tokenizer, save_method = "lora", token = "")
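
"""Once the `merged_16bit` weights are on the Hub, they can be served with vLLM. A hedged sketch, not part of the original notebook; it assumes `pip install vllm`, that the merged repo above is accessible, and that the GPU has enough memory for 16bit Llama-3 8b:"""

# Hedged sketch: load the merged float16 checkpoint with vLLM and generate.
from vllm import LLM, SamplingParams

llm = LLM(model = "Arun1982/LLama3-LoRA")
sampling = SamplingParams(max_tokens = 64, temperature = 0.0)
prompt = alpaca_prompt.format("Continue the fibonacci sequence.", "1, 1, 2, 3, 5, 8", "")
result = llm.generate([prompt], sampling)
print(result[0].outputs[0].text)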

"""### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
"""

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if True: model.push_to_hub_gguf("Arun1982/LLama3-LoRA", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if True: model.push_to_hub_gguf("Arun1982/LLama3-LoRA", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("Arun1982/LLama3-LoRA", tokenizer, quantization_method = "q4_k_m", token = "")
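
"""To check a converted file locally, a hedged sketch with `llama-cpp-python` (not part of the original notebook; it assumes `pip install llama-cpp-python` and that the GGUF file path below matches what the conversion produced):"""

# Hedged sketch: load the q4_k_m GGUF and run one prompt on CPU.
from llama_cpp import Llama

gguf_llm = Llama(model_path = "model-unsloth-Q4_K_M.gguf", n_ctx = 2048)
prompt = alpaca_prompt.format("Continue the fibonacci sequence.", "1, 1, 2, 3, 5, 8", "")
out = gguf_llm(prompt, max_tokens = 64)
print(out["choices"][0]["text"])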

"""Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs, want to keep updated with the latest LLM stuff, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
<a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
<a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>
"""