Arun Kumar Tiwary committed
Commit
6425303
1 Parent(s): 4a7f3ab

Upload Copy_of_Alpaca_+_Llama_3_8b_full_example.ipynb

Copy_of_Alpaca_+_Llama_3_8b_full_example.ipynb ADDED
@@ -0,0 +1,551 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To run this, press \"*Runtime*\" and press \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n",
+ "<div class=\"align-center\">\n",
+ " <a href=\"https://github.com/unslothai/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
+ " <a href=\"https://discord.gg/u54VK8m8tk\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord button.png\" width=\"145\"></a>\n",
+ " <a href=\"https://ko-fi.com/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png\" width=\"145\"></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a></i> ⭐\n",
+ "</div>\n",
+ "\n",
+ "To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).\n",
+ "\n",
+ "You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for Llama.cpp).\n",
+ "\n",
+ "**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was trained on 2 trillion.**"
+ ],
+ "metadata": {
+ "id": "IqM-T1RTzY6C"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2eSvM9zX_2d3"
+ },
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "# Installs Unsloth, Xformers (Flash Attention) and all other packages!\n",
+ "!pip install \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\"\n",
+ "!pip install --no-deps \"xformers<0.0.26\" trl peft accelerate bitsandbytes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc\n",
+ "* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.\n",
+ "* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.\n",
+ "* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.\n",
+ "* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models."
+ ],
+ "metadata": {
+ "id": "r2v_X2fA0Df5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "QmUBVEnvCDJv"
+ },
+ "outputs": [],
+ "source": [
+ "from unsloth import FastLanguageModel\n",
+ "import torch\n",
+ "max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!\n",
+ "dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+\n",
+ "load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.\n",
+ "\n",
+ "# 4bit pre quantized models we support for 4x faster downloading + no OOMs.\n",
+ "fourbit_models = [\n",
+ "    \"unsloth/mistral-7b-bnb-4bit\",\n",
+ "    \"unsloth/mistral-7b-instruct-v0.2-bnb-4bit\",\n",
+ "    \"unsloth/llama-2-7b-bnb-4bit\",\n",
+ "    \"unsloth/gemma-7b-bnb-4bit\",\n",
+ "    \"unsloth/gemma-7b-it-bnb-4bit\", # Instruct version of Gemma 7b\n",
+ "    \"unsloth/gemma-2b-bnb-4bit\",\n",
+ "    \"unsloth/gemma-2b-it-bnb-4bit\", # Instruct version of Gemma 2b\n",
+ "    \"unsloth/llama-3-8b-bnb-4bit\", # [NEW] 15 Trillion token Llama-3\n",
+ "] # More models at https://huggingface.co/unsloth\n",
+ "\n",
+ "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+ "    model_name = \"unsloth/llama-3-8b-bnb-4bit\",\n",
+ "    max_seq_length = max_seq_length,\n",
+ "    dtype = dtype,\n",
+ "    load_in_4bit = load_in_4bit,\n",
+ "    token = \"\", # use one if using gated models like meta-llama/Llama-2-7b-hf\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"
+ ],
+ "metadata": {
+ "id": "SXd9bTZd1aaL"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "6bZsfBuZDeCL"
+ },
+ "outputs": [],
+ "source": [
+ "model = FastLanguageModel.get_peft_model(\n",
+ "    model,\n",
+ "    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128\n",
+ "    target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
+ "                      \"gate_proj\", \"up_proj\", \"down_proj\",],\n",
+ "    lora_alpha = 16,\n",
+ "    lora_dropout = 0, # Supports any, but = 0 is optimized\n",
+ "    bias = \"none\", # Supports any, but = \"none\" is optimized\n",
+ "    # [NEW] \"unsloth\" uses 30% less VRAM, fits 2x larger batch sizes!\n",
+ "    use_gradient_checkpointing = \"unsloth\", # True or \"unsloth\" for very long context\n",
+ "    random_state = 3407,\n",
+ "    use_rslora = False, # We support rank stabilized LoRA\n",
+ "    loftq_config = None, # And LoftQ\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "<a name=\"Data\"></a>\n",
+ "### Data Prep\n",
+ "We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of 52K of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.\n",
+ "\n",
+ "**[NOTE]** To train only on completions (ignoring the user's input), read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only); a hedged sketch using TRL's collator is included after the data prep cell below.\n",
+ "\n",
+ "**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!\n",
+ "\n",
+ "If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).\n",
+ "\n",
+ "For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)."
+ ],
+ "metadata": {
+ "id": "vITh0KVJ10qX"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "LjY75GoYUCB8"
+ },
+ "outputs": [],
+ "source": [
+ "alpaca_prompt = \"\"\"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n",
+ "\n",
+ "### Instruction:\n",
+ "{}\n",
+ "\n",
+ "### Input:\n",
+ "{}\n",
+ "\n",
+ "### Response:\n",
+ "{}\"\"\"\n",
+ "\n",
+ "EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN\n",
+ "def formatting_prompts_func(examples):\n",
+ "    instructions = examples[\"instruction\"]\n",
+ "    inputs = examples[\"input\"]\n",
+ "    outputs = examples[\"output\"]\n",
+ "    texts = []\n",
+ "    for instruction, input, output in zip(instructions, inputs, outputs):\n",
+ "        # Must add EOS_TOKEN, otherwise your generation will go on forever!\n",
+ "        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN\n",
+ "        texts.append(text)\n",
+ "    return { \"text\" : texts, }\n",
+ "pass\n",
+ "\n",
+ "from datasets import load_dataset\n",
+ "dataset = load_dataset(\"yahma/alpaca-cleaned\", split = \"train\")\n",
+ "dataset = dataset.map(formatting_prompts_func, batched = True,)"
+ ]
+ },
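+ {
+ "cell_type": "markdown",
+ "source": [
+ "If you want to train on completions only (see the note above), one option is TRL's `DataCollatorForCompletionOnlyLM`. The next cell is a minimal sketch, assuming the Alpaca prompt defined above; the `response_template` string is an assumption that must match the prompt exactly, and it requires `packing = False` in the `SFTTrainer` further down."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional sketch: mask the instruction/input tokens so the loss is computed on completions only.\n",
+ "# Assumes the Alpaca prompt above; the response_template must match the prompt text exactly.\n",
+ "from trl import DataCollatorForCompletionOnlyLM\n",
+ "\n",
+ "response_template = \"### Response:\\n\"\n",
+ "collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer = tokenizer)\n",
+ "# If template matching fails for your tokenizer, TRL's docs suggest passing token ids instead:\n",
+ "# collator = DataCollatorForCompletionOnlyLM(tokenizer.encode(response_template, add_special_tokens = False), tokenizer = tokenizer)\n",
+ "\n",
+ "# Then pass data_collator = collator to the SFTTrainer below and keep packing = False."
+ ]
+ },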
+ {
+ "cell_type": "markdown",
+ "source": [
+ "<a name=\"Train\"></a>\n",
+ "### Train the model\n",
+ "Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and leave `max_steps` at its default of `-1` so it no longer caps training (a hedged sketch follows the trainer cell below). We also support TRL's `DPOTrainer`!"
+ ],
+ "metadata": {
+ "id": "idAEIeSQ3xdS"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "95_Nn-89DhsL"
+ },
+ "outputs": [],
+ "source": [
+ "from trl import SFTTrainer\n",
+ "from transformers import TrainingArguments\n",
+ "\n",
+ "trainer = SFTTrainer(\n",
+ "    model = model,\n",
+ "    tokenizer = tokenizer,\n",
+ "    train_dataset = dataset,\n",
+ "    dataset_text_field = \"text\",\n",
+ "    max_seq_length = max_seq_length,\n",
+ "    dataset_num_proc = 2,\n",
+ "    packing = False, # Can make training 5x faster for short sequences.\n",
+ "    args = TrainingArguments(\n",
+ "        per_device_train_batch_size = 2,\n",
+ "        gradient_accumulation_steps = 4,\n",
+ "        warmup_steps = 5,\n",
+ "        max_steps = 60,\n",
+ "        learning_rate = 2e-4,\n",
+ "        fp16 = not torch.cuda.is_bf16_supported(),\n",
+ "        bf16 = torch.cuda.is_bf16_supported(),\n",
+ "        logging_steps = 1,\n",
+ "        optim = \"adamw_8bit\",\n",
+ "        weight_decay = 0.01,\n",
+ "        lr_scheduler_type = \"linear\",\n",
+ "        seed = 3407,\n",
+ "        output_dir = \"outputs\",\n",
+ "    ),\n",
+ ")"
+ ]
+ },
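+ {
+ "cell_type": "markdown",
+ "source": [
+ "As mentioned above, the 60-step cap is only for the demo. The next cell is a hedged sketch of the full-run settings: `num_train_epochs = 1` with `max_steps = -1` (the `TrainingArguments` default, meaning no step cap). A full epoch over the 52K Alpaca examples will take much longer than the demo on a free T4."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: arguments for a full one-epoch run instead of the 60-step demo.\n",
+ "# max_steps = -1 (the default) disables the step cap, so num_train_epochs takes effect.\n",
+ "full_run_args = TrainingArguments(\n",
+ "    per_device_train_batch_size = 2,\n",
+ "    gradient_accumulation_steps = 4,\n",
+ "    warmup_steps = 5,\n",
+ "    num_train_epochs = 1,\n",
+ "    max_steps = -1,\n",
+ "    learning_rate = 2e-4,\n",
+ "    fp16 = not torch.cuda.is_bf16_supported(),\n",
+ "    bf16 = torch.cuda.is_bf16_supported(),\n",
+ "    logging_steps = 1,\n",
+ "    optim = \"adamw_8bit\",\n",
+ "    weight_decay = 0.01,\n",
+ "    lr_scheduler_type = \"linear\",\n",
+ "    seed = 3407,\n",
+ "    output_dir = \"outputs\",\n",
+ ")\n",
+ "# To use it, rebuild the SFTTrainer above with args = full_run_args."
+ ]
+ },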
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2ejIt2xSNKKp",
+ "cellView": "form"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Show current memory stats\n",
+ "gpu_stats = torch.cuda.get_device_properties(0)\n",
+ "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
+ "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
+ "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
+ "print(f\"{start_gpu_memory} GB of memory reserved.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "yqxqAZ7KJ4oL"
+ },
+ "outputs": [],
+ "source": [
+ "trainer_stats = trainer.train()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pCqnaKmlO1U9",
+ "cellView": "form"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Show final memory and time stats\n",
+ "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
+ "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
+ "used_percentage = round(used_memory / max_memory * 100, 3)\n",
+ "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
+ "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
+ "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
+ "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
+ "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
+ "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
+ "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "<a name=\"Inference\"></a>\n",
+ "### Inference\n",
+ "Let's run the model! You can change the instruction and input - leave the output blank!"
+ ],
+ "metadata": {
+ "id": "ekOmTR1hSNcr"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# alpaca_prompt = Copied from above\n",
+ "FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n",
+ "inputs = tokenizer(\n",
+ "[\n",
+ "    alpaca_prompt.format(\n",
+ "        \"Continue the fibonacci sequence.\", # instruction\n",
+ "        \"1, 1, 2, 3, 5, 8\", # input\n",
+ "        \"\", # output - leave this blank for generation!\n",
+ "    )\n",
+ "], return_tensors = \"pt\").to(\"cuda\")\n",
+ "\n",
+ "outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)\n",
+ "tokenizer.batch_decode(outputs)"
+ ],
+ "metadata": {
+ "id": "kR3gIAX-SM2q"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!"
+ ],
+ "metadata": {
+ "id": "CrSvZObor0lY"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# alpaca_prompt = Copied from above\n",
+ "FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n",
+ "inputs = tokenizer(\n",
+ "[\n",
+ "    alpaca_prompt.format(\n",
+ "        \"Continue the fibonacci sequence.\", # instruction\n",
+ "        \"1, 1, 2, 3, 5, 8\", # input\n",
+ "        \"\", # output - leave this blank for generation!\n",
+ "    )\n",
+ "], return_tensors = \"pt\").to(\"cuda\")\n",
+ "\n",
+ "from transformers import TextStreamer\n",
+ "text_streamer = TextStreamer(tokenizer)\n",
+ "_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)"
+ ],
+ "metadata": {
+ "id": "e2pEuRb1r2Vg"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "<a name=\"Save\"></a>\n",
+ "### Saving, loading finetuned models\n",
+ "To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.\n",
+ "\n",
+ "**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!"
+ ],
+ "metadata": {
+ "id": "uMuVrWbjAzhc"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# model.save_pretrained(\"lora_model\") # Local saving\n",
+ "# tokenizer.save_pretrained(\"lora_model\")\n",
+ "model.push_to_hub(\"Arun1982/LLama3-LoRA\", token = \"\") # Online saving - supply your own HF write token\n",
+ "tokenizer.push_to_hub(\"Arun1982/LLama3-LoRA\", token = \"\") # Online saving"
+ ],
+ "metadata": {
+ "id": "upcOlWe7A1vc"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:"
+ ],
+ "metadata": {
+ "id": "AEEcJ4qfC7Lp"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "if False:\n",
+ "    from unsloth import FastLanguageModel\n",
+ "    model, tokenizer = FastLanguageModel.from_pretrained(\n",
+ "        model_name = \"lora_model\", # YOUR MODEL YOU USED FOR TRAINING\n",
+ "        max_seq_length = max_seq_length,\n",
+ "        dtype = dtype,\n",
+ "        load_in_4bit = load_in_4bit,\n",
+ "    )\n",
+ "    FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n",
+ "\n",
+ "# alpaca_prompt = You MUST copy from above!\n",
+ "\n",
+ "inputs = tokenizer(\n",
+ "[\n",
+ "    alpaca_prompt.format(\n",
+ "        \"What is a famous tall tower in Paris?\", # instruction\n",
+ "        \"\", # input\n",
+ "        \"\", # output - leave this blank for generation!\n",
+ "    )\n",
+ "], return_tensors = \"pt\").to(\"cuda\")\n",
+ "\n",
+ "outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)\n",
+ "tokenizer.batch_decode(outputs)"
+ ],
+ "metadata": {
+ "id": "MKX_XKs_BNZR"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**."
+ ],
+ "metadata": {
+ "id": "QQMjaNrjsU5_"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "if False:\n",
+ "    # I highly do NOT suggest this - use Unsloth if possible\n",
+ "    from peft import AutoPeftModelForCausalLM\n",
+ "    from transformers import AutoTokenizer\n",
+ "    model = AutoPeftModelForCausalLM.from_pretrained(\n",
+ "        \"lora_model\", # YOUR MODEL YOU USED FOR TRAINING\n",
+ "        load_in_4bit = load_in_4bit,\n",
+ "    )\n",
+ "    tokenizer = AutoTokenizer.from_pretrained(\"lora_model\")"
+ ],
+ "metadata": {
+ "id": "yFfaXG0WsQuE"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Saving to float16 for VLLM\n",
+ "\n",
+ "We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. A hedged vLLM loading sketch follows the cell below."
+ ],
+ "metadata": {
+ "id": "f422JgM9sdVT"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Merge to 16bit\n",
+ "if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_16bit\",)\n",
+ "if True: model.push_to_hub_merged(\"Arun1982/LLama3-LoRA\", tokenizer, save_method = \"merged_16bit\", token = \"\")\n",
+ "\n",
+ "# Merge to 4bit\n",
+ "if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_4bit\",)\n",
+ "if True: model.push_to_hub_merged(\"Arun1982/LLama3-LoRA\", tokenizer, save_method = \"merged_4bit_forced\", token = \"\")\n",
+ "\n",
+ "# Just LoRA adapters\n",
+ "if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"lora\",)\n",
+ "if True: model.push_to_hub_merged(\"Arun1982/LLama3-LoRA\", tokenizer, save_method = \"lora\", token = \"\")"
+ ],
+ "metadata": {
+ "id": "iHjt_SMYsd3P"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
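+ {
+ "cell_type": "markdown",
+ "source": [
+ "If the `merged_16bit` push above succeeded, the merged checkpoint can be served with vLLM. The next cell is a minimal sketch, assuming `pip install vllm`, that the full merged (not LoRA-only) weights live at `Arun1982/LLama3-LoRA`, and that enough GPU memory is free (you may need to restart the runtime first)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: load the merged float16 checkpoint with vLLM.\n",
+ "# Assumes `pip install vllm` and that the merged_16bit upload above succeeded.\n",
+ "from vllm import LLM, SamplingParams\n",
+ "\n",
+ "llm = LLM(model = \"Arun1982/LLama3-LoRA\", max_model_len = 2048)\n",
+ "sampling_params = SamplingParams(temperature = 0.0, max_tokens = 64)\n",
+ "prompt = alpaca_prompt.format(\"Continue the fibonacci sequence.\", \"1, 1, 2, 3, 5, 8\", \"\")\n",
+ "outputs = llm.generate([prompt], sampling_params)\n",
+ "print(outputs[0].outputs[0].text)"
+ ]
+ },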
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### GGUF / llama.cpp Conversion\n",
+ "To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.\n",
+ "\n",
+ "Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):\n",
+ "* `q8_0` - Fast conversion. High resource use, but generally acceptable.\n",
+ "* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.\n",
+ "* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K."
+ ],
+ "metadata": {
+ "id": "TCv4vXHd61i7"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Save to 8bit Q8_0\n",
+ "if False: model.save_pretrained_gguf(\"model\", tokenizer,)\n",
+ "if True: model.push_to_hub_gguf(\"Arun1982/LLama3-LoRA\", tokenizer, token = \"\")\n",
+ "\n",
+ "# Save to 16bit GGUF\n",
+ "if False: model.save_pretrained_gguf(\"model\", tokenizer, quantization_method = \"f16\")\n",
+ "if True: model.push_to_hub_gguf(\"Arun1982/LLama3-LoRA\", tokenizer, quantization_method = \"f16\", token = \"\")\n",
+ "\n",
+ "# Save to q4_k_m GGUF\n",
+ "if False: model.save_pretrained_gguf(\"model\", tokenizer, quantization_method = \"q4_k_m\")\n",
+ "if True: model.push_to_hub_gguf(\"Arun1982/LLama3-LoRA\", tokenizer, quantization_method = \"q4_k_m\", token = \"\")"
+ ],
+ "metadata": {
+ "id": "FqfebeAdT073"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html). A hedged `llama-cpp-python` sketch follows this cell."
+ ],
+ "metadata": {
+ "id": "bDp0zNpwe6U_"
+ }
+ },
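+ {
+ "cell_type": "markdown",
+ "source": [
+ "One way to try the exported GGUF from Python is the `llama-cpp-python` bindings for `llama.cpp`. The next cell is a minimal sketch, assuming `pip install llama-cpp-python` and that the q4_k_m export produced a local `model-unsloth-Q4_K_M.gguf` file (adjust the path to wherever your GGUF was saved or downloaded)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: run the exported GGUF with llama-cpp-python.\n",
+ "# Assumes `pip install llama-cpp-python` and a local GGUF file from the export above.\n",
+ "from llama_cpp import Llama\n",
+ "\n",
+ "llm = Llama(model_path = \"model-unsloth-Q4_K_M.gguf\", n_ctx = 2048)\n",
+ "prompt = alpaca_prompt.format(\"Continue the fibonacci sequence.\", \"1, 1, 2, 3, 5, 8\", \"\")\n",
+ "result = llm(prompt, max_tokens = 64)\n",
+ "print(result[\"choices\"][0][\"text\"])"
+ ]
+ },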
+ {
+ "cell_type": "markdown",
+ "source": [
+ "And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! Join it to report bugs, get help, collaborate on projects, or keep up with the latest LLM news.\n",
+ "\n",
+ "Some other links:\n",
+ "1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)\n",
+ "2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)\n",
+ "3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)\n",
+ "4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)\n",
+ "5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)\n",
+ "6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!\n",
+ "7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)\n",
+ "8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)\n",
+ "\n",
+ "<div class=\"align-center\">\n",
+ " <a href=\"https://github.com/unslothai/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
+ " <a href=\"https://discord.gg/u54VK8m8tk\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord.png\" width=\"145\"></a>\n",
+ " <a href=\"https://ko-fi.com/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png\" width=\"145\"></a> Support our work if you can! Thanks!\n",
+ "</div>"
+ ],
+ "metadata": {
+ "id": "Zt9CHJqO6p30"
+ }
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "provenance": [],
+ "gpuType": "T4"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+ }