Unleashing the Power of Unsloth and QLoRA: Redefining Language Model Fine-Tuning
Introduction
In the dynamic realm of language model optimization, a revolutionary force has emerged - Unsloth. This avant-garde framework, born from the minds of Daniel and Michael Han, is set to redefine the landscape of fine-tuning. As we delve into the definitions, advantages, and benefits, prepare to witness a paradigm shift in the way we approach language model optimization.
Definitions:
Unsloth is not just a library; it's a technological symphony orchestrated for the fine-tuning and training of large language models (LLMs). Specifically designed for optimal performance, Unsloth introduces innovative techniques to enhance speed, reduce memory consumption, and elevate accuracy during the fine-tuning process.
Advantages of Unsloth:
Speed Redefined: Unsloth boasts up to a 30x increase in training speed; a fine-tuning run on the Alpaca dataset reportedly drops from the conventional 85 hours to roughly 3. This acceleration is a testament to Unsloth's commitment to efficiency and productivity.
Memory Efficiency: A game-changer in the memory domain, Unsloth promises a 60% reduction in memory usage. This not only enables the handling of larger batches but also ensures a seamless fine-tuning process without compromising on performance.
Accuracy Amplified: The authors proudly declare a 0% loss in accuracy, with an additional option for a +20% increase in accuracy using their MAX offering. This commitment to maintaining and elevating accuracy levels sets Unsloth apart in the competitive landscape.
Hardware Compatibility: Unsloth extends its reach by supporting NVIDIA, Intel, and AMD GPUs. This inclusivity ensures accessibility to a wide array of hardware configurations, making it a versatile choice for developers across different platforms.
Benefits of Fine-Tuning with Unsloth and QLoRA:
Efficiency Unleashed: Unsloth reduces weight upscaling during QLoRA, which means fewer upcast operations and a leaner memory footprint. This efficiency, coupled with using bfloat16 directly, empowers developers to achieve fine-tuning goals faster and with fewer resource demands.
Innovative Attention Mechanisms: Unsloth integrates Flash Attention via xformers and Tri Dao's implementation, contributing to optimized transformer models. This innovative approach to attention mechanisms ensures that fine-tuning is not merely a technical task but a creative endeavor.
Causal Mask for Speed: Unsloth adopts a causal mask to speed up training instead of building a separate attention mask, reimagining a traditional part of the pipeline in favor of faster fine-tuning.
Optimized Cross Entropy Loss: Unsloth doesn't just fine-tune; it fine-tunes with precision. Its optimized cross-entropy loss computation significantly reduces memory consumption without compromising accuracy. A minimal sketch of these last two ideas follows this list.
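To make those two ideas concrete, here is a minimal PyTorch sketch of the general techniques, not Unsloth's actual fused kernels: causal attention requested through a flag rather than a materialized mask, and the shifted cross-entropy loss whose computation Unsloth optimizes. The shapes and vocabulary size below are illustrative assumptions.
import torch
import torch.nn.functional as F

# Illustrative shapes (assumptions, not Unsloth defaults)
batch, heads, seq, head_dim, vocab = 2, 8, 16, 64, 32000
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# Causal attention via a flag: no explicit (seq x seq) attention mask is materialized in user code
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Shifted cross-entropy for next-token prediction (the computation Unsloth's kernels make more memory-friendly)
logits = torch.randn(batch, seq, vocab)
labels = torch.randint(0, vocab, (batch, seq))
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab),  # predictions for positions 0 .. seq-2
    labels[:, 1:].reshape(-1),             # targets shifted one position to the left
)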
Code Implementation
Let's dive into the code for fine-tuning with Unsloth and QLoRA.
Step 1: Install Libraries
# Import the PyTorch library
import torch

# Get the major and minor compute capability of the current CUDA device (GPU)
major_version, minor_version = torch.cuda.get_device_capability()

if major_version >= 8:
    # Ampere or Hopper architecture (RTX 30xx, RTX 40xx, A100, H100, L40, etc.)
    # Install the Unsloth build for Ampere/Hopper GPUs from GitHub
    !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git" -q
else:
    # Older GPUs (V100, Tesla T4, RTX 20xx, etc.)
    # Install the Unsloth build for older GPUs from GitHub
    !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git" -q

# Install the Hugging Face Transformers library from GitHub, which allows native 4-bit loading
!pip install "git+https://github.com/huggingface/transformers.git" -q
!pip install trl datasets -q
Step 2: Import Libraries and Load the Model
# Import FastLanguageModel from the Unsloth library
from unsloth import FastLanguageModel

max_seq_length = 2048  # Maximum sequence length; can be set arbitrarily (RoPE scaling is supported automatically)
dtype = None  # Data type; None auto-detects (Float16 for Tesla T4/V100, Bfloat16 for Ampere+)
load_in_4bit = True  # Reduce memory usage with 4-bit quantization; set to False to disable
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # Use "unsloth/mistral-7b" for 16-bit loading
    max_seq_length=max_seq_length,  # Maximum sequence length
    dtype=dtype,  # Data type (auto-detected when None)
    load_in_4bit=load_in_4bit,  # Load the weights in 4-bit
    # token="hf_...",  # Provide a Hugging Face token when using a gated model (e.g., meta-llama/Llama-2-7b-hf)
)
Next, add a LoRA adapter so that only 1-10% of all parameters are updated (a quick way to verify this is shown after the code below).
model = FastLanguageModel.get_peft_model(
    model,  # The base model loaded above
    r=16,  # LoRA rank; choose any positive number (8, 16, 32, 64, 128 are common). Smaller values modify fewer parameters.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # Modules to which LoRA is applied
    lora_alpha=16,  # LoRA alpha; determines the strength of the applied LoRA updates
    lora_dropout=0,  # Dropout rate for LoRA; currently only 0 is supported
    bias="none",  # Bias setting; currently only "none" is supported
    use_gradient_checkpointing=True,  # Use gradient checkpointing to improve memory efficiency
    random_state=3407,  # Seed for reproducibility
    max_seq_length=max_seq_length,  # Maximum sequence length
)
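To confirm how small the trainable fraction actually is, you can print the trainable parameter count. This is a quick check, assuming the object returned by get_peft_model behaves like a standard PEFT-wrapped model and exposes print_trainable_parameters:
# Quick sanity check (assumes the model is a standard PEFT wrapper exposing this helper)
model.print_trainable_parameters()
# Prints the number of trainable parameters, the total parameter count, and the trainable percentage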
Step 3: Load Dataset
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
# Define the prompt format for the Alpaca dataset
def formatting_prompts_func(examples):
    # Format each example in the dataset into the Alpaca prompt
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Combine instruction, input, and output according to the prompt format
        text = alpaca_prompt.format(instruction, input, output)
        texts.append(text)
    # Return the list of formatted texts
    return {"text": texts}
from datasets import load_dataset
# Import the load_dataset function from the datasets library
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
# Load the training data of the cleaned version of the Alpaca dataset from yahma
dataset = dataset.map(formatting_prompts_func, batched=True,)
# Apply the formatting_prompts_func function to the dataset with batch processing
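One caveat worth checking before training: if the formatted text never ends with the tokenizer's end-of-sequence token, the fine-tuned model may not learn when to stop generating. A hedged variant of the formatting function is sketched below; EOS_TOKEN and formatting_prompts_func_with_eos are illustrative names, not part of the original code:
EOS_TOKEN = tokenizer.eos_token  # End-of-sequence token from the tokenizer loaded earlier

def formatting_prompts_func_with_eos(examples):
    # Same formatting as above, but append EOS so the model learns where a response ends
    texts = [alpaca_prompt.format(instr, inp, out) + EOS_TOKEN
             for instr, inp, out in zip(examples["instruction"], examples["input"], examples["output"])]
    return {"text": texts}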
Step 4: Train the Model
# Import SFTTrainer from the TRL library
from trl import SFTTrainer
# Import TrainingArguments from the Transformers library
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,  # The model to be trained
    train_dataset=dataset,  # The training dataset
    dataset_text_field="text",  # Name of the text field in the dataset
    max_seq_length=max_seq_length,  # Maximum sequence length
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Training batch size per device
        gradient_accumulation_steps=4,  # Number of gradient accumulation steps
        warmup_steps=5,  # Number of warm-up steps
        max_steps=20,  # Maximum number of training steps
        learning_rate=2e-4,  # Learning rate
        fp16=not torch.cuda.is_bf16_supported(),  # Use fp16 when bfloat16 is not supported
        bf16=torch.cuda.is_bf16_supported(),  # Use bfloat16 when the GPU supports it
        logging_steps=1,  # Log every step
        optim="adamw_8bit",  # 8-bit AdamW optimizer
        weight_decay=0.01,  # Weight decay
        lr_scheduler_type="linear",  # Linear learning rate scheduler
        seed=3407,  # Random seed
        output_dir="outputs",  # Output directory
    ),
)
Step 5: Display Current Memory Statistics
gpu_stats = torch.cuda.get_device_properties(0)  # Properties of the GPU device at index 0
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)  # Maximum reserved GPU memory in GB
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)  # Total GPU memory in GB
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")  # Display the GPU name and total memory
print(f"{start_gpu_memory} GB of memory reserved.")  # Display the currently reserved memory
Step 6: Execute the Train Method
trainer_stats = trainer.train()
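After training completes, it is worth comparing the run against the baseline captured in Step 5. Here is a minimal sketch; train_runtime is a standard key in the metrics dictionary returned by Hugging Face's Trainer, while the percentage calculation is my own illustration rather than Unsloth output:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)  # Peak reserved memory after training, in GB
used_for_training = round(used_memory - start_gpu_memory, 3)  # Memory attributable to the training run itself
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"Peak reserved memory = {used_memory} GB ({round(used_memory / max_memory * 100, 3)} % of GPU memory).")
print(f"Peak reserved memory for training = {used_for_training} GB.")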
Step 7: Convert the Model to GGUF
def colab_quantize_to_gguf(save_directory, quantization_method="q4_k_m"):
    # Convert a saved Hugging Face model directory to GGUF format
    from transformers.models.llama.modeling_llama import logger
    import os

    # Warn that the function is still in development mode and encourage reporting any issues
    logger.warning_once(
        "Unsloth: `colab_quantize_to_gguf` is still in development mode.\n"
        "If anything errors or breaks, please file a ticket on Github.\n"
        "Also, if you used this successfully, please tell us on Discord!"
    )

    # From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
    # Currently allowed quantization methods, with a description of each
    ALLOWED_QUANTS = \
    {
        "q2_k"   : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
        "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
        "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
        "q3_k_s" : "Uses Q3_K for all tensors",
        "q4_0"   : "Original quant method, 4-bit.",
        "q4_1"   : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
        "q4_k_m" : "Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
        "q4_k_s" : "Uses Q4_K for all tensors",
        "q5_0"   : "Higher accuracy, higher resource usage and slower inference.",
        "q5_1"   : "Even higher accuracy, resource usage and slower inference.",
        "q5_k_m" : "Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
        "q5_k_s" : "Uses Q5_K for all tensors",
        "q6_k"   : "Uses Q8_K for all tensors",
        "q8_0"   : "Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.",
    }

    if quantization_method not in ALLOWED_QUANTS.keys():
        # Raise an error listing the supported quantization methods
        error = f"Unsloth: Quant method = [{quantization_method}] not supported. Choose from below:\n"
        for key, value in ALLOWED_QUANTS.items():
            error += f"[{key}] => {value}\n"
        raise RuntimeError(error)

    # Display information about the conversion process
    print_info = \
        f"==((====))==  Unsloth: Conversion from QLoRA to GGUF information\n"\
        f"   \\\\   /|    [0] Installing llama.cpp will take 3 minutes.\n"\
        f"O^O/ \\_/ \\    [1] Converting HF to GGUF 16bits will take 3 minutes.\n"\
        f"\\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.\n"\
        f' "-____-"     In total, you will have to wait around 26 minutes.\n'
    print(print_info)

    if not os.path.exists("llama.cpp"):
        # Install llama.cpp if it is not already present
        print("Unsloth: [0] Installing llama.cpp. This will take 3 minutes...")
        !git clone https://github.com/ggerganov/llama.cpp
        !cd llama.cpp && make clean && LLAMA_CUBLAS=1 make -j
        !pip install gguf protobuf

    # Convert the Hugging Face checkpoint to a 16-bit GGUF file
    print("Unsloth: Starting conversion from HF to GGUF 16bit...")
    # print("Unsloth: [1] Converting HF into GGUF 16bit. This will take 3 minutes...")
    !python llama.cpp/convert.py {save_directory} \
        --outfile {save_directory}-unsloth.gguf \
        --outtype f16

    # Quantize the 16-bit GGUF file with the requested method
    print("Unsloth: Starting conversion from GGUF 16bit to q4_k_m...")
    # print("Unsloth: [2] Converting GGUF 16bit into q4_k_m. This will take 20 minutes...")
    final_location = f"./{save_directory}-{quantization_method}-unsloth.gguf"
    !./llama.cpp/quantize ./{save_directory}-unsloth.gguf \
        {final_location} {quantization_method}

    # Display the output location of the converted file
    print(f"Unsloth: Output location: {final_location}")
# Import the unsloth_save_model function from the Unsloth library
from unsloth import unsloth_save_model
# unsloth_save_model takes the same arguments as model.save_pretrained
unsloth_save_model(model, tokenizer, "output_model", push_to_hub=False, token=None)
# Save the model and tokenizer as "output_model" without pushing to the Hugging Face Hub
colab_quantize_to_gguf("output_model", quantization_method="q4_k_m")
# Convert "output_model" to GGUF format using the "q4_k_m" quantization method
Conclusion
In closing, our exploration with Unsloth has been a captivating journey into the frontier of advanced language models and AI innovations. From Ampere and Hopper architectures to the artistry of Low-Rank Adaptation adapters, we navigated the realms of data preparation, model training, and memory optimization.
The Alpaca dataset, formatted and fed through TRL's SFTTrainer, served as our canvas. We delved into memory usage intricacies, training statistics, and the realm of GGUF transformations, showcasing technical prowess and creativity.
As our article concludes, the Unsloth library stands as a testament to the fusion of technology and creativity. Our journey's final act saw the model transformed into GGUF format, highlighting the adaptability of our tools.
This exploration wasn't just about code; it was a quest for innovation and inspiration. Unsloth's commitment to originality and storytelling invites us to continue pushing the boundaries in the ever-evolving landscape of language models and AI.
Stay connected and support my work through various platforms:
Medium: You can read my latest articles and insights on Medium at https://medium.com/@andysingal
Paypal: Enjoyed my article? Buy me a coffee! https://paypal.me/alphasingal?country.x=US&locale.x=en_US
Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.
Resources: