---
library_name: transformers
tags: []
---

# Fine-Tuning LLaMA-2-7b with QLoRA on a Custom Dataset

This repository provides a setup and script for fine-tuning the LLaMA-2-7b model using QLoRA (Quantized Low-Rank Adaptation) on custom datasets. The script is designed for efficient and flexible training of large language models (LLMs), leveraging techniques such as 4-bit quantization and LoRA.

## Overview

The script fine-tunes a pre-trained LLaMA-2-7b model on a custom dataset, applying QLoRA to optimize performance. It uses the `transformers`, `datasets`, `peft`, and `trl` libraries for model management, data processing, and training. The setup supports mixed-precision training, gradient checkpointing, and 4-bit quantization to make the fine-tuning process more efficient.

## Components

### 1. Dependencies

Ensure the following libraries are installed:

- `torch`
- `datasets`
- `transformers`
- `peft`
- `trl`
- `bitsandbytes` (required for 4-bit quantization)
- `accelerate`

Install them with pip if they are not already available:

```bash
pip install torch datasets transformers peft trl bitsandbytes accelerate
```

### 2. Model and Dataset

- **Model**: The base model is `LLaMA-2-7b`. The script loads it from a specified local directory.
- **Dataset**: The training data is loaded from a specified directory and must contain a `"text"` field holding the training examples.

### 3. QLoRA Configuration

The QLoRA parameters configure the quantization and adaptation process:

- **LoRA Attention Dimension (`lora_r`)**: 64
- **LoRA Alpha Parameter (`lora_alpha`)**: 16
- **LoRA Dropout Probability (`lora_dropout`)**: 0.1

### 4. BitsAndBytes Configuration

Quantization settings for the base model:

- **Use 4-bit Precision**: True
- **Compute Data Type**: `float16`
- **Quantization Type**: `nf4`
- **Nested Quantization**: False

### 5. Training Configuration

Training parameters are defined as follows:

- **Output Directory**: `./results`
- **Number of Epochs**: 300
- **Batch Size (per device)**: 4
- **Gradient Accumulation Steps**: 1
- **Learning Rate**: 2e-4
- **Weight Decay**: 0.001
- **Optimizer**: `paged_adamw_32bit`
- **Learning Rate Scheduler**: `cosine`
- **Gradient Clipping (`max_grad_norm`)**: 0.3
- **Warmup Ratio**: 0.03
- **Logging Steps**: 25
- **Save Steps**: 0

### 6. Training and Evaluation

The script preprocesses the dataset, initializes the model with QLoRA, and trains it with `SFTTrainer` from the `trl` library. Mixed-precision training and gradient checkpointing are supported to improve training efficiency.

### 7. Usage Instructions

1. **Update File Paths**: Adjust `model_name`, `dataset_name`, and `new_model` according to your environment.
2. **Run the Script**: Execute the script in your Python environment to start fine-tuning:

   ```bash
   python fine_tune_llama.py
   ```

3. **Monitor Training**: Use TensorBoard or a similar tool to monitor training progress.

### 8. Model Saving

After training, the model is saved to the directory given by `new_model`. The trained model can then be loaded for further evaluation or deployment.

## Example Configuration

The configuration below was used for fine-tuning:

- Base model: `NousResearch/Llama-2-7b-chat-hf`
- Dataset: `mlabonne/guanaco-llama2-1k`

Both were saved to the local machine and loaded from there, but you can also download them directly from Hugging Face (see the sketch that follows).
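For reference, here is a minimal, illustrative sketch of pulling both straight from the Hugging Face Hub instead of from local copies (this snippet is not part of the training script below):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the base model, its tokenizer, and the instruction dataset from the Hub.
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
```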
The local-path configuration used in this repository:

```python
model_name = "/data/bio-eng-llm/llm_repo/NousResearch/Llama-2-7b-chat-hf"  # base model: NousResearch/Llama-2-7b-chat-hf
dataset_name = "/data/bio-eng-llm/llm_repo/mlabonne/guanaco-llama2-1k"     # dataset: mlabonne/guanaco-llama2-1k
new_model = "/data/bio-eng-llm/llm_repo/mlabonne/llama-2-7b-miniguanaco"

lora_r = 64
lora_alpha = 16
lora_dropout = 0.1

use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

output_dir = "./results"
num_train_epochs = 300
fp16 = False
bf16 = False
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 0
logging_steps = 25
```

## The Complete Training Script

```python
import os
import sys

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

cwd = os.getcwd()
sys.path.append(cwd)


def setting_directory(depth):
    # Walk `depth` levels up from the current directory and add its parent to sys.path.
    current_dir = os.path.abspath(os.getcwd())
    root_dir = current_dir
    for i in range(depth):
        root_dir = os.path.abspath(os.path.join(root_dir, os.pardir))
    sys.path.append(os.path.dirname(root_dir))
    return root_dir


# The model to train, loaded from a local mirror of the Hugging Face repository
model_name = "/data/bio-eng-llm/llm_repo/NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "/data/bio-eng-llm/llm_repo/mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "/data/bio-eng-llm/llm_repo/mlabonne/llama-2-7b-miniguanaco"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 300

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
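
# With these defaults the effective batch size is
# per_device_train_batch_size * gradient_accumulation_steps = 4 * 1 = 4 per GPU;
# if memory is tight, increase gradient_accumulation_steps rather than the batch size.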

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with the same length
# (saves memory and speeds up training considerably)
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

################################################################################

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")
print(dataset[0].keys())  # print all the field names in the dataset

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)


# Set supervised fine-tuning parameters
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)


tokenized_dataset = dataset.map(preprocess_function, batched=True)

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)
```
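
Note that `trainer.model.save_pretrained(new_model)` stores only the LoRA adapter weights. If you want a standalone checkpoint for inference, a minimal sketch of merging the adapter back into the base model is shown below, reusing `model_name` and `new_model` from the script above (the `-merged` output path is just an illustrative choice, and you need enough memory to hold the fp16 weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model in half precision (without 4-bit quantization this time).
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Attach the trained adapter and fold its weights into the base model.
merged_model = PeftModel.from_pretrained(base_model, new_model).merge_and_unload()

# Save the merged model and tokenizer as a regular standalone checkpoint.
merged_model.save_pretrained(new_model + "-merged")
AutoTokenizer.from_pretrained(model_name).save_pretrained(new_model + "-merged")
```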

## Testing the Fine-Tuned Model on the Dataset

```python
import os
import sys
import json

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# The base model that was fine-tuned
base_model_name = "/data/bio-eng-llm/llm_repo/NousResearch/Llama-2-7b-chat-hf"

cwd = os.getcwd()
sys.path.append(cwd)


def setting_directory(depth):
    # Walk `depth` levels up from the current directory and add its parent to sys.path.
    current_dir = os.path.abspath(os.getcwd())
    root_dir = current_dir
    for i in range(depth):
        root_dir = os.path.abspath(os.path.join(root_dir, os.pardir))
    sys.path.append(os.path.dirname(root_dir))
    return root_dir


# The instruction dataset to use
dataset_name = "/data/bio-eng-llm/llm_repo/mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "/data/bio-eng-llm/llm_repo/mlabonne/llama-2-7b-miniguanaco"

################################################################################
# Loading the fine-tuned model
################################################################################

# Base model identifier (the model that was trained with PEFT)
base_model_name = "NousResearch/Llama-2-7b-chat-hf"

# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Path to the directory containing adapter_config.json and adapter_model.safetensors
fine_tuned_model_path = "/data/bio-eng-llm/llm_repo/mlabonne/llama-2-7b-miniguanaco"

# Load the fine-tuned model (PEFT adapter)
model = PeftModel.from_pretrained(model, fine_tuned_model_path)
print(model)

################################################################################
# Evaluating the fine-tuned model on a small portion of the dataset
################################################################################

# Define paths
base_model_name = "/data/bio-eng-llm/llm_repo/NousResearch/Llama-2-7b-chat-hf"
fine_tuned_model_path = "/data/bio-eng-llm/llm_repo/mlabonne/llama-2-7b-miniguanaco"
dataset_name = "/data/bio-eng-llm/llm_repo/mlabonne/guanaco-llama2-1k"

# Load the dataset
dataset = load_dataset(dataset_name, split="train")

# Initialize the tokenizer and load the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base_model, fine_tuned_model_path)

# Set the model to evaluation mode
model.eval()


# Evaluate the model on a small portion of the dataset
def evaluate_model(dataset, tokenizer, model, sample_size=10, max_length=512, max_new_tokens=50):
    # Select a small portion of the dataset
    subset = dataset.select(range(min(sample_size, len(dataset))))
    results = []

    for example in subset:
        # Tokenize the input
        inputs = tokenizer(
            example['text'],
            return_tensors="pt",
            truncation=True,
            padding='max_length',
            max_length=max_length,
        )

        # Ensure no gradients are calculated during inference
        with torch.no_grad():
            # Generate responses
            outputs = model.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=max_length + max_new_tokens,  # adjust max_length to allow for new tokens
                max_new_tokens=max_new_tokens            # allow generating up to `max_new_tokens`
            )

        # Decode the output
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
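        # Note: for a causal LM, `outputs[0]` contains the (padded) prompt tokens followed by
        # the continuation, so `generated_text` echoes the input text; slice the output with
        # outputs[0][inputs['input_ids'].shape[1]:] if you only want the newly generated tokens.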

        # Append result
        results.append({
            'input_text': example['text'],
            'generated_text': generated_text
        })

    return results


# Evaluate the model on a small portion of the dataset (e.g., 10 samples)
evaluation_results = evaluate_model(dataset, tokenizer, model, sample_size=10)

# Print a few results
for result in evaluation_results:
    print(f"Input Text: {result['input_text']}")
    print(f"Generated Text: {result['generated_text']}")
    print("-" * 50)

# Optionally, save results to a file
with open('evaluation_results.json', 'w') as f:
    json.dump(evaluation_results, f, indent=4)
```

## Pushing the Model to the Hugging Face Hub

Everything was saved locally first and then pushed to Hugging Face. You need your Hugging Face ID and an access token; `Your-Huggingface-ID` and `Your-Huggingface-Token` below are placeholders.

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer, logging
from huggingface_hub import HfApi, Repository, login
from peft import LoraConfig, PeftModel

# Define paths
base_model_name = "/data/bio-eng-llm/llm_repo/NousResearch/Llama-2-7b-chat-hf"
fine_tuned_model_path = "/data/bio-eng-llm/llm_repo/mlabonne/llama-2-7b-miniguanaco"
save_directory = "./fine_tuned_model"  # local directory to save the model
repo_name = "Your-Huggingface-ID/llama-2-7b-miniguanaco"  # replace with your Hugging Face username and repo name

# Step 1: Log in to Hugging Face
print("Logging in to Hugging Face...")
login(token="Your-Huggingface-Token")

# Step 2: Load the tokenizer and model
print("Loading base model and fine-tuned adapters...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base_model, fine_tuned_model_path)

# Step 3: Save the tokenizer and the fine-tuned model locally
print(f"Saving the fine-tuned model to {save_directory}...")
os.makedirs(save_directory, exist_ok=True)
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# Step 4: Push the model to the Hugging Face Hub
print(f"Pushing the model to the Hugging Face Hub: {repo_name}...")
model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)

print("Model pushed successfully!")
```

### Log file after pushing

```bash
Logging in to Hugging Face...
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/forootan/.cache/huggingface/token
Login successful
Loading base model and fine-tuned adapters...
Loading checkpoint shards:   0%|          | 0/2 [00:00
```
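
Once the adapter is on the Hub, it can be pulled back down for inference. The following is a minimal sketch, not part of the original scripts; it assumes the placeholder repo name above and uses the `[INST] ... [/INST]` chat format that the guanaco-llama2 data follows:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in half precision and attach the adapter pushed to the Hub.
base = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map={"": 0}
)
model = PeftModel.from_pretrained(base, "Your-Huggingface-ID/llama-2-7b-miniguanaco")
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")

# Prompt in the Llama-2 chat format used by the training data.
prompt = "[INST] What is a large language model? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```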