input_ids and attention_mask are not of equal size. Can someone help me?

#64
by harsh99 - opened

!pip install flash_attn
!pip install Pillow
!pip install Requests
!pip install torchvision
!pip install transformers
!pip install accelerate
!pip install peft
!pip install datasets
!pip install evaluate
!pip install sacrebleu
!pip install rouge_score
!pip install -q -U bitsandbytes

Notebook Overview

In this notebook, we will cover the following sections:

  1. Load Libraries:
    Import necessary libraries for data handling, model fine-tuning, and evaluation.

  2. Load Dataset:
    Load an open dataset from HuggingFace: Multimodal-Fatima/FGVC_Aircraft_train and Multimodal-Fatima/FGVC_Aircraft_test. This dataset contains images and corresponding textual descriptions.

  3. Load Model and Processor:
    Load a pre-trained multimodal model (Phi-3-vision-128k-instruct) and its associated processor from HuggingFace's model hub.

  4. Inference with Base Model:
    Run inference using the base model to generate captions for the images. This serves as a baseline before fine-tuning.

  5. Fine-tuning:
    Go through the process of fine-tuning the model:

  • Prepare Dataset: Preprocess the dataset for training.
  • Setup DataCollator: Create a custom DataCollator to handle batching and preprocessing.
  • Setup LoRA: Configure LoRA (Low-Rank Adaptation) for efficient training.
  • Training: Fine-tune the model on the training dataset.
  6. Inference with Fine-tuned Model:
    Run inference using the fine-tuned model to generate captions and evaluate the improvements.

1. Load Libraries

import torch
torch.__version__

import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd

import base64
import requests
from PIL import Image
from io import BytesIO

import transformers
from peft import LoraConfig, get_peft_model, PeftModel

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, random_split
from torchvision.transforms.functional import resize, to_pil_image
from torchvision import transforms

from datasets import load_dataset
import evaluate

import matplotlib.pyplot as plt
from textwrap import wrap

#from huggingface_hub import notebook_login
#notebook_login()

wandb for experiment tracking. Comment this out if you don't use wandb.

import wandb

wandb.login()

torch.manual_seed(3)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dtype=torch.bfloat16

batch_size = 1
base_model_id = "microsoft/Phi-3-vision-128k-instruct"
model_dir = "models/peft_adapter"

!rm -rf models

os.makedirs(model_dir, exist_ok=True)

2. Load Dataset

Now let's load an open dataset from Hugging Face: Multimodal-Fatima/FGVC_Aircraft_train and Multimodal-Fatima/FGVC_Aircraft_test. This dataset contains images and corresponding textual descriptions. We will only use the image and clip_tags_ViT_L_14 features for our fine-tuning task.

raw_train_dataset = load_dataset("Multimodal-Fatima/FGVC_Aircraft_train")
raw_test_dataset = load_dataset("Multimodal-Fatima/FGVC_Aircraft_test")

print(raw_train_dataset)

Let's focus on a subset of airplane models and make sure the training dataset includes a representative set of airplanes. The training dataset should cover all airplane models present in the testing dataset.

def filter_by_values(record, filtering_values, filtering_field):
    return any(model in record[filtering_field] for model in filtering_values)

filtering_values = ["boeing 707","boeing 737","boeing 777","boeing 787"]
filtering_field = "clip_tags_ViT_L_14"

filtered_train_dataset = raw_train_dataset.filter(lambda x: filter_by_values(x, filtering_values, filtering_field))
filtered_test_dataset = raw_test_dataset.filter(lambda x: filter_by_values(x, filtering_values, filtering_field))

print(filtered_train_dataset)

print(filtered_test_dataset)

count = 0
for idx, row in enumerate(filtered_train_dataset["train"]):
    print(f"Row {idx + 1}: {row[filtering_field]}")
    count += 1
    if count == 10:
        break

count = 0
for idx, row in enumerate(filtered_test_dataset["test"]):
    print(f"Row {idx + 1}: {row[filtering_field]}")
    count += 1
    if count == 10:
        break
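
To double-check the claim above that every airplane model in filtering_values appears in both splits, a quick helper can be run (a sketch with a hypothetical helper models_present; it is not part of the original notebook and only reuses filtering_values and filtering_field defined above):

# Sketch: report which of the target models actually occur in each split
def models_present(split, values, field):
    found = set()
    for row in split:
        for value in values:
            if value in row[field]:  # row[field] is the list of CLIP tags
                found.add(value)
    return found

train_models = models_present(filtered_train_dataset["train"], filtering_values, filtering_field)
test_models = models_present(filtered_test_dataset["test"], filtering_values, filtering_field)
print("Covered in train:", sorted(train_models))
print("In test but missing from train:", sorted(test_models - train_models))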

3. Load Model and Processor

Let's load a pre-trained model Phi-3-vision-128k-instruct and its associated processor from HuggingFace's model hub. This model will be used for both initial inference and subsequent fine-tuning.

import json
import os
from pprint import pprint
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from huggingface_hub import notebook_login
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=dtype,
    device_map="auto",
    trust_remote_code=True,
    _attn_implementation='eager',
    quantization_config=bnb_config
)

processor = transformers.AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)

model = model.to(device)

print(model)

4. Inference with Base Model

Next let's perform inference using Phi-3-vision-128k-instruct to generate captions for airplane images. This involves creating an input prompt, generating a response from the model, and decoding that response into a human-readable caption. The input prompt is constructed with a predefined chat template that formats the input for the model. This inference serves as a baseline for comparing performance before and after fine-tuning.

id = 1

image = filtered_test_dataset["test"][id]["image"].convert("RGB")
description = ",".join(filtered_test_dataset["test"][id]['clip_tags_ViT_L_14'])

print(f"CHAT_TEMPLATE: \n{processor.tokenizer.chat_template}")

userPrompt = "Generate a concise caption for this image, mentioning specific airplane types"
messages = [
    {"role": "user", "content": f"<|image_1|>\n{userPrompt}"}
]

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(f"\nPROMPT: \n{prompt}")

inputs = processor(prompt, [image], return_tensors="pt").to(device)

generation_args = {
    "max_new_tokens": 512,
    "temperature": 0.0
}

generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

plt.imshow(image)
plt.axis("off")
plt.show()

print(f"\nIMAGE DESCRIPTION:\n{description}")
print(f"\nMODEL_RESPONSE: \n{response}")

The base model produces only a generic caption for the image; it struggles to identify the specific airplane type and other relevant details precisely.

5. Fine-tuning

5.1 Prepare Dataset

split_dataset = filtered_train_dataset["train"].train_test_split(test_size=0.2, seed=42)

train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]
test_dataset = filtered_test_dataset["test"]

columns_to_keep = ["image", "clip_tags_ViT_L_14"]

train_dataset = train_dataset.remove_columns([col for col in train_dataset.column_names if col not in columns_to_keep])
val_dataset = val_dataset.remove_columns([col for col in val_dataset.column_names if col not in columns_to_keep])
test_dataset = test_dataset.remove_columns([col for col in test_dataset.column_names if col not in columns_to_keep])

train_dataset

val_dataset

test_dataset

Code taken from https://huggingface.co./docs/transformers/main/en/tasks/image_captioning

def plot_images(images, captions):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        caption = captions[i]
        caption = "\n".join(wrap(",".join(caption), 30))
        plt.title(caption)
        plt.imshow(images[i])
        plt.axis("off")

sample_images_to_visualize = [np.array(train_dataset[i]["image"]) for i in range(5)]
sample_captions = [train_dataset[i]["clip_tags_ViT_L_14"] for i in range(5)]
plot_images(sample_images_to_visualize, sample_captions)

5.2 Setup DataCollator

Now let's define the data collator. It handles the batching and preprocessing of input data, and it prepares the input prompts and corresponding labels, ensuring they are correctly formatted and tokenized for the model. By customizing the DataCollator, we can efficiently manage complex data structures and input-output relationships required for our multi-modal fine-tuning task. This setup is crucial for creating consistent and coherent batches during training and evaluation.

class DataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        example = examples[0]

        image = example["image"]

        user_prompt = "Generate a concise caption for this image, mentioning specific airplane types"
        answer = ",".join(example["clip_tags_ViT_L_14"])

        messages = [
            {"role": "user", "content": f"<|image_1|>\n{user_prompt}"}
        ]

        prompt = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        answer = f"{answer}<|end|>\n<|endoftext|>"

        # Mask user_prompts for labels
        batch = self.processor(prompt, [image], return_tensors="pt")
        prompt_input_ids = batch["input_ids"]

        answer_input_ids = self.processor.tokenizer(answer, add_special_tokens=False, return_tensors="pt")["input_ids"]

        concatenated_input_ids = torch.cat([prompt_input_ids, answer_input_ids], dim=1)
        ignore_index = -100
        labels = torch.cat(
            [
                torch.tensor([ignore_index] * len(prompt_input_ids[0])).unsqueeze(0),
                answer_input_ids,
            ],
            dim=1,
        )

        batch["input_ids"] = concatenated_input_ids
        batch["labels"] = labels

        # Ensure only floating-point tensors require gradients
        for key, value in batch.items():
            if isinstance(value, torch.Tensor) and torch.is_floating_point(value):
                batch[key] = value.clone().detach().requires_grad_(True)

        return batch

data_collator = DataCollator(processor)

print(train_dataset[1])
type(train_dataset[1])

examples = [train_dataset[i] for i in range(5)]
collator_output = data_collator(examples)
print(collator_output.keys())
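
Since the training error reported further below involves mismatched tensor sizes, it can also help to inspect the shapes coming out of the collator here (a quick sanity check, not part of the original notebook): input_ids, attention_mask and labels should all share the same sequence length.

# Sanity check (sketch): print every tensor shape produced by the collator
out = data_collator([train_dataset[0]])
print({k: tuple(v.shape) for k, v in out.items() if isinstance(v, torch.Tensor)})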

5.3 Setup LoRA

Next, let's prepare the model for fine-tuning using LoRA (Low-Rank Adaptation). LoRA is a technique used to efficiently fine-tune transformer models by introducing trainable low-rank matrices into the layers of the model. This approach reduces the number of parameters that need to be updated during training, making the process more memory-efficient and faster.

We will apply LoRA specifically to the attention mechanisms and feed-forward operations of the model. To identify the relevant target modules, we will use the command model.state_dict().keys() to inspect the names of the modules within the model. The modules related to attention mechanisms and feed-forward operations are then specified in the LoRA configuration, ensuring that the fine-tuning process is focused on these components.

model.resize_token_embeddings(len(processor.tokenizer))
model.gradient_checkpointing_enable()

model = prepare_model_for_kbit_training(model)

model.state_dict().keys()
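
Because model.state_dict().keys() prints a very long list, one way to narrow it down to the attention and feed-forward entries is a small filter like the following (a sketch with a hypothetical candidate_modules variable; it only post-processes the printed keys):

# Sketch: derive candidate module names containing "self_attn" or "mlp"
candidate_modules = sorted({
    name.rsplit(".", 1)[0]  # drop the trailing ".weight" / ".bias" to get the module name
    for name in model.state_dict().keys()
    if "self_attn" in name or "mlp" in name
})
print(candidate_modules[:20])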

lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=[
        "self_attn.q_proj.weight",
        "self_attn.k_proj.weight",
        "self_attn.v_proj.weight",
        "self_attn.qkv_proj.weight",
        "self_attn.out_proj.weight",
        "mlp.gate_up_proj",
        "mlp.down_proj"
    ],
    lora_dropout=0.06,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=False
)

peft_model = get_peft_model(model, lora_config)

train_dataset.start_iteration = 0

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()

    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(peft_model)

5.4 Training

training_args = transformers.TrainingArguments(
    num_train_epochs=1,                      # Number of training epochs
    per_device_train_batch_size=batch_size,  # Batch size for training
    per_device_eval_batch_size=batch_size,   # Batch size for evaluation
    gradient_accumulation_steps=6,           # Number of steps to accumulate gradients before updating
    gradient_checkpointing=True,             # Enable gradient checkpointing to save memory
    do_eval=True,                            # Perform evaluation during training
    save_total_limit=2,                      # Limit the total number of saved checkpoints
    evaluation_strategy="steps",             # Evaluation strategy to use (here, at each specified number of steps)
    save_strategy="steps",                   # Save checkpoints at each specified number of steps
    save_steps=10,                           # Number of steps between each checkpoint save
    eval_steps=10,                           # Number of steps between each evaluation
    max_grad_norm=1,                         # Maximum gradient norm for clipping
    warmup_ratio=0.1,                        # Warmup ratio for learning rate schedule
    weight_decay=0.01,                       # Regularization technique to prevent overfitting
    # fp16=True,                             # Enable mixed precision training with fp16 (enable it if Ampere architecture is unavailable)
    bf16=True,                               # Enable mixed precision training with bf16
    logging_steps=10,                        # Number of steps between each log
    output_dir="outputs",                    # Directory to save the model outputs and checkpoints
    optim="adamw_torch",                     # Optimizer to use (AdamW with PyTorch)
    learning_rate=1e-4,                      # Learning rate for the optimizer
    lr_scheduler_type="constant",            # Learning rate scheduler type
    load_best_model_at_end=True,             # Load the best model found during training at the end
    metric_for_best_model="rouge",           # Metric used to determine the best model
    greater_is_better=True,                  # Indicates if a higher metric score is better
    push_to_hub=False,                       # Whether to push the model to the Hugging Face Hub
    run_name="phi-3-vision-finetuning",      # Name of the run for experiment tracking
    # report_to="wandb"                      # For experiment tracking (login to Weights & Biases needed)
)

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predicted = logits.argmax(-1)
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)

    decoded_labels = processor.batch_decode(labels, skip_special_tokens=True)
    decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True)
    rouge_scores = rouge.compute(predictions=decoded_predictions, references=decoded_labels)
    rouge1_score = rouge_scores["rouge1"]
    return {"rouge": rouge1_score}

predictions = ["A large commercial airplane with a blue tail and a red logo, possibly a Qantas Boeing 747, is taking off from an airport runway."]
references = ["tupolev sb,707,boeing 707,douglas dc-8,boeing 2707"]

rouge_score = rouge.compute(predictions=predictions, references=references)
print(rouge_score)

class CustomTrainer(transformers.Trainer):
    def get_train_dataloader(self):
        # Ensure the DataLoader uses your custom DataCollator
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
        )

    def get_eval_dataloader(self, eval_dataset=None):
        # Ensure the DataLoader uses your custom DataCollator for evaluation
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        return DataLoader(
            eval_dataset,
            batch_size=self.args.eval_batch_size,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
        )

    def compute_loss(self, model, inputs, return_outputs=False):
        print("Input shapes:", {k: v.shape for k, v in inputs.items()})
        outputs = model(**inputs)
        loss = outputs.loss if isinstance(outputs, dict) else outputs[0]
        return (loss, outputs) if return_outputs else loss

Ensure the model is in training mode

peft_model.train()

trainer = CustomTrainer(
    model=peft_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    # callbacks=[early_stopping]
)

peft_model.config.use_cache = False

trainer.train()

Input shapes: {'input_ids': torch.Size([1, 1972]), 'attention_mask': torch.Size([1, 1949]), 'pixel_values': torch.Size([1, 17, 3, 336, 336]), 'image_sizes': torch.Size([1, 2]), 'labels': torch.Size([1, 1972])}

RuntimeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

29 frames
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py in to_4d(self, attention_mask_2d, query_length, dtype, key_value_length)
137
138 if causal_4d_mask is not None:
--> 139 expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
140
141 # expanded_attn_mask + causal_4d_mask can cause some overflow

RuntimeError: The size of tensor a (1949) must match the size of tensor b (1972) at non-singleton dimension 3

I'm getting this error on trainer.train(). Any help would be appreciated.
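
For context (an observation, not a verified fix): in the shapes printed above, input_ids is 1972 tokens long while attention_mask is 1949, which matches the collator appending the answer tokens to input_ids without touching attention_mask. Assuming that is the cause, a minimal sketch of keeping the two in sync inside the collator's __call__ would be:

# Sketch (assumption: the mismatch comes from appending answer tokens to input_ids only)
prompt_attention_mask = batch["attention_mask"]            # (1, prompt_len), as returned by the processor
answer_attention_mask = torch.ones_like(answer_input_ids)  # attend to every appended answer token

batch["input_ids"] = torch.cat([prompt_input_ids, answer_input_ids], dim=1)
batch["attention_mask"] = torch.cat([prompt_attention_mask, answer_attention_mask], dim=1)
batch["labels"] = labels  # prompt positions are already masked with -100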
