<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/07-Finetune_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages and Setup Variables

In [17]:
!pip install -q llama-index==0.9.21 openai==1.6.0 cohere==4.39 html2text==2020.1.16 sentence_transformers==2.2.2

In [18]:
# Test with a few samples; processing the full dataset can be costly depending on its size.
# NOTE: Checkpoints are provided in the lesson, so there is no need to run the code on the full dataset.
testing = True

In [19]:
import os

# Set the "OPENAI_API_KEY" in the Python environment. Will be used by OpenAI client later.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"

# Load the Dataset (Webpages)

## Download

In [20]:
TRAIN_URLs = [
    "https://towardsai.net/p/machine-learning/metas-llama-2-revolutionizing-open-source-language-models-for-commercial-use",
    "https://towardsai.net/p/machine-learning/fine-tuning-a-llama-2-7b-model-for-python-code-generation",
    "https://towardsai.net/p/machine-learning/how-to-create-llama-2-chatbot-with-gradio-and-hugging-face-in-free-colab",
    "https://towardsai.net/p/machine-learning/meta-releases-llama-2-free-for-commercial-use",
    "https://towardsai.net/p/machine-learning/gpt-4-llama-2-claude-how-different-language-models-react-to-prompts",
    "https://towardsai.net/p/machine-learning/a-simple-hugging-face-guide-to-chatting-with-the-llama-2-7b-model-in-a-colab-notebook",
    "https://towardsai.net/p/machine-learning/fine-tuning-a-llama-2-7b-model-for-python-code-generation",
    "https://towardsai.net/p/machine-learning/llamaindex-last-version-from-basics-to-advanced-techniques-in-python-part-3",
    "https://towardsai.net/p/machine-learning/meta-releases-llama-will-it-fail-too",
    "https://towardsai.net/p/machine-learning/llama-by-meta-leaked-by-an-anonymous-forum-questions-arises-on-meta"
]
VALIDATION_URLs = [
    "https://towardsai.net/p/machine-learning/deep-diving-into-llama-2-meta-ai-new-open-source-foundation-model",
    "https://towardsai.net/p/machine-learning/gptq-quantization-on-a-llama-2-7b-fine-tuned-model-with-huggingface",
    "https://towardsai.net/p/machine-learning/powerinfer-11x-speed-up-llama-ii-inference-on-a-local-gpu",
    "https://towardsai.net/p/machine-learning/dense-x-retrieval-technique-in-langchain-and-llamaindex",
    "https://towardsai.net/p/machine-learning/exploring-large-language-models-part-2",
    "https://towardsai.net/p/machine-learning/inside-code-llama-meta-ais-entrance-in-the-code-llm-space",
    "https://towardsai.net/p/machine-learning/llamaindex-use-the-power-of-llms-on-your-data",
    "https://towardsai.net/p/l/inside-llama-meta-ai-new-large-language-model-that-outperforms-gpt-3-across-many-tasks"
]

## Read the Page

In [21]:
from llama_index.readers import SimpleWebPageReader

# Read the content of the webpages into lists. We need two sets of documents: one for training and one for validation.
TRAIN_DOCs = SimpleWebPageReader(html_to_text=True).load_data(TRAIN_URLs)
VALIDATION_DOCs = SimpleWebPageReader(html_to_text=True).load_data(VALIDATION_URLs)
print( len(TRAIN_DOCs), len(VALIDATION_DOCs) )

10 8


# Chunking

In [22]:
from llama_index.node_parser import SimpleNodeParser

# Define a parser to perform the chunking process.
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)

# Apply chunking on the training/validation sets.
TRAIN_NODEs = parser.get_nodes_from_documents(TRAIN_DOCs)
VALIDATION_NODEs = parser.get_nodes_from_documents(VALIDATION_DOCs)
print( len( TRAIN_NODEs ), len( VALIDATION_NODEs ) )

274 222
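To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is only an illustration: the helper `chunk_words` is hypothetical and splits on whitespace, whereas the real parser works on tokens rather than words and is more careful about where it cuts.

```python
def chunk_words(text, chunk_size=8, chunk_overlap=2):
    """Split text into windows of `chunk_size` words.

    Consecutive windows share `chunk_overlap` words, so content cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 20 placeholder words w0 .. w19.
sample = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(sample, chunk_size=8, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each printed chunk starts with the last two words of the previous one, which is the role the overlap plays in the parser call above.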


In [14]:
# Use a subset of the dataset (5 samples) if testing.
if testing:
  TRAIN_NODEs = TRAIN_NODEs[0:5]
  VALIDATION_NODEs = VALIDATION_NODEs[0:5]

# Generate Question

We use a Large Language Model (LLM) to generate questions for each chunk of the dataset. The resulting (question, chunk) pairs are then used to train the model so that its embeddings better match the kinds of questions users may ask.
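The dataset produced by this step is essentially three mappings, which the evaluation code in this notebook reads as `queries`, `corpus`, and `relevant_docs`. The sketch below builds the same shape with plain dictionaries. The helper `build_pairs` and the placeholder texts are hypothetical; in the notebook the questions come from the LLM, and the validation run shows 444 queries for 222 chunks, i.e. two questions per chunk.

```python
import uuid


def build_pairs(chunks, questions_per_chunk):
    """Link every generated question to the chunk it was written from."""
    corpus, queries, relevant_docs = {}, {}, {}
    for chunk_text, questions in zip(chunks, questions_per_chunk):
        chunk_id = str(uuid.uuid4())
        corpus[chunk_id] = chunk_text
        for question in questions:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # One relevant chunk per question: the chunk it was generated from.
            relevant_docs[query_id] = [chunk_id]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


# Placeholder chunks and questions (made up for illustration).
toy_chunks = ["Text of chunk A.", "Text of chunk B."]
toy_questions = [
    ["Question 1 about chunk A?", "Question 2 about chunk A?"],
    ["Question 1 about chunk B?", "Question 2 about chunk B?"],
]

toy_dataset = build_pairs(toy_chunks, toy_questions)
print(len(toy_dataset["corpus"]), "chunks,", len(toy_dataset["queries"]), "queries")
```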

In [3]:
# Use this block if you don't want to generate the questions yourself (this avoids API charges).
# Uncomment the following code and comment out everything in the next code cell.

# from llama_index.finetuning import EmbeddingQAFinetuneDataset

# # Load the pre-generated questions json files.
# TRAIN_DATASET = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
# VALIDATION_DATASET = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

In [17]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.llms import OpenAI

# Load the OpenAI API with the "gpt-3.5-turbo" model
llm = OpenAI()

# Generate questions for each chunk.
TRAIN_DATASET = generate_qa_embedding_pairs(TRAIN_NODEs, llm=llm)
VALIDATION_DATASET = generate_qa_embedding_pairs(VALIDATION_NODEs, llm=llm)

TRAIN_DATASET.save_json("train_dataset.json")
VALIDATION_DATASET.save_json("val_dataset.json")

100%|██████████| 273/273 [05:15<00:00,  1.16s/it]
100%|██████████| 222/222 [04:12<00:00,  1.14s/it]


# Load an Embedding Model

In [4]:
from llama_index.embeddings import resolve_embed_model

# Load an existing embedding model. A linear adapter layer will be trained on top of it.
base_embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co./settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [5]:
from llama_index.finetuning import EmbeddingAdapterFinetuneEngine
import torch

# Finetune the adapter
finetune_engine = EmbeddingAdapterFinetuneEngine(
    TRAIN_DATASET,
    base_embed_model,
    model_output_path="model_output_test",
    epochs=4,
    verbose=True,
)

In [6]:
# Initiate the Finetuning process
finetune_engine.finetune()

> Prepared optimizer, scheduler, and loss model.

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/55 [00:00<?, ?it/s]

> [Epoch 0] Current loss: 1.6320068836212158
> [Epoch 0] Current loss: 1.756566047668457
> [Epoch 0] Current loss: 1.974733591079712
> [Epoch 0] Current loss: 0.904658317565918
> [Epoch 0] Current loss: 1.156891942024231
> [Epoch 0] Current loss: 1.5102888345718384
> [Epoch 0] Current loss: 1.4786968231201172
> [Epoch 0] Current loss: 1.289185643196106
> [Epoch 0] Current loss: 1.3490561246871948
> [Epoch 0] Current loss: 1.3077905178070068
> [Epoch 0] Current loss: 0.9143151044845581
> [Epoch 0] Current loss: 1.1586413383483887
> [Epoch 0] Current loss: 1.6070935726165771
> [Epoch 0] Current loss: 1.464544415473938
> [Epoch 0] Current loss: 1.2603508234024048
> [Epoch 0] Current loss: 1.0440889596939087
> [Epoch 0] Current loss: 1.147850751876831
> [Epoch 0]

Iteration:   0%|          | 0/55 [00:00<?, ?it/s]

> [Epoch 1] Current loss: 1.5881574153900146
> [Epoch 1] Current loss: 1.7358734607696533
> [Epoch 1] Current loss: 1.9496583938598633
> [Epoch 1] Current loss: 0.8745635747909546
> [Epoch 1] Current loss: 1.111528754234314
> [Epoch 1] Current loss: 1.4749138355255127
> [Epoch 1] Current loss: 1.4362574815750122
> [Epoch 1] Current loss: 1.2734665870666504
> [Epoch 1] Current loss: 1.3329273462295532
> [Epoch 1] Current loss: 1.286020040512085
> [Epoch 1] Current loss: 0.880274772644043
> [Epoch 1] Current loss: 1.1097452640533447
> [Epoch 1] Current loss: 1.5711208581924438
> [Epoch 1] Current loss: 1.4261044263839722
> [Epoch 1] Current loss: 1.2453088760375977
> [Epoch 1] Current loss: 1.0061759948730469
> [Epoch 1] Current loss: 1.0995166301727295
> [Epoch

Iteration:   0%|          | 0/55 [00:00<?, ?it/s]

> [Epoch 2] Current loss: 1.550796389579773
> [Epoch 2] Current loss: 1.7150386571884155
> [Epoch 2] Current loss: 1.9233592748641968
> [Epoch 2] Current loss: 0.8541108965873718
> [Epoch 2] Current loss: 1.0777015686035156
> [Epoch 2] Current loss: 1.4487472772598267
> [Epoch 2] Current loss: 1.4066628217697144
> [Epoch 2] Current loss: 1.2605724334716797
> [Epoch 2] Current loss: 1.3200373649597168
> [Epoch 2] Current loss: 1.2688066959381104
> [Epoch 2] Current loss: 0.8603048324584961
> [Epoch 2] Current loss: 1.0778127908706665
> [Epoch 2] Current loss: 1.5469229221343994
> [Epoch 2] Current loss: 1.4023314714431763
> [Epoch 2] Current loss: 1.235028862953186
> [Epoch 2] Current loss: 0.9840642809867859
> [Epoch 2] Current loss: 1.0698928833007812
> [Epoc

Iteration:   0%|          | 0/55 [00:00<?, ?it/s]

> [Epoch 3] Current loss: 1.5297434329986572
> [Epoch 3] Current loss: 1.7028414011001587
> [Epoch 3] Current loss: 1.9069995880126953
> [Epoch 3] Current loss: 0.8438453674316406
> [Epoch 3] Current loss: 1.0594712495803833
> [Epoch 3] Current loss: 1.434706449508667
> [Epoch 3] Current loss: 1.3906594514846802
> [Epoch 3] Current loss: 1.2529596090316772
> [Epoch 3] Current loss: 1.3124738931655884
> [Epoch 3] Current loss: 1.2587229013442993
> [Epoch 3] Current loss: 0.8505775332450867
> [Epoch 3] Current loss: 1.0611050128936768
> [Epoch 3] Current loss: 1.533494472503662
> [Epoch 3] Current loss: 1.3900038003921509
> [Epoch 3] Current loss: 1.229201078414917
> [Epoch 3] Current loss: 0.9730359315872192
> [Epoch 3] Current loss: 1.0548118352890015
> [Epoch

In [7]:
embed_model = finetune_engine.get_finetuned_model()

# Or, import model from the directory whenever required.
# from llama_index.embeddings import LinearAdapterEmbeddingModel
# embed_model = LinearAdapterEmbeddingModel(base_embed_model, "model_output_test")

In [8]:
embed_model

AdapterEmbeddingModel(model_name='Adapter for BAAI/bge-small-en-v1.5', embed_batch_size=10, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7ea3a4ccc460>)
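Conceptually, a linear adapter keeps the base embedding model frozen and learns one matrix (plus a bias) that is applied to the vectors the base model produces. The sketch below shows that transformation on a toy 3-dimensional vector in plain Python. It is a toy illustration under stated assumptions, not the code the fine-tuning engine runs: `apply_linear_adapter` is a hypothetical helper and all weights are made up.

```python
def apply_linear_adapter(embedding, weight, bias):
    """Return weight @ embedding + bias for a single vector (lists of floats)."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]


base_vector = [0.2, -0.5, 0.1]  # stand-in for a base-model embedding
zero_bias = [0.0, 0.0, 0.0]

# An identity matrix leaves the embedding unchanged.
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
unchanged = apply_linear_adapter(base_vector, identity, zero_bias)

# Made-up "trained" weights: the adapter mixes and rescales dimensions.
trained_weight = [
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
adapted = apply_linear_adapter(base_vector, trained_weight, zero_bias)

print("base:   ", base_vector)
print("adapted:", adapted)
```

Because only the small matrix is learned, training is cheap compared with updating every weight of the base model.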

# Evaluate

## Define the Evaluation Functions

Hit-rate metric: For each (query, context) pair, we retrieve the top-k documents for the query. It’s a hit if the results contain the ground-truth context. The hit rate is the fraction of queries that are hits.
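As a small worked example of the metric, the sketch below computes the hit rate for four made-up queries with three retrieved chunks each. The record keys (`expected`, `retrieved`) mirror the ones used in this notebook's evaluation results; the IDs themselves and the helper name `hit_rate` are hypothetical.

```python
def hit_rate(results):
    """Fraction of queries whose expected chunk ID is among the retrieved IDs."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r["expected"] in r["retrieved"])
    return hits / len(results)


# Made-up retrieval results for four queries (top_k = 3).
toy_results = [
    {"query": "q1", "expected": "c1", "retrieved": ["c1", "c7", "c3"]},  # hit
    {"query": "q2", "expected": "c2", "retrieved": ["c9", "c2", "c4"]},  # hit
    {"query": "q3", "expected": "c5", "retrieved": ["c1", "c2", "c3"]},  # miss
    {"query": "q4", "expected": "c8", "retrieved": ["c8", "c6", "c0"]},  # hit
]

print(hit_rate(toy_results))  # 3 hits out of 4 queries -> 0.75
```

Note that the position of the expected chunk inside the top-k list does not matter for this metric; only its presence does.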

In [9]:
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.schema import TextNode
from tqdm.notebook import tqdm

def evaluate( dataset, embed_model, top_k=5, verbose=False):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Wrap each corpus chunk in a node and build an index, generating embeddings with the given model
    service_context = ServiceContext.from_defaults(embed_model=embed_model)
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, service_context=service_context, show_progress=True
    )

    # Define a retriever that returns the top-k most similar chunks for a question
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    # For each question, check whether the chunk that contains the answer is among the retrieved chunks.
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results

## OpenAI

In [10]:
from llama_index.embeddings import OpenAIEmbedding

# Load the OpenAI Ada model and evaluate it.
ada = OpenAIEmbedding()
ada_val_results = evaluate(VALIDATION_DATASET, ada)

Generating embeddings:   0%|          | 0/222 [00:00<?, ?it/s]

  0%|          | 0/444 [00:00<?, ?it/s]

In [11]:
import pandas as pd

df_ada = pd.DataFrame(ada_val_results)
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada

0.5135135135135135

## BAAI Model

In [13]:
# Load the Base model without fine-tuning
base_embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")
bge_val_results = evaluate(VALIDATION_DATASET, base_embed_model)


Generating embeddings:   0%|          | 0/222 [00:00<?, ?it/s]

  0%|          | 0/444 [00:00<?, ?it/s]

In [14]:
df_bge = pd.DataFrame(bge_val_results)
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge

0.5

## FineTuned

In [15]:
from llama_index.embeddings import LinearAdapterEmbeddingModel

# Load the Fine-tuned model.
embed_model = LinearAdapterEmbeddingModel(base_embed_model, "model_output_test")

val_results_finetuned = evaluate(VALIDATION_DATASET, embed_model)

Generating embeddings:   0%|          | 0/222 [00:00<?, ?it/s]

  0%|          | 0/444 [00:00<?, ?it/s]

In [16]:
df_finetuned = pd.DataFrame(val_results_finetuned)
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

0.536036036036036
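To read the three results side by side, the sketch below collects the hit rates printed above and converts them back to hit counts. The values are copied from this notebook's outputs (444 validation queries, default `top_k=5`); the labels and variable names are only illustrative.

```python
# Hit rates copied from the outputs above.
hit_rates = {
    "OpenAI (ada)": 0.5135135135135135,
    "BAAI/bge-small-en-v1.5 (base)": 0.5,
    "BAAI/bge-small-en-v1.5 + adapter": 0.536036036036036,
}
NUM_QUERIES = 444  # number of validation queries shown in the progress bars

baseline = hit_rates["BAAI/bge-small-en-v1.5 (base)"]
rows = []
for name, rate in hit_rates.items():
    hits = round(rate * NUM_QUERIES)      # recover the number of hits
    gain_pts = (rate - baseline) * 100    # percentage points vs the base model
    rows.append((name, hits, rate, gain_pts))
    print(f"{name:<34} {hits:>3}/{NUM_QUERIES}  {rate:.3f}  {gain_pts:+.1f} pts vs base")

best_name = max(hit_rates, key=hit_rates.get)
print("Best:", best_name)
```

In this run the adapter lifts the base model from 222 to 238 hits out of 444, a gain of about 3.6 percentage points, which also puts it ahead of the OpenAI baseline (228 hits).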