<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/03-RAG_with_LlamaIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Install Packages and Set Up Variables


In [None]:
!pip install -q llama-index==0.10.57 llama-index-llms-gemini==0.1.11 openai==1.37.0 google-generativeai==0.7.2

In [1]:
import os
import time
from IPython.display import Markdown, display

# Set the following API keys as environment variables; they are used later in the notebook.
# We use OpenAI for the embedding model and Gemini 1.5 Flash as the LLM.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"
os.environ["GOOGLE_API_KEY"] = "<YOUR_GOOGLE_KEY>"

# Load Dataset


## Download


The dataset is a subset of the documentation for the LlamaIndex library.


In [2]:
!curl -o ./llama_index_150k.jsonl https://huggingface.co/datasets/towardsai-buster/llama-index-docs/raw/main/llama_index_data_150k.jsonl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  570k  100  570k    0     0  3407k      0 --:--:-- --:--:-- --:--:-- 3417k


## Read File and Create LlamaIndex Documents


In [3]:
from llama_index.core import Document
import json


def create_docs(input_file: str) -> list[Document]:
    with open(input_file, "r") as f:
        documents = []
        for line in f:
            data = json.loads(line)

            documents.append(
                Document(
                    doc_id=data["doc_id"],
                    text=data["content"],
                    metadata={  # type: ignore
                        "url": data["url"],
                        "title": data["name"],
                        "tokens": data["tokens"],
                        "source": data["source"],
                    },
                    excluded_llm_metadata_keys=[
                        "title",
                        "tokens",
                        "source",
                    ],
                    excluded_embed_metadata_keys=[
                        "url",
                        "tokens",
                        "source",
                    ],
                )
            )
    return documents
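
The two exclusion lists in `create_docs` decide which metadata fields accompany the text when it is sent to the LLM and when it is embedded. As a minimal sketch in plain Python (not a LlamaIndex call; `visible_keys` is a hypothetical helper introduced only for illustration), the effect of the lists can be worked out like this:

```python
# Metadata keys and exclusion lists copied from create_docs above.
METADATA_KEYS = ["url", "title", "tokens", "source"]
EXCLUDED_LLM_KEYS = ["title", "tokens", "source"]
EXCLUDED_EMBED_KEYS = ["url", "tokens", "source"]


def visible_keys(all_keys: list[str], excluded: list[str]) -> list[str]:
    """Return the metadata keys that remain after applying an exclusion list."""
    return [key for key in all_keys if key not in excluded]


llm_visible = visible_keys(METADATA_KEYS, EXCLUDED_LLM_KEYS)
embed_visible = visible_keys(METADATA_KEYS, EXCLUDED_EMBED_KEYS)
print("LLM sees:", llm_visible)
print("Embedding model sees:", embed_visible)
```

Under this configuration only the `url` reaches the LLM and only the `title` is embedded along with the text.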

# Generate Embeddings


In [7]:
# Convert the texts to Document objects so the LlamaIndex framework can process them.
# (Document was already imported in the cell that defines create_docs.)
documents = create_docs("llama_index_150k.jsonl")
print("Number of documents:", len(documents))

Number of documents: 56


In [5]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding


# Build index / generate embeddings using OpenAI embedding model
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=400)],
    show_progress=True,
)

Parsing nodes: 100%|██████████| 56/56 [00:00<00:00, 181.20it/s]
Generating embeddings: 100%|██████████| 375/375 [00:05<00:00, 74.36it/s]
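
The 56 documents become 375 nodes because each document is split into overlapping chunks (`chunk_size=800`, `chunk_overlap=400`). The sketch below shows only the overlap idea with a fixed sliding window over a token list. It is a simplification: `SentenceSplitter` also tries to respect sentence boundaries and counts tokens with a real tokenizer, so its chunks will not match this exactly. `sliding_chunks` is a hypothetical helper, not part of LlamaIndex.

```python
def sliding_chunks(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# A made-up 2000-token document, split with the same sizes as the notebook.
tokens = [f"t{i}" for i in range(2000)]
chunks = sliding_chunks(tokens, chunk_size=800, chunk_overlap=400)
print(len(chunks), [len(c) for c in chunks])
```

With an overlap of half the chunk size, every token (except near the document edges) appears in two chunks, which roughly doubles the number of embeddings compared to non-overlapping chunks.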


# Query Dataset


In [8]:
# Define a query engine that retrieves the most relevant pieces of text
# and uses an LLM to formulate the final answer.

from llama_index.llms.gemini import Gemini

llm = Gemini(model="models/gemini-1.5-flash", temperature=1, max_tokens=1000)

query_engine = index.as_query_engine(llm=llm, similarity_top_k=10)
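
`similarity_top_k=10` asks the retriever for the ten chunks whose embeddings are most similar to the query embedding. A minimal sketch of that ranking step, assuming cosine similarity and using made-up three-dimensional vectors (real embeddings have many more dimensions; `cosine_similarity` and `top_k` are hypothetical helpers, not LlamaIndex functions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], chunks: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k chunk names with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings for three chunks and one query.
chunk_vectors = {
    "query_engine_guide": [0.9, 0.1, 0.0],
    "agent_guide": [0.1, 0.9, 0.0],
    "install_notes": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.2, 0.0]
results = top_k(query_vector, chunk_vectors, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```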



In [9]:
start = time.time()
response = query_engine.query("How to setup a query engine in code?")
end = time.time()
display(Markdown(response.response))
print("time taken: ", end - start)



To set up a query engine in code, first create an index from your documents. Then, use the index to create a query engine. You can then query the query engine using the `query` method. 


time taken:  3.4835610389709473


In [10]:
start = time.time()
response = query_engine.query("How to setup an agent in code?")
end = time.time()
display(Markdown(response.response))
print("time taken: ", end - start)

An agent can be set up in code by defining a set of tools and providing them to a `ReActAgent` implementation.


time taken:  3.3619420528411865
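
The `start`/`end` pattern is repeated for every timed call in this notebook. One possible way to factor it out is a small context manager; this is a sketch, and `timed` is a hypothetical helper name. The `time.sleep` call stands in for a real `query_engine.query(...)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str = "time taken"):
    """Time the body of a with-block and print the elapsed wall-clock seconds."""
    result = {"seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f}s")


# Stand-in workload; in the notebook this would be query_engine.query(...).
with timed() as elapsed:
    time.sleep(0.05)
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring short durations.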


# Set Up Long-Context Caching


For this section, we call the Gemini API directly through the `google-generativeai` SDK instead of going through LlamaIndex.


In [11]:
# Import the Python SDK
import google.generativeai as genai
from google.generativeai import caching
from google.generativeai import GenerationConfig

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

Convert the JSONL file to a plain-text file that can be uploaded to the Gemini API.

In [12]:
import json


def create_text_file(input_file: str, output_file: str) -> None:
    with open(input_file, "r") as f, open(output_file, "w") as out:
        for line in f:
            data = json.loads(line)
            out.write(data["content"] + "\n\n")  # Add two newlines between documents

    print(f"Contents saved to {output_file}")


create_text_file("llama_index_150k.jsonl", "llama_index_contents.txt")

Contents saved to llama_index_contents.txt
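
Before caching the whole file as context, it can be useful to know roughly how many tokens it holds. Each record in the dataset carries a `tokens` field (it is read in `create_docs`), so a rough total can be obtained by summing it. The sketch below assumes that field is numeric; `total_tokens` is a hypothetical helper, and the demo runs on two made-up records rather than the real file.

```python
import json
import os
import tempfile


def total_tokens(jsonl_path: str) -> int:
    """Sum the per-record "tokens" field of a JSONL file."""
    total = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                total += json.loads(line)["tokens"]
    return total


# Demo on two made-up records written to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(json.dumps({"content": "a", "tokens": 120}) + "\n")
    tmp.write(json.dumps({"content": "b", "tokens": 80}) + "\n")

demo_total = total_tokens(tmp.name)
os.remove(tmp.name)
print("Total tokens:", demo_total)
```

With the notebook's data, the call would be `total_tokens("llama_index_150k.jsonl")`.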


In [None]:
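# Upload the text file through the Gemini File API.
document = genai.upload_file(path="llama_index_contents.txt")
model_name = "gemini-1.5-flash-001"

# Cache the uploaded document together with the system instruction,
# so later requests can reuse it instead of resending the full text.
cache = genai.caching.CachedContent.create(
    model=model_name,
    system_instruction="You answer questions about the LlamaIndex framework.",
    contents=[document],
)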

In [14]:
model = genai.GenerativeModel.from_cached_content(cache)
start = time.time()
response = model.generate_content(
    "How to setup a query engine in code?",
    generation_config=GenerationConfig(max_output_tokens=1000),
)
end = time.time()
display(Markdown(response.text))
print("time taken: ", end - start)

Here's a breakdown of how to set up a query engine in LlamaIndex, along with different methods and explanations:

**1.  The Most Common Approach: Using an Index**

   The simplest way to get a `QueryEngine` is to leverage an existing `Index` object. Each index type in LlamaIndex has an `as_query_engine()` method that creates a specialized engine for that index:

   ```python
   from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

   # Load your data
   documents = SimpleDirectoryReader("data").load_data() 

   # Create a VectorStoreIndex
   index = VectorStoreIndex.from_documents(documents) 

   # Get a query engine
   query_engine = index.as_query_engine() 

   # Now you can use the query engine to ask questions
   response = query_engine.query("What is the main point of this document?")
   print(response)
   ```

**2.  Customization Through Composition: Advanced Query Engines**

   For fine-grained control, you can build a `QueryEngine` from its component parts using the `RetrieverQueryEngine`:

   ```python
   from llama_index.core import VectorStoreIndex, get_response_synthesizer
   from llama_index.core.retrievers import VectorIndexRetriever
   from llama_index.core.query_engine import RetrieverQueryEngine
   from llama_index.core.postprocessor import SimilarityPostprocessor

   # Build your index (as above)
   index = VectorStoreIndex.from_documents(documents) 

   # Configure the retriever
   retriever = VectorIndexRetriever(
       index=index,
       similarity_top_k=10, 
   )

   # Configure the response synthesizer (the core LLM)
   response_synthesizer = get_response_synthesizer()

   # Assemble the query engine
   query_engine = RetrieverQueryEngine(
       retriever=retriever,
       response_synthesizer=response_synthesizer,
       node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
   )

   # Query the engine
   response = query_engine.query("What are the key takeaways from this data?")
   print(response)
   ```

**Key Components and Customization:**

* **Retrieval:**  How your engine finds relevant information from the index (e.g., top-k semantic search, keyword matching, etc.).
* **Postprocessing:**  Additional steps to refine the retrieved results (e.g., reranking, filtering based on metadata, etc.).
* **Response Synthesis:** The LLM used to generate the final response (e.g., OpenAI's GPT-3.5, a local model, etc.).
* **Prompt Engineering:**  Crafting effective prompts to guide your LLM in synthesizing a meaningful answer.

**Types of Query Engines:**

* **RetrieverQueryEngine:** Combines a retriever and response synthesizer for standard question answering.
* **SubQuestionQueryEngine:** Decomposes a complex query into sub-queries, especially suited for multi-document analysis and compare/contrast scenarios.
* **RouterQueryEngine:** Routes a query to the most appropriate index or data source, especially helpful when you have a heterogeneous collection of information.

**Choosing the Right Approach:**

* For straightforward scenarios, using an index's `as_query_engine()` method is the easiest option.
* When you need finer control over retrieval, postprocessing, or the LLM used, create a `RetrieverQueryEngine` and customize its components.

Let me know if you'd like to see a specific type of query engine setup or have more advanced use cases in mind! 


time taken:  32.33650302886963
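
The timings printed in this notebook can be compared directly. The snippet below only does the arithmetic on the values copied from the cell outputs above. Treat the ratio as a rough indication from single runs: the long-context answer is also much longer than the RAG answers, so part of the gap is likely generation time.

```python
# Wall-clock timings copied from the cell outputs in this notebook (seconds).
rag_query_engine = 3.4835610389709473    # RAG, "query engine" question
rag_agent = 3.3619420528411865           # RAG, "agent" question
cached_long_context = 32.33650302886963  # cached long context, "query engine" question

# Compare the two runs that asked the same "query engine" question.
slowdown = cached_long_context / rag_query_engine
print(f"Cached long-context call took {slowdown:.1f}x as long as the RAG call")
```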
