<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/Crawl_a_Website.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [None]:
!pip install -q llama-index==0.10.57 llama-index-llms-gemini==0.1.11 openai==1.37.0 google-generativeai==0.5.4 newspaper3k==0.2.8



In [None]:
import os

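# Set the API keys used later in the notebook.
os.environ["OPENAI_API_KEY"] = "[OPENAI_API_KEY]"  # used by the OpenAI embedding model
os.environ["GOOGLE_API_KEY"] = "[GOOGLE_API_KEY]"  # used by the Gemini LLM
USESCRAPER_API_KEY = "[USESCRAPER_API_KEY]"  # used by the usescraper.com crawler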

There are two main ways to extract webpage content. With scraping, you start from a known list of URLs and iterate through it to retrieve the content of each page. With crawling, a script or service discovers the page URLs for you, either by reading the website's sitemap or by recursively following the links found on each page, until all the content has been reached. We first scrape a few pages with the `newspaper` library, then use a service, usescraper.com, to crawl a whole section of a website.
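To make the link-following idea concrete, here is a minimal sketch of the one step a link-based crawler repeats: collecting the links on a page and keeping only those on the same host. It uses the Python standard library only. The names `LinkCollector` and `extract_same_site_links` and the sample HTML are illustrative assumptions, not part of `newspaper` or usescraper.com.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_same_site_links(html, base_url):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" so each page is listed once.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up HTML snippet standing in for a downloaded page.
sample_html = """
<a href="/en/stable/understanding/indexing/indexing/">Indexing</a>
<a href="querying/querying/#top">Querying</a>
<a href="https://example.com/elsewhere">External</a>
"""
base = "https://docs.llamaindex.ai/en/stable/understanding/"
for link in extract_same_site_links(sample_html, base):
    print(link)
```

A full crawler would keep a queue of unvisited links and apply this step to each downloaded page. A hosted service takes care of that loop, along with rate limiting and JavaScript rendering.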


# 1. Scraping using `newspaper` Library


## Define URLs


In [None]:
urls = [
    "https://docs.llamaindex.ai/en/stable/understanding",
    "https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms/",
    "https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/",
    "https://docs.llamaindex.ai/en/stable/understanding/querying/querying/",
]

## Get Page Contents


In [None]:
import newspaper

pages_content = []

# Retrieve the Content
for url in urls:
    try:
        article = newspaper.Article(url)
        article.download()
        article.parse()
        if len(article.text) > 0:
            pages_content.append(
                {"url": url, "title": article.title, "text": article.text}
            )
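    except Exception as e:
        # Skip pages that fail to download or parse, but report which ones.
        print(f"Skipping {url}: {e}")
        continue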

In [None]:
pages_content[0]

{'url': 'https://docs.llamaindex.ai/en/stable/understanding',
 'title': 'Building an LLM Application',
 'text': "Building an LLM application#\n\nWelcome to the beginning of Understanding LlamaIndex. This is a series of short, bite-sized tutorials on every stage of building an LLM application to get you acquainted with how to use LlamaIndex before diving into more advanced and subtle strategies. If you're an experienced programmer new to LlamaIndex, this is the place to start.\n\nKey steps in building an LLM application#\n\nTip If you've already read our high-level concepts page you'll recognize several of these steps.\n\nThere are a series of key steps involved in building any LLM-powered application, whether it's answering questions about your data, creating a chatbot, or an autonomous agent. Throughout our documentation, you'll notice sections are arranged roughly in the order you'll perform these steps while building your app. You'll learn about:\n\nUsing LLMs : whether it's OpenAI 

In [None]:
len(pages_content)

5

## Convert to Document


In [None]:
from llama_index.core.schema import Document

# Convert the scraped pages to Document objects so the LlamaIndex framework can process them.
documents = [
    Document(text=row["text"], metadata={"title": row["title"], "url": row["url"]})
    for row in pages_content
]

# 2. Crawling using the usescraper.com Service


## Submit the Crawler Job


In [None]:
import requests
import json

payload = {
    "urls": [
        "https://docs.llamaindex.ai/en/stable/understanding/"
    ],  # list of urls to crawl
    "output_format": "markdown",  # text, html, markdown
    "output_expiry": 604800,  # Automatically delete after X seconds
    "min_length": 50,  # Skip pages with less than X characters
    "page_limit": 10000,  # Maximum number of pages to crawl
    "force_crawling_mode": "link",  # "link" follows links in the page recursively; "sitemap" finds pages from the website's sitemap
    "block_resources": True,  # skip loading images, stylesheets, or scripts
    "include_linked_files": False,  # include files (PDF, text, ...) in output
}
headers = {
    "Authorization": "Bearer " + USESCRAPER_API_KEY,
    "Content-Type": "application/json",
}

response = requests.request(
    "POST", "https://api.usescraper.com/crawler/jobs", json=payload, headers=headers
)

response = json.loads(response.text)

print(response)

{'org': '581', 'id': '7YE3T8VSPJVSCYE6EDQ90DJNFT', 'urls': ['https://docs.llamaindex.ai/en/stable/understanding/'], 'exclude_globs': [], 'exclude_elements': 'nav, header, footer, script, style, noscript, svg, [role="alert"], [role="banner"], [role="dialog"], [role="alertdialog"], [role="region"][aria-label*="skip" i], [aria-modal="true"]', 'output_format': 'markdown', 'output_expiry': 604800, 'min_length': 50, 'page_limit': 10000, 'force_crawling_mode': 'link', 'block_resources': True, 'include_linked_files': False, 'createdAt': 1713883978029, 'status': 'starting', 'use_browser': True, 'sitemapPageCount': 0, 'notices': []}


## Get the Status


In [None]:
url = "https://api.usescraper.com/crawler/jobs/{}".format(response["id"])

status_res = requests.request("GET", url, headers=headers)

status_res = json.loads(status_res.text)

print(status_res["status"])
print(status_res["progress"])

running
{'scraped': 9, 'discarded': 0, 'failed': 0}
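The status cell above reports the job state only once, so it has to be re-run until the crawl is finished. A small polling helper can automate that. The sketch below treats `starting` and `running` (the two states seen in the outputs above) as in-progress and returns on any other status. The helper name `wait_for_job`, its parameters, and the placeholder final status `"done"` are assumptions for illustration; the actual terminal status names should be checked against the usescraper.com documentation.

```python
import time


def wait_for_job(fetch_status, poll_seconds=5.0, max_polls=120):
    """Call fetch_status() until the job leaves the in-progress states.

    fetch_status is any zero-argument callable that returns the parsed
    status JSON (a dict with a "status" key).
    """
    in_progress = {"starting", "running"}  # states observed in the outputs above
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") not in in_progress:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("crawler job did not finish within the polling budget")


# Stand-in for the real API call so the sketch runs without network access.
# "done" is a placeholder, not a documented status value.
fake_responses = iter(
    [{"status": "starting"}, {"status": "running"}, {"status": "done"}]
)
final = wait_for_job(lambda: next(fake_responses), poll_seconds=0)
print(final["status"])
```

In the notebook, `fetch_status` could be a function that wraps the GET request from the status cell and returns the decoded JSON.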


## Get the Data


In [None]:
url = "https://api.usescraper.com/crawler/jobs/{}/data".format(response["id"])

data_res = requests.request("GET", url, headers=headers)

data_res = json.loads(data_res.text)

print(data_res)

In [None]:
print("URL:", data_res["data"][0]["meta"]["url"])
print("Title:", data_res["data"][0]["meta"]["meta"]["title"])
print("Content:", data_res["data"][0]["text"][0:500], "...")

URL: https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/graphs/
Title: Knowledge Graphs - LlamaIndex
Content:  
[ Skip to content ](https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/graphs/#knowledge-graphs)
#Knowledge Graphs[#](https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/graphs/#knowledge-graphs)
LlamaIndex contains some fantastic guides for building with knowledge graphs.

Check out the end-to-end tutorials/workshops below. Also check out our [knowledge graph query engine guides](https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_ ...


## Convert to Document


In [None]:
from llama_index.core.schema import Document

# Convert the crawled pages to Document objects so the LlamaIndex framework can process them.
documents = [
    Document(
        text=row["text"],
        metadata={"title": row["meta"]["meta"]["title"], "url": row["meta"]["url"]},
    )
    for row in data_res["data"]
]
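The list comprehension above raises a `KeyError` if any row lacks a title or URL. Whether the crawler can return such rows is not confirmed here, so the following is only a defensive sketch: it normalizes each row with fallbacks, then drops rows with empty text and repeated URLs. The helpers `row_to_fields` and `clean_rows` and the sample rows are hypothetical; only the nested `meta` layout is taken from the output shown earlier.

```python
def row_to_fields(row):
    """Extract text, title, and url from one result row, tolerating missing keys."""
    meta = row.get("meta") or {}
    page_meta = meta.get("meta") or {}
    return {
        "text": row.get("text") or "",
        "title": page_meta.get("title") or "",
        "url": meta.get("url") or "",
    }


def clean_rows(rows):
    """Normalize rows, dropping those without text and those with a repeated URL."""
    seen_urls = set()
    cleaned = []
    for row in rows:
        fields = row_to_fields(row)
        if not fields["text"].strip() or fields["url"] in seen_urls:
            continue
        seen_urls.add(fields["url"])
        cleaned.append(fields)
    return cleaned


# Hypothetical rows shaped like data_res["data"]; the values are made up.
sample_rows = [
    {"text": "Page A body", "meta": {"url": "https://example.com/a", "meta": {"title": "A"}}},
    {"text": "Page A body", "meta": {"url": "https://example.com/a", "meta": {"title": "A"}}},
    {"text": "", "meta": {"url": "https://example.com/b", "meta": {"title": "B"}}},
    {"text": "Page C body", "meta": {"url": "https://example.com/c", "meta": {}}},
]
for fields in clean_rows(sample_rows):
    print(fields)
```

Each resulting dictionary can then be passed to `Document(text=fields["text"], metadata={"title": fields["title"], "url": fields["url"]})`.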

# Create RAG Pipeline


In [None]:
from llama_index.llms.gemini import Gemini

llm = Gemini(model="models/gemini-1.5-flash", temperature=1, max_tokens=512)

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

In [None]:
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=30)

In [None]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

In [None]:
query_engine = index.as_query_engine()

In [None]:
res = query_engine.query("What is a query engine?")

In [None]:
res.response

'A query engine is a fundamental component used in querying processes. It is responsible for retrieving the most relevant documents from an index based on a query, postprocessing the retrieved nodes if needed, and then synthesizing a response by combining the query, relevant data, and prompt to be sent to the language model for generating an answer.'

In [None]:
# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("URL\t", src.metadata["url"])
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 081b6c8c-d9ea-4476-bac0-1008facd3db8
Title	 Querying - LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/querying/querying/
Score	 0.46212738505767387
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Node ID	 3786c195-c5de-4bba-98b6-996031349a88
Title	 Querying - LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/querying/querying/
Score	 0.43141762602042416
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
