{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🚀 Crawl4AI v0.3.72 Release Announcement\n", "\n", "Welcome to the new release of **Crawl4AI v0.3.72**! This notebook highlights the latest features and demonstrates how they work in real-time. Follow along to see each feature in action!\n", "\n", "### What’s New?\n", "- ✨ `Fit Markdown`: Extracts only the main content from articles and blogs\n", "- 🛡️ **Magic Mode**: Comprehensive anti-bot detection bypass\n", "- 🌐 **Multi-browser support**: Switch between Chromium, Firefox, WebKit\n", "- 🔍 **Knowledge Graph Extraction**: Generate structured graphs of entities & relationships from any URL\n", "- 🤖 **Crawl4AI GPT Assistant**: Chat directly with our AI assistant for help, code generation, and faster learning (available [here](https://tinyurl.com/your-gpt-assistant-link))\n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 📥 Setup\n", "To start, we'll install `Crawl4AI` along with Playwright and `nest_asyncio` to ensure compatibility with Colab’s asynchronous environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install Crawl4AI and dependencies\n", "!pip install crawl4ai\n", "!playwright install\n", "!pip install nest_asyncio" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import nest_asyncio and apply it to allow asyncio in Colab\n", "import nest_asyncio\n", "nest_asyncio.apply()\n", "\n", "print('Setup complete!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## ✨ Feature 1: `Fit Markdown`\n", "Extracts only the main content from articles and blog pages, removing sidebars, ads, and other distractions.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "from crawl4ai import AsyncWebCrawler\n", "\n", "async def fit_markdown_demo():\n", " async with AsyncWebCrawler() as crawler:\n", " result = await crawler.arun(url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\")\n", " print(result.fit_markdown) # Shows main content in Markdown format\n", "\n", "# Run the demo\n", "await fit_markdown_demo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## 🛡️ Feature 2: Magic Mode\n", "Magic Mode bypasses anti-bot detection to make crawling more reliable on protected websites.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "async def magic_mode_demo():\n", " async with AsyncWebCrawler() as crawler: # Enables anti-bot detection bypass\n", " result = await crawler.arun(\n", " url=\"https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/\",\n", " magic=True # Enables magic mode\n", " )\n", " print(result.markdown) # Shows the full content in Markdown format\n", "\n", "# Run the demo\n", "await magic_mode_demo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## 🌐 Feature 3: Multi-Browser Support\n", "Crawl4AI now supports Chromium, Firefox, and WebKit. Here’s how to specify Firefox for a crawl.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "async def multi_browser_demo():\n", " async with AsyncWebCrawler(browser_type=\"firefox\") as crawler: # Using Firefox instead of default Chromium\n", " result = await crawler.arun(url=\"https://crawl4i.com\")\n", " print(result.markdown) # Shows content extracted using Firefox\n", "\n", "# Run the demo\n", "await multi_browser_demo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## ✨ Put them all together\n", "\n", "Let's combine all the features to extract the main content from a blog post, bypass anti-bot detection, and generate a knowledge graph from the extracted content." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from crawl4ai.extraction_strategy import LLMExtractionStrategy\n", "from pydantic import BaseModel\n", "import json, os\n", "from typing import List\n", "\n", "# Define classes for the knowledge graph structure\n", "class Landmark(BaseModel):\n", " name: str\n", " description: str\n", " activities: list[str] # E.g., visiting, sightseeing, relaxing\n", "\n", "class City(BaseModel):\n", " name: str\n", " description: str\n", " landmarks: list[Landmark]\n", " cultural_highlights: list[str] # E.g., food, music, traditional crafts\n", "\n", "class TravelKnowledgeGraph(BaseModel):\n", " cities: list[City] # Central Mexican cities to visit\n", "\n", "async def combined_demo():\n", " # Define the knowledge graph extraction strategy\n", " strategy = LLMExtractionStrategy(\n", " # provider=\"ollama/nemotron\",\n", " provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models\n", " pi_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass \"no-token\"\n", " schema=TravelKnowledgeGraph.schema(),\n", " instruction=(\n", " \"Extract cities, landmarks, and cultural highlights for places to visit in Central Mexico. \"\n", " \"For each city, list main landmarks with descriptions and activities, as well as cultural highlights.\"\n", " )\n", " )\n", "\n", " # Set up the AsyncWebCrawler with multi-browser support, Magic Mode, and Fit Markdown\n", " async with AsyncWebCrawler(browser_type=\"firefox\") as crawler:\n", " result = await crawler.arun(\n", " url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\",\n", " extraction_strategy=strategy,\n", " bypass_cache=True,\n", " magic=True\n", " )\n", " \n", " # Display main article content in Fit Markdown format\n", " print(\"Extracted Main Content:\\n\", result.fit_markdown)\n", " \n", " # Display extracted knowledge graph of cities, landmarks, and cultural highlights\n", " if result.extracted_content:\n", " travel_graph = json.loads(result.extracted_content)\n", " print(\"\\nExtracted Knowledge Graph:\\n\", json.dumps(travel_graph, indent=2))\n", "\n", "# Run the combined demo\n", "await combined_demo()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## 🤖 Crawl4AI GPT Assistant\n", "Chat with the Crawl4AI GPT Assistant for code generation, support, and learning Crawl4AI faster. Try it out [here](https://tinyurl.com/crawl4ai-gpt)!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.9" } }, "nbformat": 4, "nbformat_minor": 2 }