{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🚀 Crawl4AI v0.3.72 Release Announcement\n",
    "\n",
    "Welcome to the new release of **Crawl4AI v0.3.72**! This notebook highlights the latest features with runnable examples. Follow along to see each one in action!\n",
    "\n",
    "### What’s New?\n",
    "- ✨ `Fit Markdown`: Extracts only the main content from articles and blogs\n",
    "- 🛡️ **Magic Mode**: Comprehensive anti-bot detection bypass\n",
    "- 🌐 **Multi-browser support**: Switch between Chromium, Firefox, WebKit\n",
    "- 🔍 **Knowledge Graph Extraction**: Generate structured graphs of entities & relationships from any URL\n",
    "- 🤖 **Crawl4AI GPT Assistant**: Chat directly with our AI assistant for help, code generation, and faster learning (available [here](https://tinyurl.com/crawl4ai-gpt))\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📥 Setup\n",
    "To start, we'll install `Crawl4AI` and the Playwright browsers, plus `nest_asyncio`, which lets asyncio code run inside Colab’s already-running event loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install Crawl4AI and dependencies\n",
    "!pip install crawl4ai\n",
    "!playwright install\n",
    "!pip install nest_asyncio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import nest_asyncio and apply it to allow asyncio in Colab\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "\n",
    "print('Setup complete!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## ✨ Feature 1: `Fit Markdown`\n",
    "Extracts only the main content from articles and blog pages, removing sidebars, ads, and other distractions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "from crawl4ai import AsyncWebCrawler\n",
    "\n",
    "async def fit_markdown_demo():\n",
    "    async with AsyncWebCrawler() as crawler:\n",
    "        result = await crawler.arun(url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\")\n",
    "        print(result.fit_markdown)  # Shows main content in Markdown format\n",
    "\n",
    "# Run the demo\n",
    "await fit_markdown_demo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 🛡️ Feature 2: Magic Mode\n",
    "Magic Mode bypasses anti-bot detection to make crawling more reliable on protected websites.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def magic_mode_demo():\n",
    "    async with AsyncWebCrawler() as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/\",\n",
    "            magic=True  # Enables magic mode\n",
    "        )\n",
    "        print(result.markdown)  # Shows the full content in Markdown format\n",
    "\n",
    "# Run the demo\n",
    "await magic_mode_demo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 🌐 Feature 3: Multi-Browser Support\n",
    "Crawl4AI now supports Chromium, Firefox, and WebKit. Here’s how to specify Firefox for a crawl.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def multi_browser_demo():\n",
    "    async with AsyncWebCrawler(browser_type=\"firefox\") as crawler:  # Using Firefox instead of default Chromium\n",
    "        result = await crawler.arun(url=\"https://crawl4ai.com\")\n",
    "        print(result.markdown)  # Shows content extracted using Firefox\n",
    "\n",
    "# Run the demo\n",
    "await multi_browser_demo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## ✨ Putting It All Together\n",
    "\n",
    "Let's combine the features above: crawl a blog post with Firefox, bypass anti-bot detection with Magic Mode, extract the main content with Fit Markdown, and generate a knowledge graph from the page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "\n",
    "from crawl4ai.extraction_strategy import LLMExtractionStrategy\n",
    "from pydantic import BaseModel\n",
    "\n",
    "# Define classes for the knowledge graph structure\n",
    "class Landmark(BaseModel):\n",
    "    name: str\n",
    "    description: str\n",
    "    activities: list[str]  # E.g., visiting, sightseeing, relaxing\n",
    "\n",
    "class City(BaseModel):\n",
    "    name: str\n",
    "    description: str\n",
    "    landmarks: list[Landmark]\n",
    "    cultural_highlights: list[str]  # E.g., food, music, traditional crafts\n",
    "\n",
    "class TravelKnowledgeGraph(BaseModel):\n",
    "    cities: list[City]  # Central Mexican cities to visit\n",
    "\n",
    "async def combined_demo():\n",
    "    # Define the knowledge graph extraction strategy\n",
    "    strategy = LLMExtractionStrategy(\n",
    "        # provider=\"ollama/nemotron\",\n",
    "        provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models\n",
    "        api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass \"no-token\"\n",
    "        schema=TravelKnowledgeGraph.schema(),\n",
    "        instruction=(\n",
    "            \"Extract cities, landmarks, and cultural highlights for places to visit in Central Mexico. \"\n",
    "            \"For each city, list main landmarks with descriptions and activities, as well as cultural highlights.\"\n",
    "        )\n",
    "    )\n",
    "\n",
    "    # Set up the AsyncWebCrawler with multi-browser support, Magic Mode, and Fit Markdown\n",
    "    async with AsyncWebCrawler(browser_type=\"firefox\") as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\",\n",
    "            extraction_strategy=strategy,\n",
    "            bypass_cache=True,\n",
    "            magic=True\n",
    "        )\n",
    "        \n",
    "        # Display main article content in Fit Markdown format\n",
    "        print(\"Extracted Main Content:\\n\", result.fit_markdown)\n",
    "        \n",
    "        # Display extracted knowledge graph of cities, landmarks, and cultural highlights\n",
    "        if result.extracted_content:\n",
    "            travel_graph = json.loads(result.extracted_content)\n",
    "            print(\"\\nExtracted Knowledge Graph:\\n\", json.dumps(travel_graph, indent=2))\n",
    "\n",
    "# Run the combined demo\n",
    "await combined_demo()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 🤖 Crawl4AI GPT Assistant\n",
    "Chat with the Crawl4AI GPT Assistant for code generation, support, and learning Crawl4AI faster. Try it out [here](https://tinyurl.com/crawl4ai-gpt)!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}