# Simple Crawling

This guide covers the basics of web crawling with Crawl4AI. You'll learn how to set up a crawler, make your first request, and understand the response.

## Basic Usage

Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```
## Understanding the Response

The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):

```python
result = await crawler.arun(
    url="https://example.com",
    config=CrawlerRunConfig(fit_markdown=True)
)

# Different content formats
print(result.html)          # Raw HTML
print(result.cleaned_html)  # Cleaned HTML
print(result.markdown)      # Markdown version
print(result.fit_markdown)  # Most relevant content in markdown

# Check success status
print(result.success)      # True if crawl succeeded
print(result.status_code)  # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)  # Dictionary of found media (images, videos, audio)
print(result.links)  # Dictionary of internal and external links
```
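The `media` and `links` dictionaries map category names to lists of entries. As a minimal sketch before deeper processing, you can tally them with a small helper (the function below is illustrative, not part of Crawl4AI's API; it only assumes the documented dictionary shapes):

```python
def summarize_result(media, links):
    """Count entries per category in a CrawlResult's media/links dicts.

    Assumes the documented shapes: `media` maps names like "images",
    "videos", "audio" to lists of items, and `links` maps
    "internal"/"external" to lists of link entries.
    """
    media_counts = {kind: len(items) for kind, items in media.items()}
    link_counts = {kind: len(items) for kind, items in links.items()}
    return media_counts, link_counts
```

For example, `summarize_result(result.media, result.links)` gives a quick overview of how much a page yielded before you iterate over individual items.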
## Adding Basic Options

Customize your crawl using `CrawlerRunConfig`:

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,       # Minimum words per content block
    exclude_external_links=True,   # Remove external links
    remove_overlay_elements=True,  # Remove popups/modals
    process_iframes=True           # Process iframe content
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```
## Handling Errors

Always check whether the crawl succeeded before using the result:

```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
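For transient failures such as timeouts or 5xx responses, a simple retry loop around the crawl call can help. The sketch below is a hypothetical helper, not a Crawl4AI feature; it only assumes the result object exposes the documented boolean `success` attribute:

```python
import asyncio

async def crawl_with_retry(crawl_fn, url, max_retries=3, base_delay=1.0):
    """Call an async crawl function, retrying failures with exponential backoff.

    `crawl_fn(url)` must return an object with a boolean `success`
    attribute, like Crawl4AI's CrawlResult. Returns the last result,
    whether or not it succeeded.
    """
    result = None
    for attempt in range(max_retries):
        result = await crawl_fn(url)
        if result.success:
            return result
        if attempt < max_retries - 1:
            # Back off 1s, 2s, 4s, ... between attempts
            await asyncio.sleep(base_delay * (2 ** attempt))
    return result
```

Usage inside the crawler context might look like `await crawl_with_retry(lambda u: crawler.arun(url=u, config=run_config), "https://example.com")`, with the final result still checked via `result.success`.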
## Logging and Debugging

Enable verbose logging in `BrowserConfig`:

```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```
## Complete Example

Here's a more comprehensive example demonstrating common usage patterns:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```