# Simple Crawling

This guide covers the basics of web crawling with Crawl4AI. You'll learn how to set up a crawler, make your first request, and understand the response.

## Basic Usage

Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```
## Understanding the Response

The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):

```python
result = await crawler.arun(
    url="https://example.com",
    config=CrawlerRunConfig(fit_markdown=True)
)

# Different content formats
print(result.html)          # Raw HTML
print(result.cleaned_html)  # Cleaned HTML
print(result.markdown)      # Markdown version
print(result.fit_markdown)  # Most relevant content in markdown

# Check success status
print(result.success)      # True if crawl succeeded
print(result.status_code)  # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)  # Dictionary of found media (images, videos, audio)
print(result.links)  # Dictionary of internal and external links
```
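The `media` and `links` dictionaries map category names to lists of entries. As a minimal sketch before deeper processing, you can tally them with a small helper (the function below is illustrative, not part of Crawl4AI's API; it only assumes the documented dictionary shapes):

```python
def summarize_result(media, links):
    """Count entries per category in a CrawlResult's media/links dicts.

    Assumes the documented shapes: `media` maps names like "images",
    "videos", "audio" to lists of items, and `links` maps
    "internal"/"external" to lists of link entries.
    """
    media_counts = {kind: len(items) for kind, items in media.items()}
    link_counts = {kind: len(items) for kind, items in links.items()}
    return media_counts, link_counts
```

For example, `summarize_result(result.media, result.links)` gives a quick overview of how much a page yielded before you iterate over individual items.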
## Adding Basic Options

Customize your crawl using `CrawlerRunConfig`:

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,       # Minimum words per content block
    exclude_external_links=True,   # Remove external links
    remove_overlay_elements=True,  # Remove popups/modals
    process_iframes=True           # Process iframe content
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```
## Handling Errors

Always check whether the crawl succeeded before using the result:

```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
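For transient failures such as timeouts or 5xx responses, a simple retry loop around the crawl call can help. The sketch below is a hypothetical helper, not a Crawl4AI feature; it only assumes the result object exposes the documented boolean `success` attribute:

```python
import asyncio

async def crawl_with_retry(crawl_fn, url, max_retries=3, base_delay=1.0):
    """Call an async crawl function, retrying failures with exponential backoff.

    `crawl_fn(url)` must return an object with a boolean `success`
    attribute, like Crawl4AI's CrawlResult. Returns the last result,
    whether or not it succeeded.
    """
    result = None
    for attempt in range(max_retries):
        result = await crawl_fn(url)
        if result.success:
            return result
        if attempt < max_retries - 1:
            # Back off 1s, 2s, 4s, ... between attempts
            await asyncio.sleep(base_delay * (2 ** attempt))
    return result
```

Usage inside the crawler context might look like `await crawl_with_retry(lambda u: crawler.arun(url=u, config=run_config), "https://example.com")`, with the final result still checked via `result.success`.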
## Logging and Debugging

Enable verbose logging in `BrowserConfig`:

```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```
## Complete Example

Here's a more comprehensive example demonstrating common usage patterns:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```