|
# Crawl4AI |
|
|
|
Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and data extraction from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
|
|
|
## Introduction |
|
|
|
Crawl4AI has one clear task: to make crawling and data extraction from web pages easy and efficient, especially for large language models (LLMs) and AI applications. Whether you are using it as a REST API or a Python library, Crawl4AI offers a robust and flexible solution with full asynchronous support. |
|
|
|
## Quick Start |
|
|
|
Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities: |
|
|
|
```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://www.nbcnews.com/business")

        # Print the extracted content
        print(result.markdown)


# Run the async main function
asyncio.run(main())
```
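
Crawling several pages at once follows the same pattern. The sketch below assumes the library's `arun_many()` helper, which takes a list of URLs and returns one `CrawlResult` per URL; the second URL is a placeholder for illustration.

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    # Both URLs are illustrative; substitute the pages you want to crawl.
    urls = [
        "https://www.nbcnews.com/business",
        "https://example.com",
    ]
    async with AsyncWebCrawler(verbose=True) as crawler:
        # arun_many() dispatches the crawls concurrently and returns
        # the results in the same order as the input list.
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(result.url, "->", len(result.markdown or ""), "characters of markdown")


asyncio.run(main())
```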
|
|
|
## Key Features ✨

- 🆓 Completely free and open-source
- 🚀 Blazing fast performance, outperforming many paid services
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Fit markdown generation for extracting main article content
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of pages with enhanced error handling
- 📜 Executes multiple custom JavaScripts before crawling
- 📊 Generates structured output without an LLM using JsonCssExtractionStrategy (see the sketch after this list)
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support for precise data extraction
- 📝 Passes instructions/keywords to refine extraction
- 🔒 Proxy support with authentication for enhanced access
- 🔄 Session management for complex multi-page crawling
- 🔄 Asynchronous architecture for improved performance
- 🖼️ Improved image processing with lazy-loading detection
- 🕰️ Enhanced handling of delayed content loading
- 🔑 Custom headers support for LLM interactions
- 🖼️ iframe content extraction for comprehensive analysis
- ⏱️ Flexible timeout and delayed content retrieval options
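
As a concrete illustration of the LLM-free structured output mentioned above, here is a minimal sketch using `JsonCssExtractionStrategy`. The schema is hypothetical: the field names and CSS selectors are placeholders you would adapt to the target page's actual markup.

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical schema: baseSelector scopes each record, and every field
# is resolved with a CSS selector relative to that scope.
schema = {
    "name": "News Teasers",
    "baseSelector": "article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}


async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema),
        )
        # extracted_content is a JSON string with one object per matched record
        print(json.loads(result.extracted_content))


asyncio.run(main())
```

Because extraction here is plain CSS matching, it is fast and deterministic, which makes it a good first choice before reaching for the LLM-based strategies.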
|
|
|
## Documentation Structure |
|
|
|
Our documentation is organized into several sections: |
|
|
|
### Basic Usage
- [Installation](basic/installation.md)
- [Quick Start](basic/quickstart.md)
- [Simple Crawling](basic/simple-crawling.md)
- [Browser Configuration](basic/browser-config.md)
- [Content Selection](basic/content-selection.md)
- [Output Formats](basic/output-formats.md)
- [Page Interaction](basic/page-interaction.md)

### Advanced Features
- [Magic Mode](advanced/magic-mode.md)
- [Session Management](advanced/session-management.md)
- [Hooks & Authentication](advanced/hooks-auth.md)
- [Proxy & Security](advanced/proxy-security.md)
- [Content Processing](advanced/content-processing.md)

### Extraction & Processing
- [Extraction Strategies Overview](extraction/overview.md)
- [LLM Integration](extraction/llm.md)
- [CSS-Based Extraction](extraction/css.md)
- [Cosine Strategy](extraction/cosine.md)
- [Chunking Strategies](extraction/chunking.md)

### API Reference
- [AsyncWebCrawler](api/async-webcrawler.md)
- [CrawlResult](api/crawl-result.md)
- [Extraction Strategies](api/strategies.md)
- [arun() Method Parameters](api/arun.md)

### Examples
- Coming soon!
|
|
|
## Getting Started |
|
|
|
1. Install Crawl4AI:

   ```bash
   pip install crawl4ai
   ```

2. Check out our [Quick Start Guide](basic/quickstart.md) to begin crawling web pages.

3. Explore our [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) to see Crawl4AI in action.
|
|
|
## Support |
|
|
|
For questions, suggestions, or issues:

- GitHub Issues: [Report a Bug](https://github.com/unclecode/crawl4ai/issues)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)

Happy Crawling! 🕸️🚀