# AsyncWebCrawler

The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.

## Constructor

```python
AsyncWebCrawler(
    # Browser Settings
    browser_type: str = "chromium",          # Options: "chromium", "firefox", "webkit"
    headless: bool = True,                   # Run browser in headless mode
    verbose: bool = False,                   # Enable verbose logging

    # Cache Settings
    always_by_pass_cache: bool = False,      # Always bypass cache
    base_directory: str = str(
        os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())
    ),                                       # Base directory for cache

    # Network Settings
    proxy: str = None,                       # Simple proxy URL
    proxy_config: Dict = None,               # Advanced proxy configuration

    # Browser Behavior
    sleep_on_close: bool = False,            # Wait before closing browser

    # Custom Settings
    user_agent: str = None,                  # Custom user agent
    headers: Dict[str, str] = {},            # Custom HTTP headers
    js_code: Union[str, List[str]] = None,   # Default JavaScript to execute
)
```

### Parameters in Detail

#### Browser Settings

- **browser_type** (str, optional)
  - Default: `"chromium"`
  - Options: `"chromium"`, `"firefox"`, `"webkit"`
  - Controls which browser engine to use
  ```python
  # Example: Using Firefox
  crawler = AsyncWebCrawler(browser_type="firefox")
  ```

- **headless** (bool, optional)
  - Default: `True`
  - When `True`, the browser runs without a GUI
  - Set to `False` for debugging
  ```python
  # Visible browser for debugging
  crawler = AsyncWebCrawler(headless=False)
  ```

- **verbose** (bool, optional)
  - Default: `False`
  - Enables detailed logging
  ```python
  # Enable detailed logging
  crawler = AsyncWebCrawler(verbose=True)
  ```

#### Cache Settings

- **always_by_pass_cache** (bool, optional)
  - Default: `False`
  - When `True`, always fetches fresh content
  ```python
  # Always fetch fresh content
  crawler = AsyncWebCrawler(always_by_pass_cache=True)
  ```

- **base_directory** (str, optional)
  - Default: User's home directory
  - Base path for cache storage
  ```python
  # Custom cache directory
  crawler = AsyncWebCrawler(base_directory="/path/to/cache")
  ```

#### Network Settings

- **proxy** (str, optional)
  - Simple proxy URL
  ```python
  # Using a simple proxy
  crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
  ```

- **proxy_config** (Dict, optional)
  - Advanced proxy configuration with authentication
  ```python
  # Advanced proxy with auth
  crawler = AsyncWebCrawler(proxy_config={
      "server": "http://proxy.example.com:8080",
      "username": "user",
      "password": "pass"
  })
  ```

#### Browser Behavior

- **sleep_on_close** (bool, optional)
  - Default: `False`
  - Adds a delay before closing the browser
  ```python
  # Wait before closing
  crawler = AsyncWebCrawler(sleep_on_close=True)
  ```

#### Custom Settings

- **user_agent** (str, optional)
  - Custom user agent string
  ```python
  # Custom user agent
  crawler = AsyncWebCrawler(
      user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
  )
  ```

- **headers** (Dict[str, str], optional)
  - Custom HTTP headers
  ```python
  # Custom headers
  crawler = AsyncWebCrawler(
      headers={
          "Accept-Language": "en-US",
          "Custom-Header": "Value"
      }
  )
  ```

- **js_code** (Union[str, List[str]], optional)
  - Default JavaScript to execute on each page
  ```python
  # Default JavaScript
  crawler = AsyncWebCrawler(
      js_code=[
          "window.scrollTo(0, document.body.scrollHeight);",
          "document.querySelector('.load-more').click();"
      ]
  )
  ```
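These settings can be combined in a single constructor call. The sketch below is illustrative only, assuming the usual `from crawl4ai import AsyncWebCrawler` import; the header value, commented-out proxy URL, and JavaScript snippet are placeholders rather than recommended values:

```python
import asyncio
from crawl4ai import AsyncWebCrawler  # assumed import path

async def main():
    # Combine several constructor options from the sections above.
    async with AsyncWebCrawler(
        browser_type="chromium",                 # engine choice (Browser Settings)
        headless=True,
        verbose=True,
        always_by_pass_cache=False,              # reuse cached results when available
        # proxy="http://proxy.example.com:8080", # placeholder; enable if you need a proxy
        headers={"Accept-Language": "en-US"},
        js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    ) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.success)

asyncio.run(main())
```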
## Methods

### arun()

The primary method for crawling web pages.

```python
async def arun(
    # Required
    url: str,                                  # URL to crawl

    # Content Selection
    css_selector: str = None,                  # CSS selector for content
    word_count_threshold: int = 10,            # Minimum words per block

    # Cache Control
    bypass_cache: bool = False,                # Bypass cache for this request

    # Session Management
    session_id: str = None,                    # Session identifier

    # Screenshot Options
    screenshot: bool = False,                  # Take screenshot
    screenshot_wait_for: float = None,         # Wait before screenshot

    # Content Processing
    process_iframes: bool = False,             # Process iframe content
    remove_overlay_elements: bool = False,     # Remove popups/modals

    # Anti-Bot Settings
    simulate_user: bool = False,               # Simulate human behavior
    override_navigator: bool = False,          # Override navigator properties
    magic: bool = False,                       # Enable all anti-detection features

    # Content Filtering
    excluded_tags: List[str] = None,           # HTML tags to exclude
    exclude_external_links: bool = False,      # Remove external links
    exclude_social_media_links: bool = False,  # Remove social media links

    # JavaScript Handling
    js_code: Union[str, List[str]] = None,     # JavaScript to execute
    wait_for: str = None,                      # Wait condition

    # Page Loading
    page_timeout: int = 60000,                 # Page load timeout (ms)
    delay_before_return_html: float = None,    # Wait before returning HTML

    # Extraction
    extraction_strategy: ExtractionStrategy = None  # Extraction strategy
) -> CrawlResult:
```

### Usage Examples

#### Basic Crawling

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
```

#### Advanced Crawling

```python
async with AsyncWebCrawler(
    browser_type="firefox",
    verbose=True,
    headers={"Custom-Header": "Value"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        css_selector=".main-content",
        word_count_threshold=20,
        process_iframes=True,
        magic=True,
        wait_for="css:.dynamic-content",
        screenshot=True
    )
```

#### Session Management

```python
async with AsyncWebCrawler() as crawler:
    # First request
    result1 = await crawler.arun(
        url="https://example.com/login",
        session_id="my_session"
    )

    # Subsequent request using the same session
    result2 = await crawler.arun(
        url="https://example.com/protected",
        session_id="my_session"
    )
```
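Session reuse can be combined with the `js_code` and `wait_for` options documented above, for example to authenticate once and then crawl protected pages in the same browser session. The sketch below is a rough illustration, not a drop-in recipe: the URLs, form selectors, and the `.dashboard` wait condition are placeholders for whatever the target site actually uses.

```python
import asyncio
from crawl4ai import AsyncWebCrawler  # assumed import path

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Step 1: open the login page in a named session (placeholder URL and selectors).
        login = await crawler.arun(
            url="https://example.com/login",
            session_id="my_session",
            js_code=[
                "document.querySelector('#username').value = 'user';",
                "document.querySelector('#password').value = 'pass';",
                "document.querySelector('form').submit();",
            ],
            wait_for="css:.dashboard",   # wait until the post-login page renders
        )

        # Step 2: reuse the same session to reach a protected page.
        protected = await crawler.arun(
            url="https://example.com/protected",
            session_id="my_session",
            bypass_cache=True,           # always fetch the live, authenticated page
        )
        if not protected.success:
            print(f"Crawl failed: {protected.error_message}")

asyncio.run(main())
```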
## Context Manager

AsyncWebCrawler implements the async context manager protocol:

```python
async def __aenter__(self) -> 'AsyncWebCrawler':
    # Initialize browser and resources
    return self

async def __aexit__(self, *args):
    # Clean up resources
    pass
```

Always use AsyncWebCrawler with the async context manager:

```python
async with AsyncWebCrawler() as crawler:
    # Your crawling code here
    pass
```

## Best Practices

1. **Resource Management**
   ```python
   # Always use the context manager
   async with AsyncWebCrawler() as crawler:
       # Crawler will be properly cleaned up
       pass
   ```

2. **Error Handling**
   ```python
   try:
       async with AsyncWebCrawler() as crawler:
           result = await crawler.arun(url="https://example.com")
           if not result.success:
               print(f"Crawl failed: {result.error_message}")
   except Exception as e:
       print(f"Error: {str(e)}")
   ```

3. **Performance Optimization**
   ```python
   # Enable caching for better performance
   crawler = AsyncWebCrawler(
       always_by_pass_cache=False,
       verbose=True
   )
   ```

4. **Anti-Detection**
   ```python
   # Maximum stealth
   crawler = AsyncWebCrawler(
       headless=True,
       user_agent="Mozilla/5.0...",
       headers={"Accept-Language": "en-US"}
   )
   result = await crawler.arun(
       url="https://example.com",
       magic=True,
       simulate_user=True
   )
   ```

## Note on Browser Types

Each browser type has its own characteristics:

- **chromium**: Best overall compatibility
- **firefox**: Useful when a site renders or behaves differently under the Gecko engine
- **webkit**: Lighter weight, good for basic crawling

Choose based on your specific needs:

```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")

# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```
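Finally, the `extraction_strategy` parameter of `arun()` accepts an extraction strategy object for structured extraction. The sketch below assumes the `JsonCssExtractionStrategy` class from `crawl4ai.extraction_strategy`, the schema keys shown, and the `extracted_content` field on `CrawlResult`; treat those names as assumptions to verify against the extraction strategy documentation, and the selectors and URL as placeholders.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler  # assumed import path
# Assumption: JsonCssExtractionStrategy lives in crawl4ai.extraction_strategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Assumed schema layout: a name, a base CSS selector, and a list of fields.
schema = {
    "name": "Articles",
    "baseSelector": "article.post",  # hypothetical selector
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",  # placeholder URL
            extraction_strategy=JsonCssExtractionStrategy(schema),
            bypass_cache=True,
        )
        if result.success:
            # Assumption: extracted_content holds the structured data as a JSON string
            print(json.loads(result.extracted_content))
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())
```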