# CrawlerRunConfig Parameters Documentation

## Content Processing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |

## Caching Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS |
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED |
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY |
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY |

## Page Navigation and Timing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |

## Page Interaction Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |

## Media Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |

## Link and Domain Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |

## Debugging and Logging Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |
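
The Caching Parameters table says that each legacy flag behaves like one `CacheMode` value and that an unset `cache_mode` falls back to `CacheMode.ENABLED`. The sketch below illustrates that mapping in isolation. It is not the library's implementation: `CacheMode` here is a local stand-in enum, `resolve_cache_mode` is a hypothetical helper, and both the rule that an explicit `cache_mode` wins and the order in which the legacy flags are checked are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    """Local stand-in for the library's CacheMode (member values are illustrative)."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(
    cache_mode: Optional[CacheMode] = None,
    *,
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Hypothetical helper: map the legacy flags onto a single CacheMode."""
    # Assumption: an explicitly supplied cache_mode takes priority over legacy flags.
    if cache_mode is not None:
        return cache_mode
    # Assumption: this checking order; the table only gives the per-flag mapping.
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY   # never read the cache, still write to it
    if no_cache_write:
        return CacheMode.READ_ONLY    # read from the cache, never write to it
    # Documented fallback when nothing is set.
    return CacheMode.ENABLED


if __name__ == "__main__":
    print(resolve_cache_mode())
    print(resolve_cache_mode(no_cache_read=True))
    print(resolve_cache_mode(CacheMode.BYPASS, disable_cache=True))
```

In new code, setting `cache_mode` directly avoids any ambiguity about how multiple legacy flags combine.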