### Content Selection

Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.

#### CSS Selectors

Extract specific content using a `CrawlerRunConfig` with CSS selectors:

```python
from crawl4ai.async_configs import CrawlerRunConfig

# Target the main article content
config = CrawlerRunConfig(css_selector=".main-article")
result = await crawler.arun(url="https://crawl4ai.com", config=config)

# Target multiple elements: heading and content
config = CrawlerRunConfig(css_selector="article h1, article .content")
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```

#### Content Filtering

Control content inclusion or exclusion with `CrawlerRunConfig`:

```python
config = CrawlerRunConfig(
    word_count_threshold=10,           # Minimum words per block
    excluded_tags=['form', 'header', 'footer', 'nav'],  # Tags to exclude
    exclude_external_links=True,       # Remove external links
    exclude_social_media_links=True,   # Remove social media links
    exclude_external_images=True       # Remove external images
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```

#### Iframe Content

Process iframe content by enabling specific options in `CrawlerRunConfig`:

```python
config = CrawlerRunConfig(
    process_iframes=True,          # Extract iframe content
    remove_overlay_elements=True   # Remove popups/modals that might block iframes
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```

#### Structured Content Selection Using LLMs

Leverage LLMs for intelligent content extraction:

```python
import json

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),  # In Pydantic v2, use ArticleContent.model_json_schema()
    instruction="Extract the main article title, key points, and conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
```

#### Pattern-Based Selection

Extract content matching repetitive patterns:

```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
```

#### Comprehensive Example

Combine different selection methods using `CrawlerRunConfig`:

```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Define configuration
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return json.loads(result.extracted_content)
```
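To clarify what `json.loads(result.extracted_content)` returns, here is a minimal sketch with a hypothetical payload (no live crawl involved): each element matched by `baseSelector` becomes one dict, and `"nested"` fields are grouped under their parent field name. The sample values are illustrative, not output from a real page.

```python
import json

# Hypothetical JSON string of the shape produced for the "News Articles"
# schema above: one dict per baseSelector match, nested fields grouped
# under their field name.
extracted_content = json.dumps([
    {
        "headline": "Example headline",
        "summary": "Short summary text.",
        "category": "Tech",
        "metadata": {"author": "A. Writer", "date": "2024-01-01"},
    }
])

articles = json.loads(extracted_content)
print(articles[0]["headline"])            # Example headline
print(articles[0]["metadata"]["author"])  # A. Writer
```

Because `extracted_content` is always a JSON string rather than a Python object, remember to call `json.loads` before indexing into the results.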