# CrawlResult

The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.

## Class Definition

```python
class CrawlResult(BaseModel):
    """Result of a web crawling operation."""

    # Basic Information
    url: str                                 # Crawled URL
    success: bool                            # Whether crawl succeeded
    status_code: Optional[int] = None        # HTTP status code
    error_message: Optional[str] = None      # Error message if failed

    # Content
    html: str                                # Raw HTML content
    cleaned_html: Optional[str] = None       # Cleaned HTML
    fit_html: Optional[str] = None           # Most relevant HTML content
    markdown: Optional[str] = None           # HTML converted to markdown
    fit_markdown: Optional[str] = None       # Most relevant markdown content
    downloaded_files: Optional[List[str]] = None  # Downloaded files

    # Extracted Data
    extracted_content: Optional[str] = None  # Content from extraction strategy
    media: Dict[str, List[Dict]] = {}        # Extracted media information
    links: Dict[str, List[Dict]] = {}        # Extracted links
    metadata: Optional[dict] = None          # Page metadata

    # Additional Data
    screenshot: Optional[str] = None         # Base64-encoded screenshot
    session_id: Optional[str] = None         # Session identifier
    response_headers: Optional[dict] = None  # HTTP response headers
```

## Properties and Their Data Structures

### Basic Information

```python
# Access basic information
result = await crawler.arun(url="https://example.com")

print(result.url)            # "https://example.com"
print(result.success)        # True/False
print(result.status_code)    # 200, 404, etc.
print(result.error_message)  # Error details if failed
```

### Content Properties

#### HTML Content

```python
# Raw HTML
html_content = result.html

# Cleaned HTML (removed ads, popups, etc.)
clean_content = result.cleaned_html

# Most relevant HTML content
main_content = result.fit_html
```

#### Markdown Content

```python
# Full markdown version
markdown_content = result.markdown

# Most relevant markdown content
main_content = result.fit_markdown
```

### Media Content

The media dictionary contains organized media elements:

```python
# Structure
media = {
    "images": [
        {
            "src": str,      # Image URL
            "alt": str,      # Alt text
            "desc": str,     # Contextual description
            "score": float,  # Relevance score (0-10)
            "type": str,     # "image"
            "width": int,    # Image width (if available)
            "height": int,   # Image height (if available)
            "context": str,  # Surrounding text
            "lazy": bool     # Whether image was lazy-loaded
        }
    ],
    "videos": [
        {
            "src": str,          # Video URL
            "type": str,         # "video"
            "title": str,        # Video title
            "poster": str,       # Thumbnail URL
            "duration": str,     # Video duration
            "description": str   # Video description
        }
    ],
    "audios": [
        {
            "src": str,          # Audio URL
            "type": str,         # "audio"
            "title": str,        # Audio title
            "duration": str,     # Audio duration
            "description": str   # Audio description
        }
    ]
}

# Example usage
for image in result.media["images"]:
    if image["score"] > 5:  # High-relevance images
        print(f"High-quality image: {image['src']}")
        print(f"Context: {image['context']}")
```

### Link Analysis

The links dictionary organizes discovered links:

```python
# Structure
links = {
    "internal": [
        {
            "href": str,     # URL
            "text": str,     # Link text
            "title": str,    # Title attribute
            "type": str,     # Link type (nav, content, etc.)
            "context": str,  # Surrounding text
            "score": float   # Relevance score
        }
    ],
    "external": [
        {
            "href": str,     # External URL
            "text": str,     # Link text
            "title": str,    # Title attribute
            "domain": str,   # Domain name
            "type": str,     # Link type
            "context": str   # Surrounding text
        }
    ]
}

# Example usage
for link in result.links["internal"]:
    print(f"Internal link: {link['href']}")
    print(f"Context: {link['context']}")
```

### Metadata

The metadata dictionary contains page information:

```python
# Structure
metadata = {
    "title": str,           # Page title
    "description": str,     # Meta description
    "keywords": List[str],  # Meta keywords
    "author": str,          # Author information
    "published_date": str,  # Publication date
    "modified_date": str,   # Last modified date
    "language": str,        # Page language
    "canonical_url": str,   # Canonical URL
    "og_data": Dict,        # Open Graph data
    "twitter_data": Dict    # Twitter card data
}

# Example usage
if result.metadata:
    print(f"Title: {result.metadata['title']}")
    print(f"Author: {result.metadata.get('author', 'Unknown')}")
```

### Extracted Content

Content from extraction strategies:

```python
import json

# For LLM or CSS extraction strategies
if result.extracted_content:
    structured_data = json.loads(result.extracted_content)
    print(structured_data)
```

### Screenshot

Base64-encoded screenshot:

```python
# Save screenshot if available
if result.screenshot:
    import base64

    # Decode and save
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```

## Usage Examples

### Basic Content Access

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")

    if result.success:
        # Get clean content
        print(result.fit_markdown)

        # Process images
        for image in result.media["images"]:
            if image["score"] > 7:
                print(f"High-quality image: {image['src']}")
```

### Complete Data Processing

```python
async def process_webpage(url: str) -> Dict:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)

        if not result.success:
            raise Exception(f"Crawl failed: {result.error_message}")

        return {
            "content": result.fit_markdown,
            "images": [
                img for img in result.media["images"]
                if img["score"] > 5
            ],
            "internal_links": [
                link["href"] for link in result.links["internal"]
            ],
            "metadata": result.metadata,
            "status": result.status_code
        }
```

### Error Handling

```python
async def safe_crawl(url: str) -> Dict:
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(url=url)

            if not result.success:
                return {
                    "success": False,
                    "error": result.error_message,
                    "status": result.status_code
                }

            return {
                "success": True,
                "content": result.fit_markdown,
                "status": result.status_code
            }

        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "status": None
            }
```

## Best Practices

1. **Always Check Success**

   ```python
   if not result.success:
       print(f"Error: {result.error_message}")
       return
   ```

2. **Use fit_markdown for Articles**

   ```python
   # Better for article content
   content = result.fit_markdown if result.fit_markdown else result.markdown
   ```

3. **Filter Media by Score**

   ```python
   relevant_images = [
       img for img in result.media["images"]
       if img["score"] > 5
   ]
   ```

4. **Handle Missing Data**

   ```python
   metadata = result.metadata or {}
   title = metadata.get('title', 'Unknown Title')
   ```
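The filtering and fallback patterns above can be combined into a single post-processing helper. The sketch below operates on plain dictionaries shaped like the `media`, `links`, and `metadata` structures documented earlier; the `summarize_result` helper and the sample data are illustrative assumptions, not part of the library:

```python
from typing import Optional


def summarize_result(media: dict, links: dict,
                     metadata: Optional[dict], min_score: float = 5.0) -> dict:
    """Filter media by relevance score and tolerate missing metadata."""
    # Keep only high-relevance images (best practice 3)
    images = [
        img["src"] for img in media.get("images", [])
        if img.get("score", 0) > min_score
    ]
    internal_links = [link["href"] for link in links.get("internal", [])]
    # Handle missing metadata gracefully (best practice 4)
    meta = metadata or {}
    return {
        "title": meta.get("title", "Unknown Title"),
        "images": images,
        "internal_links": internal_links,
    }


# Hypothetical sample data mirroring the structures shown above
sample_media = {"images": [{"src": "/hero.png", "score": 7.2},
                           {"src": "/ad.gif", "score": 1.4}]}
sample_links = {"internal": [{"href": "/about"}], "external": []}

print(summarize_result(sample_media, sample_links, {"title": "Example Domain"}))
# {'title': 'Example Domain', 'images': ['/hero.png'], 'internal_links': ['/about']}
```

In real use you would pass `result.media`, `result.links`, and `result.metadata` from a successful crawl; keeping the helper free of crawler calls makes it easy to unit-test.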