This function has multiple overloads:
- Extract From Page or Locator
- Extract From Content
Extracts structured data from web pages using AI-powered content analysis. The function supports several extraction strategies (HTML parsing, image analysis, and Markdown conversion) and can also extract from standalone text or image content.
It works on entire pages or on specific elements, includes built-in caching and retry mechanisms, and supports DOM matching.
Features and limitations
Features:
- Smart Caching: Hashes inputs and uses KV Cache for persistent storage
- DOM Matching: With enable_dom_matching=True, values match DOM elements for smart caching
- Multiple Strategies: HTML, IMAGE, or MARKDOWN based on content type
- Flexible Models: Use any up-to-date model from Anthropic, OpenAI, or Google based on your needs
Limitations:
- Model Variability: Quality varies by model; experiment to find the best fit
- DOM Complexity: Dynamic structures can affect caching and matching
- IMAGE Constraints: Cannot capture truncated or off-screen content
- Schema Design: Complex schemas may reduce accuracy
Examples
Arguments
Playwright Page object to extract data from the entire page or Locator object to extract data from a specific element.
Schema defining the structure of the data to extract. Can be either a Pydantic BaseModel class or a JSON Schema dictionary.
Optional prompt to guide the extraction process and provide more context. Defaults to None.
Type of extraction strategy:
- "HTML" (default) - Best for text-heavy pages with structured content
- "IMAGE" - Best for visual content, charts, or complex layouts
- "MARKDOWN" - Best for article-style content with semantic structure
Whether to enable DOM element matching during extraction. When enabled, all types in the schema must be strings to match with the DOM elements. Extraction results are mapped to their corresponding DOM elements and returned with matched results. These results are intelligently cached, allowing subsequent extractions with minor DOM changes to utilize the cached data for improved performance. Defaults to False.
Whether to enable caching of extraction results. Defaults to True.
Maximum number of retry attempts on failures. Failures can be validation errors, API errors, output errors, etc. Defaults to 3.
AI model to use for extraction. Defaults to "claude-haiku-4-5-20251001".
Optional API key for AI extraction (if provided, will not be billed to your account). Defaults to None.
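As an illustration of the schema argument and the string-only requirement for DOM matching, the sketch below defines a JSON Schema dictionary and checks it before use. The `product_schema` fields and the `non_string_leaves` helper are hypothetical and not part of the library. The sketch also assumes that "all types must be strings" applies to leaf fields, with objects and arrays still allowed as containers.

```python
from typing import Any, Dict, List

# Hypothetical JSON Schema dictionary: every leaf field is a string,
# so it would satisfy the DOM-matching requirement as interpreted here.
product_schema: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "price"],
}


def non_string_leaves(schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return the paths of leaf fields whose type is not "string"."""
    kind = schema.get("type")
    if kind == "object":
        problems: List[str] = []
        for name, sub_schema in schema.get("properties", {}).items():
            problems.extend(non_string_leaves(sub_schema, f"{path}.{name}"))
        return problems
    if kind == "array":
        return non_string_leaves(schema.get("items", {}), f"{path}[]")
    return [] if kind == "string" else [path]
```

Running a check like this before enabling DOM matching surfaces offending fields early. For example, a `price` field declared as a number would be reported as `$.price`.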