extract_structured_data

This function uses AI and incurs costs.

This function has multiple overloads

Extract From Page or Locator
Extract From Content

Extracts structured data from web pages using AI-powered content analysis.This function provides intelligent data extraction from web pages using various strategies including HTML parsing, image analysis, and Markdown conversion. Or by using Text or Image Content. It supports extraction from entire pages or specific elements, with built-in caching and retry mechanisms.

async def extract_structured_data(
    *,
    source: Page | Locator,
    data_schema: type[BaseModel] | dict[str, Any],
    prompt: str | None,
    strategy: Literal['IMAGE', 'MARKDOWN', 'HTML'],
    model: str,
    api_key: str | None,
    enable_dom_matching: bool | None,
    enable_cache: bool | None,
    max_retries: int | None,
) -> Any

Extract data from web pages or specific elements using HTML, IMAGE, or MARKDOWN strategies with DOM matching support.

Features and limitations

Features:

Smart Caching: Hashes inputs and uses KV Cache for persistent storage
DOM Matching: With enable_dom_matching=True, values match DOM elements for smart caching
Multiple Strategies: HTML, IMAGE, or MARKDOWN based on content type
Flexible Models: Use any up-to-date model from anthropic, openai or google based on your needs.

Limitations:

Model Variability: Quality varies by model - experiment to find the best fit
DOM Complexity: Dynamic structures can affect caching and matching
IMAGE Constraints: Cannot capture truncated or off-screen content
Schema Design: Complex schemas may reduce accuracy

Examples

from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import extract_structured_data
class Params(TypedDict):
    pass
async def automation(page: Page, params: Params, **_kwargs):
    await page.goto('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
    # This will extract the book details from the page, using the HTML strategy with the gpt-4o model.
    # The data_schema is a JSON Schema dictionary that defines the structure of the data to extract.
    # You can also use a Pydantic BaseModel instead of a JSON Schema dictionary.
    book = await extract_structured_data(
        source=page,
        strategy="HTML",  # The HTML strategy is the default strategy and will be used if no strategy is provided.
        model="gpt-4o",
        data_schema={
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "description": {"type": "string"},
                "in_stock": {"type": "string"}, # Must be string because we enabled DOM matching
                "rating": {"type": "string"}
            },
            "required": ["name", "price"]
        },
        prompt="Extract book details from this page",
        enable_cache=True,  # since this is True, the method will call AI for the first time, and then whenever you call this method it will return cached results as long as the DOM is the same.
        enable_dom_matching=True,  # since this is True, the method will return the results mapped to the DOM elements, you MUST enable cache for this to work.
        max_retries=3
    )
    print(f"Found book: {book['name']} - {book['price']}")

Arguments

source

Page | Locator

required

Playwright Page object to extract data from the entire page or Locator object to extract data from a specific element.

data_schema

type[BaseModel] | dict[str, Any]

required

Schema defining the structure of the data to extract. Can be either a Pydantic BaseModel class or a JSON Schema dictionary.

prompt

str

Optional prompt to guide the extraction process and provide more context. Defaults to None.

strategy

Literal['IMAGE', 'MARKDOWN', 'HTML']

Type of extraction strategy:

“HTML” (default) - Best for text-heavy pages with structured content
“IMAGE” - Best for visual content, charts, or complex layouts
“MARKDOWN” - Best for article-style content with semantic structure

enable_dom_matching

bool

Whether to enable DOM element matching during extraction. When enabled, all types in the schema must be strings to match with the DOM elements. Extraction results are mapped to their corresponding DOM elements and returned with matched results. These results are intelligently cached, allowing subsequent extractions with minor DOM changes to utilize the cached data for improved performance. Defaults to False.

enable_cache

bool

Whether to enable caching of extraction results. Defaults to True.

max_retries

int

Maximum number of retry attempts on failures. Failures can be validation errors, API errors, output errors, etc. Defaults to 3.

model

str

AI model to use for extraction. Defaults to “claude-haiku-4-5-20251001”.

api_key

str

Optional API key for AI extraction (if provided, will not be billed to your account). Defaults to None.

async def extract_structured_data(
    *,
    content: list[ContentItem] | ContentItem,
    data_schema: type[BaseModel] | dict[str, Any],
    prompt: str | None,
    max_retries: int | None,
    enable_cache: bool | None,
    model: str,
    api_key: str | None,
) -> Any

Extract data from text, image buffers, or image URLs without requiring a page source.

Features and limitations

Features:

Smart Caching: Hashes content and uses KV Cache for persistent storage
Multiple Content Items: Combine text, images (buffer or URL) for comprehensive extraction
Flexible Models: Use any up-to-date model from anthropic, openai or google based on your needs.

Limitations:

Model Variability: Quality varies by model - experiment to find the best fit
Schema Design: Complex schemas may reduce accuracy
Content Quality: Requires meaningful, contextual content for accurate extraction - sparse or ambiguous content produces poor results

Examples

from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import extract_structured_data, TextContentItem
class Params(TypedDict):
    pass
async def automation(page: Page, params: Params, **_kwargs):
    # This will extract the person information from the text, using the gpt-4o model.
    text_content: TextContentItem = {
        "type": "text",
        "data": "John Doe, age 30, works as a Software Engineer at Tech Corp"
    }
    person = await extract_structured_data(
        content=text_content,
        model="gpt-4o",
        data_schema={
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "number"},
                "occupation": {"type": "string"},
                "company": {"type": "string"}
            },
            "required": ["name"]
        },
        prompt="Extract person information from the text"
    )
    print(f"Found person: {person['name']}, {person['age']} years old")

Arguments

content

list[ContentItem] | ContentItem

required

Content to extract data from - can be a single content item or array of ContentItem.

data_schema

type[BaseModel] | dict[str, Any]

required

Schema defining the expected structure of the extracted data. Can be either a Pydantic BaseModel class or a JSON Schema dictionary.

prompt

str

Optional prompt to guide the extraction process and provide more context. Defaults to None.

max_retries

int

Maximum number of retry attempts on failures. Failures can be validation errors, API errors, output errors, etc. Defaults to 3.

enable_cache

bool

Whether to enable caching of the extracted data. Defaults to True.

model

str

AI model to use for extraction. Defaults to “claude-haiku-4-5-20251001”.

api_key

str

Optional API key for AI extraction (if provided, will not be billed to your account). Defaults to None.

Returns: `Any`

The extracted structured data conforming to the provided schema.

Introduction

Typescript SDK

Python SDK

extract_structured_data

Features and limitations

Examples

Arguments

Features and limitations

Examples

Arguments

Returns: `Any`

Introduction

Typescript SDK

Python SDK

​Features and limitations

​Examples

​Arguments

​Features and limitations

​Examples

​Arguments

​Returns: Any

Features and limitations

Examples

Arguments

Features and limitations

Examples

Arguments

Returns: `Any`