# extract_structured_data
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/ai/functions/extract_structured_data
This function uses AI and incurs costs.
This function has multiple overloads.
Extracts structured data from web pages using AI-powered content analysis.
This function provides intelligent data extraction from web pages using various strategies,
including HTML parsing, image analysis, and Markdown conversion. It can also extract directly from text or image content.
It supports extraction from entire pages or specific elements, with built-in caching and retry mechanisms.
```python theme={null}
async def extract_structured_data(
    *,
    source: Page | Locator,
    data_schema: type[BaseModel] | dict[str, Any],
    prompt: str | None,
    strategy: Literal['IMAGE', 'MARKDOWN', 'HTML'],
    model: str,
    api_key: str | None,
    enable_dom_matching: bool | None,
    enable_cache: bool | None,
    max_retires: int | None,
) -> Any
```
Extract data from web pages or specific elements using HTML, IMAGE, or MARKDOWN strategies with DOM matching support.
## Features and limitations
**Features:**
* **Smart Caching:** Hashes inputs and uses [KV Cache](https://docs.intunedhq.com/docs/01-learn/recipes/kv-cache) for persistent storage
* **DOM Matching:** With `enable_dom_matching=True`, values match DOM elements for smart caching
* **Multiple Strategies:** HTML, IMAGE, or MARKDOWN based on content type
* **Flexible Models:** Use any up-to-date model from Anthropic, OpenAI, or Google based on your needs.
**Limitations:**
* **Model Variability:** Quality varies by model - experiment to find the best fit
* **DOM Complexity:** Dynamic structures can affect caching and matching
* **IMAGE Constraints:** Cannot capture truncated or off-screen content
* **Schema Design:** Complex schemas may reduce accuracy
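The Smart Caching feature above hashes the inputs before looking them up in the KV Cache. How the SDK derives its keys is not documented here, so the following is only a minimal sketch of the general idea. The `make_cache_key` helper is hypothetical, and a plain dictionary stands in for the KV Cache.

```python
import hashlib
import json
from typing import Any, Optional


def make_cache_key(content: str, data_schema: dict, prompt: Optional[str], model: str) -> str:
    """Hypothetical helper: derive a stable key from the extraction inputs."""
    payload = json.dumps(
        {"content": content, "schema": data_schema, "prompt": prompt, "model": model},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# A plain dict stands in for persistent KV storage in this sketch.
cache: dict[str, Any] = {}

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
key = make_cache_key("<h1>A Light in the Attic</h1>", schema, None, "gpt-4o")
cache[key] = {"name": "A Light in the Attic"}

# Identical inputs map to the same key, so the cached result can be reused.
same_key = make_cache_key("<h1>A Light in the Attic</h1>", schema, None, "gpt-4o")
# Any change to the inputs (here, the page content) produces a different key.
other_key = make_cache_key("<h1>Another Book</h1>", schema, None, "gpt-4o")

print(same_key == key, other_key == key)
```

Under this assumption, a changed DOM, schema, prompt, or model would each lead to a cache miss and a fresh AI call.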
## Examples
```python Extract book details theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import extract_structured_data
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
    # Extract the book details from the page using the HTML strategy with the gpt-4o model.
    # data_schema is a JSON Schema dictionary that defines the structure of the data to extract.
    # You can also pass a Pydantic BaseModel instead of a JSON Schema dictionary.
    book = await extract_structured_data(
        source=page,
        strategy="HTML",  # HTML is the default strategy and is used if none is provided.
        model="gpt-4o",
        data_schema={
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "description": {"type": "string"},
                "in_stock": {"type": "string"},  # Must be a string because DOM matching is enabled
                "rating": {"type": "string"}
            },
            "required": ["name", "price"]
        },
        prompt="Extract book details from this page",
        # With caching enabled, the first call uses AI; later calls return the
        # cached result as long as the DOM is the same.
        enable_cache=True,
        # With DOM matching enabled, results are mapped to DOM elements.
        # You MUST enable the cache for this to work.
        enable_dom_matching=True,
        max_retires=3
    )
    print(f"Found book: {book['name']} - {book['price']}")
```
```python Extract all books listings theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import extract_structured_data
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto('https://books.toscrape.com/')
    # Extract all book listings from the page using the HTML strategy with the claude-3-7-sonnet-latest model.
    # data_schema is a JSON Schema dictionary that defines the structure of the data to extract.
    # You can also pass a Pydantic BaseModel instead of a JSON Schema dictionary.
    books = await extract_structured_data(
        source=page,
        strategy="HTML",
        model="claude-3-7-sonnet-latest",
        data_schema={
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "price": {"type": "string"},
                            "availability": {"type": "string"}
                        }
                    }
                }
            }
        },
        prompt="Extract all book listings",
        enable_cache=False,  # Do not cache here: extract fresh data on every call.
    )
    for book in books['products']:
        print(f"{book['title']}: {book['price']}")
```
## Arguments
* `source`: Playwright Page object to extract data from the entire page, or a Locator object to extract data from a specific element.
* `data_schema`: Schema defining the structure of the data to extract. Can be either a Pydantic BaseModel class or a JSON Schema dictionary.
* `prompt`: Optional prompt to guide the extraction process and provide more context. Defaults to None.
* `strategy`: Type of extraction strategy:
  * **"HTML"** (default) - Best for text-heavy pages with structured content
  * **"IMAGE"** - Best for visual content, charts, or complex layouts
  * **"MARKDOWN"** - Best for article-style content with semantic structure
* `enable_dom_matching`: Whether to enable DOM element matching during extraction. When enabled, all types in the schema must be strings so they can be matched with DOM elements. Extraction results are mapped to their corresponding DOM elements and returned with the matched results. These results are cached so that subsequent extractions with minor DOM changes can reuse the cached data. Defaults to False.
* `enable_cache`: Whether to enable caching of extraction results. Defaults to True.
* `max_retires`: Maximum number of retry attempts on failures. Failures can be validation errors, API errors, output errors, etc. Defaults to 3.
* `model`: AI model to use for extraction. Defaults to "claude-haiku-4-5-20251001".
* `api_key`: Optional API key for AI extraction (if provided, usage will not be billed to your account). Defaults to None.
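Because DOM matching requires every type in the schema to be a string, it can help to check a JSON Schema dictionary before enabling it. The helper below, `uses_only_string_leaves`, is hypothetical and not part of the SDK. It is a minimal sketch that walks `properties` and `items`, treating `object` and `array` as containers.

```python
from typing import Any


def uses_only_string_leaves(schema: dict[str, Any]) -> bool:
    """Hypothetical check: True if every leaf type in the schema is "string"."""
    schema_type = schema.get("type")
    if schema_type == "object":
        return all(uses_only_string_leaves(sub) for sub in schema.get("properties", {}).values())
    if schema_type == "array":
        return uses_only_string_leaves(schema.get("items", {}))
    return schema_type == "string"


book_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "in_stock": {"type": "string"}},
}
person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "number"}},
}

print(uses_only_string_leaves(book_schema))    # True
print(uses_only_string_leaves(person_schema))  # False
```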
```python theme={null}
async def extract_structured_data(
    *,
    content: list[ContentItem] | ContentItem,
    data_schema: type[BaseModel] | dict[str, Any],
    prompt: str | None,
    max_retires: int | None,
    enable_cache: bool | None,
    model: str,
    api_key: str | None,
) -> Any
```
Extract data from text, image buffers, or image URLs without requiring a page source.
## Features and limitations
**Features:**
* **Smart Caching:** Hashes content and uses [KV Cache](https://docs.intunedhq.com/docs/01-learn/recipes/kv-cache) for persistent storage
* **Multiple Content Items:** Combine text, images (buffer or URL) for comprehensive extraction
* **Flexible Models:** Use any up-to-date model from Anthropic, OpenAI, or Google based on your needs.
**Limitations:**
* **Model Variability:** Quality varies by model - experiment to find the best fit
* **Schema Design:** Complex schemas may reduce accuracy
* **Content Quality:** Requires meaningful, contextual content for accurate extraction - sparse or ambiguous content produces poor results
## Examples
```python Basic Text Content Extraction theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import extract_structured_data, TextContentItem
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # Extract the person information from the text using the gpt-4o model.
    text_content: TextContentItem = {
        "type": "text",
        "data": "John Doe, age 30, works as a Software Engineer at Tech Corp"
    }
    person = await extract_structured_data(
        content=text_content,
        model="gpt-4o",
        data_schema={
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "number"},
                "occupation": {"type": "string"},
                "company": {"type": "string"}
            },
            "required": ["name"]
        },
        prompt="Extract person information from the text"
    )
    print(f"Found person: {person['name']}, {person['age']} years old")
```
```python List Extraction from Text Content theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import extract_structured_data, TextContentItem
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    text_content: TextContentItem = {
        "type": "text",
        "data": "iPhone 15 - $999, Samsung Galaxy - $899, Pixel 8 - $699"
    }
    products = await extract_structured_data(
        content=text_content,
        model="gpt-4o",
        data_schema={
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"}
                        }
                    }
                }
            }
        },
        prompt="Extract all products"
    )
    for product in products['products']:
        print(f"{product['name']}: {product['price']}")
```
## Arguments
* `content`: Content to extract data from. Can be a single content item or a list of [ContentItem](../type-references/ContentItem).
* `data_schema`: Schema defining the expected structure of the extracted data. Can be either a Pydantic BaseModel class or a JSON Schema dictionary.
* `prompt`: Optional prompt to guide the extraction process and provide more context. Defaults to None.
* `max_retires`: Maximum number of retry attempts on failures. Failures can be validation errors, API errors, output errors, etc. Defaults to 3.
* `enable_cache`: Whether to enable caching of the extracted data. Defaults to True.
* `model`: AI model to use for extraction. Defaults to "claude-haiku-4-5-20251001".
* `api_key`: Optional API key for AI extraction (if provided, usage will not be billed to your account). Defaults to None.
## Returns: `Any`
The extracted structured data conforming to the provided schema.
# is_page_loaded
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/ai/functions/is_page_loaded
This function uses AI and incurs costs.
Uses AI vision to determine if a webpage has finished loading by analyzing a screenshot.
Detects loading spinners, blank content, or incomplete page states.
```python theme={null}
async def is_page_loaded(
    page: Page,
    *,
    model: str,
    timeout_s: int,
    api_key: str | None,
) -> bool
```
## Examples
```python Check Page Loading theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import is_page_loaded
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # Wait for the page to finish loading
    await page.goto('https://sandbox.intuned.dev/')
    page_loaded = await is_page_loaded(page)
    if page_loaded:
        # Continue with scraping or interactions
        print("Page is loaded")
    else:
        # Wait longer or retry
        await page.wait_for_timeout(5000)
```
```python Loading Loop theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser.ai import is_page_loaded
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # Keep checking until the page loads
    await page.goto("https://example.com")
    attempts = 0
    while attempts < 10:  # Retry up to 10 times with a 2-second delay between attempts.
        page_loaded = await is_page_loaded(
            page,
            model="claude-3-7-sonnet-latest",
            timeout_s=5
        )
        if page_loaded:
            break  # The page is loaded, so stop checking.
        await page.wait_for_timeout(2000)  # Wait 2 seconds before the next attempt.
        attempts += 1
```
## Arguments
* `page`: The Playwright page to check.
* `timeout_s`: Screenshot timeout in seconds. Defaults to 10.
* `model`: AI model to use for the check. Defaults to "claude-haiku-4-5-20251001".
* `api_key`: Optional API key for the AI service (if provided, usage will not be billed to your account). Defaults to None.
## Returns: `bool`
True if page is loaded, False if still loading
# ContentItem
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/ai/type-references/ContentItem
A union type representing content items for AI data extraction from various content types.
This type alias defines the complete set of content types supported by the content-based
`extract_structured_data` function for extracting data from text, image buffers, or image URLs
without requiring a page source.
Type variants:
* `TextContentItem`: [TextContentItem](../type-references/TextContentItem) for text data extraction
* `ImageBufferContentItem`: [ImageBufferContentItem](../type-references/ImageBufferContentItem) for image data stored as bytes buffer
* `ImageUrlContentItem`: [ImageUrlContentItem](../type-references/ImageUrlContentItem) for image data accessible via URL
```python theme={null}
type ContentItem = TextContentItem | ImageBufferContentItem | ImageUrlContentItem
```
## Examples
```python Text Content theme={null}
from intuned_browser.ai import TextContentItem
async def automation(page, params, **_kwargs):
    text_content: TextContentItem = {
        "type": "text",
        "data": "John Doe, age 30, works as a Software Engineer at Tech Corp"
    }
```
```python Image Buffer Content theme={null}
from intuned_browser.ai import ImageBufferContentItem
async def automation(page, params, **_kwargs):
    # Assuming you have image data as bytes
    with open("image.png", "rb") as f:
        image_data = f.read()
    image_content: ImageBufferContentItem = {
        "type": "image-buffer",
        "image_type": "png",
        "data": image_data
    }
```
```python Image URL Content theme={null}
from intuned_browser.ai import ImageUrlContentItem
async def automation(page, params, **_kwargs):
    image_content: ImageUrlContentItem = {
        "type": "image-url",
        "image_type": "jpeg",
        "data": "https://example.com/image.jpg"
    }
```
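The three variants share a `type` discriminator and a `data` field, as the examples above show. Below is a minimal sketch of checking a plain dictionary against those shapes before passing it on. The `check_content_item` helper and `REQUIRED_KEYS` table are hypothetical, not SDK features, and the key names are taken from the examples above.

```python
from typing import Any

# Keys each variant is assumed to need, based on the examples in this section.
REQUIRED_KEYS = {
    "text": {"type", "data"},
    "image-buffer": {"type", "image_type", "data"},
    "image-url": {"type", "image_type", "data"},
}


def check_content_item(item: dict[str, Any]) -> bool:
    """Hypothetical check: does the dict look like one of the three content items?"""
    required = REQUIRED_KEYS.get(item.get("type"))
    if required is None:
        return False  # unknown or missing "type"
    if not required.issubset(item):
        return False  # a required key is missing
    if item["type"] == "image-buffer":
        return isinstance(item["data"], (bytes, bytearray))
    return isinstance(item["data"], str)


print(check_content_item({"type": "text", "data": "John Doe, age 30"}))                   # True
print(check_content_item({"type": "image-url", "data": "https://example.com/image.jpg"}))  # False
```

The second call returns False because `image_type` is missing.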
# ImageBufferContentItem
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/ai/type-references/ImageBufferContentItem
Image buffer content item for content-based extraction.
```python theme={null}
class ImageBufferContentItem(dict)
```
## Properties
* `type`: The type of the content item, which is always "image-buffer".
* `image_type`: The image format (e.g., "png", "jpeg", "gif", "webp").
* `data`: The buffer containing the raw image data.
# ImageUrlContentItem
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/ai/type-references/ImageUrlContentItem
Image URL content item for content-based extraction.
```python theme={null}
class ImageUrlContentItem(dict)
```
## Properties
* `type`: The type of the content item, which is always "image-url".
* `image_type`: The image format (e.g., "png", "jpeg", "gif", "webp").
* `data`: The URL of the image.
# TextContentItem
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/ai/type-references/TextContentItem
Text content item for content-based extraction.
```python theme={null}
class TextContentItem(dict)
```
## Properties
* `type`: The type of the content item, which is always "text".
* `data`: The text data to extract from.
# click_until_exhausted
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/click_until_exhausted
Repeatedly clicks a button until no new content appears or the maximum number of clicks is reached.
This function is useful for "Load More" buttons or paginated content where you need to
keep clicking until all content is loaded. It provides several stopping conditions:
* Button becomes invisible/disabled
* Maximum number of clicks reached
* No change detected in container content (when `container_locator` is provided)
```python theme={null}
async def click_until_exhausted(
    page: Page,
    button_locator: Locator,
    heartbeat: Callable[[], None],
    *,
    container_locator: Locator | None,
    max_clicks: int,
    click_delay: float,
    no_change_threshold: int,
)
```
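The stopping conditions listed above can be sketched as a plain loop. This is not the SDK's implementation: `FakeButton` and `click_loop` are hypothetical stand-ins for a Playwright button and the real function. The sketch assumes that "no change" means the item count grew by no more than `no_change_threshold` after a click.

```python
from dataclasses import dataclass, field


@dataclass
class FakeButton:
    """Hypothetical stand-in for a "Load More" button backed by a list of items."""
    remaining_pages: int
    items: list = field(default_factory=list)

    def is_visible(self) -> bool:
        return self.remaining_pages > 0

    def click(self) -> None:
        self.remaining_pages -= 1
        self.items.extend(["item"] * 3)  # each click loads three more items


def click_loop(button: FakeButton, max_clicks: int = 50, no_change_threshold: int = 0) -> int:
    """Click until the button is gone, max_clicks is reached, or content stops growing."""
    clicks = 0
    while clicks < max_clicks and button.is_visible():
        before = len(button.items)
        button.click()
        clicks += 1
        if len(button.items) - before <= no_change_threshold:
            break  # the container did not grow enough; treat content as exhausted
    return clicks


print(click_loop(FakeButton(remaining_pages=4)))                  # 4
print(click_loop(FakeButton(remaining_pages=100), max_clicks=5))  # 5
```

The first call stops because the button disappears, the second because `max_clicks` is reached.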
## Examples
```python Load All Items theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import click_until_exhausted
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://sandbox.intuned.dev/load-more")
    load_more_button = page.locator("main main button")  # Select the main button in the main content area.
    # Click until the button disappears or is disabled
    await click_until_exhausted(
        page=page,
        button_locator=load_more_button,
        max_clicks=20
    )
    # Keeps clicking until the button disappears, is disabled, or max_clicks is reached.
```
```python Track Container Changes theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import click_until_exhausted
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://sandbox.intuned.dev/load-more")
    load_more_button = page.locator("aside button")  # Select the button in the sidebar.
    # Watch the sidebar container to detect changes.
    container = page.locator('xpath=//*[@id="root"]/div[1]/main/slot/div/aside/div/div/slot/slot')
    # The function counts the elements under the container before and after each click
    # and stops if the count is unchanged.
    click_count = 0

    def heartbeat_callback():
        nonlocal click_count
        click_count += 1
        print(f"Clicked {click_count} times")

    await click_until_exhausted(
        page=page,
        button_locator=load_more_button,
        container_locator=container,
        heartbeat=heartbeat_callback,
        max_clicks=30,
        click_delay=0.5,
        no_change_threshold=0
    )
    # Keeps clicking until the button disappears, is disabled, max_clicks is reached,
    # or no more content is loaded.
```
## Arguments
* `page`: Playwright Page object.
* `button_locator`: Locator for the button to click repeatedly.
* `heartbeat`: Optional callback invoked after each click. Defaults to `lambda: None`.
* `container_locator`: Optional content container to detect changes. Defaults to None.
* `max_clicks`: Maximum number of times to click the button. Defaults to 50.
* `click_delay`: Delay after each click (in seconds). Defaults to 0.5.
* `no_change_threshold`: Minimum change in content size to continue clicking. Defaults to 0.
## Returns: `None`
Function completes when clicking is exhausted
# download_file
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/download_file
Downloads a file from a web page using various trigger methods. This function provides three flexible ways to initiate file downloads:
* **URL**: Creates a new page, navigates to the URL, waits for download, then automatically closes the page. Ideal for direct download links.
* **Locator**: Uses the current page to click the element and capture the resulting download. Perfect for download buttons or interactive elements.
* **Callback**: Executes the provided function with the page object and captures the first triggered download. Offers maximum flexibility for complex download scenarios.
```python theme={null}
async def download_file(
    page: Page,
    trigger: Trigger,
    *,
    timeout_s: int,
) -> Download
```
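The three trigger kinds described above can be told apart by their Python type. The sketch below is a hypothetical illustration of that dispatch, not the SDK's code: `describe_trigger` and `FakeLocator` are made-up names. It assumes a URL trigger is a `str`, a callback is any callable, and anything else is treated as a Locator.

```python
from typing import Any


class FakeLocator:
    """Hypothetical stand-in for a Playwright Locator (neither a str nor callable)."""


def describe_trigger(trigger: Any) -> str:
    """Return which download strategy a trigger value would select in this sketch."""
    if isinstance(trigger, str):
        return "url: open a new page, navigate, wait for the download, close the page"
    if callable(trigger):
        return "callback: run the function with the page and capture the first download"
    return "locator: click the element on the current page and capture the download"


print(describe_trigger("https://example.com/report.pdf").split(":")[0])  # url
print(describe_trigger(lambda page: None).split(":")[0])                 # callback
print(describe_trigger(FakeLocator()).split(":")[0])                     # locator
```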
## Examples
```python Download from direct URL theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import download_file
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # Download from a direct URL: this opens the URL and downloads its content automatically.
    download = await download_file(
        page,
        trigger="https://intuned-docs-public-images.s3.amazonaws.com/32UP83A_ENG_US.pdf"
    )
    file_name = download.suggested_filename
    return file_name
```
```python Locator Trigger theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import download_file
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://sandbox.intuned.dev/pdfs")
    download = await download_file(
        page,
        trigger=page.locator("xpath=//tbody/tr[1]//*[name()='svg']")
    )
    file_name = download.suggested_filename
    return file_name
```
```python Callback Trigger theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import download_file
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://sandbox.intuned.dev/pdfs")
    download = await download_file(
        page,
        trigger=lambda page: page.locator("xpath=//tbody/tr[1]//*[name()='svg']").click()
    )
    file_name = download.suggested_filename
    return file_name
```
## Arguments
* `page`: The Playwright Page object to use for the download.
* `trigger`: The [Trigger](../type-references/Trigger) method to initiate the download.
* `timeout_s`: Maximum time in seconds to wait for the download to start. Defaults to 5.
## Returns: `Download`
The [Playwright Download object](https://playwright.dev/python/docs/api/class-download) representing the downloaded file.
# extract_markdown
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/extract_markdown
Converts HTML content from a Playwright Page or Locator to semantic markdown format.
```python theme={null}
async def extract_markdown(
    source: Page | Locator,
) -> str
```
## Examples
```python Extract Markdown from Locator theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import extract_markdown
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://books.toscrape.com/")
    header_locator = page.locator('h1').first  # First title on the page
    markdown = await extract_markdown(header_locator)  # Extract markdown from the first title
    print(markdown)
    return markdown
```
```python Extract Markdown from Page theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import extract_markdown
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://sandbox.intuned.dev/pdfs")
    markdown = await extract_markdown(page)
    print(markdown)
    return markdown
```
## Arguments
* `source`: The source of the HTML content. When a Page is provided, extracts from the entire page. When a Locator is provided, extracts from that specific element.
## Returns: `str`
The markdown representation of the HTML content
# filter_empty_values
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/filter_empty_values
Recursively filters out empty values from nested objects and arrays.
This function removes the following empty values:
* `None` values
* Empty strings (after trimming whitespace)
* Empty lists
* Empty dictionaries
* Lists and dictionaries that become empty after filtering their contents
```python theme={null}
def filter_empty_values(
    data: T,
) -> T
```
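The filtering rules listed above can be expressed as a short recursive function. The following is a sketch under the assumption that those rules are complete. `filter_empty` and `is_empty` are hypothetical names, and this is not the SDK's source.

```python
from typing import Any


def is_empty(value: Any) -> bool:
    """None, whitespace-only strings, and empty lists/dicts count as empty."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def filter_empty(data: Any) -> Any:
    """Recursively drop empty values, including containers that become empty."""
    if isinstance(data, dict):
        filtered = {key: filter_empty(value) for key, value in data.items()}
        return {key: value for key, value in filtered.items() if not is_empty(value)}
    if isinstance(data, list):
        filtered_items = [filter_empty(item) for item in data]
        return [item for item in filtered_items if not is_empty(item)]
    return data  # scalars such as 0 or False are kept as-is


print(filter_empty({"a": "", "b": "hello", "c": None}))             # {'b': 'hello'}
print(filter_empty([1, "", None, [2, ""]]))                         # [1, [2]]
print(filter_empty({"users": [{"name": ""}, {"name": "John"}]}))    # {'users': [{'name': 'John'}]}
```

Children are filtered before the emptiness check, so `{"name": ""}` first becomes `{}` and is then removed from its parent list.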
## Examples
```python Basic Usage theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import filter_empty_values
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # Filter empty values from a dictionary
    result1 = filter_empty_values({"a": "", "b": "hello", "c": None})
    # Output: {"b": "hello"}
    print(result1)
    # Filter empty values from a list
    result2 = filter_empty_values([1, "", None, [2, ""]])
    # Output: [1, [2]]
    print(result2)
    # Filter nested structures
    result3 = filter_empty_values({"users": [{"name": ""}, {"name": "John"}]})
    # Output: {"users": [{"name": "John"}]}
    print(result3)
    return "All data filtered successfully"
```
## Arguments
* `data`: The data structure to filter (dict, list, or any other type).
## Returns: `T`
Filtered data structure with empty values removed
# go_to_url
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/go_to_url
This function has multiple overloads.
Navigates to a specified URL with enhanced reliability features including automatic retries with exponential backoff,
intelligent timeout handling, and optional AI-powered loading verification.
This function handles common navigation challenges by automatically retrying failed requests, detecting navigation hangs, and ensuring the page reaches a truly idle state.
```python theme={null}
async def go_to_url(
    page: Page,
    url: str,
    *,
    timeout_s: int,
    retries: int,
    wait_for_load_state: str,
    throw_on_timeout: bool,
    wait_for_load_using_ai: Literal[False],
) -> None
```
Use this overload for standard navigation without AI-powered loading detection.
## Examples
```python Without options theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import go_to_url
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await go_to_url(
        page,
        url='https://sandbox.intuned.dev/'
    )
    # At this point, go_to_url has waited for the page to load and for network requests to settle.
```
```python With options theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import go_to_url
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await go_to_url(
        page,
        url='https://intunedhq.com',
        # Faster than the "load" state. The function automatically waits for the page to settle.
        wait_for_load_state="domcontentloaded",
        throw_on_timeout=True,
        timeout_s=10,
        retries=3
    )
    # At this point, DOM content is loaded and go_to_url has waited for network requests to settle.
```
## Arguments
* `page`: The Playwright Page object to navigate.
* `url`: The URL to navigate to.
* `timeout_s`: Maximum navigation time in seconds. Defaults to 30.
* `retries`: Number of retry attempts with exponential backoff (factor: 2). Defaults to 3.
* `wait_for_load_state`: When to consider navigation succeeded. Defaults to "load".
* `throw_on_timeout`: Whether to raise an error on navigation timeout. When False, the function returns without throwing, allowing continued execution. Defaults to False.
* `wait_for_load_using_ai`: Set to False to disable AI-powered loading checks. Defaults to False.
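The retry behavior described above (exponential backoff with a factor of 2) can be sketched in plain `asyncio`. This is an illustration only, not the SDK's implementation. `retry_with_backoff`, `flaky_navigation`, and the tiny `base_delay_s` are hypothetical, and the sketch assumes `retries` counts total attempts.

```python
import asyncio
from typing import Awaitable, Callable


async def retry_with_backoff(
    action: Callable[[], Awaitable[None]],
    retries: int = 3,
    base_delay_s: float = 0.01,  # hypothetical starting delay, kept tiny for the demo
    factor: float = 2.0,
) -> int:
    """Run `action` until it succeeds or `retries` attempts are used; return attempts made."""
    delay = base_delay_s
    for attempt in range(1, retries + 1):
        try:
            await action()
            return attempt
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)
            delay *= factor  # 0.01s, 0.02s, 0.04s, ...
    raise ValueError("retries must be at least 1")


calls = 0


async def flaky_navigation() -> None:
    """Hypothetical navigation that fails twice before succeeding."""
    global calls
    calls += 1
    if calls < 3:
        raise TimeoutError("navigation timed out")


attempts_used = asyncio.run(retry_with_backoff(flaky_navigation))
print(attempts_used)  # 3
```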
```python theme={null}
async def go_to_url(
    page: Page,
    url: str,
    *,
    timeout_s: int,
    retries: int,
    wait_for_load_state: str,
    throw_on_timeout: bool,
    wait_for_load_using_ai: Literal[True],
    model: str | None,
    api_key: str | None,
) -> None
```
Use this overload when you need AI vision to verify the page is fully loaded by checking for loading spinners, blank content, or incomplete states.
## Examples
```python With AI Loading Detection theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import go_to_url
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await go_to_url(
        page,
        url='https://intunedhq.com',
        wait_for_load_using_ai=True,
        model="gpt-4o"
    )
    # The page is loaded and ready to use.
    # If the AI check fails, the method won't throw even if throw_on_timeout is True.
    # It only throws if the page times out reaching the default load state and throw_on_timeout is True.
```
## Arguments
* `page`: The Playwright Page object to navigate.
* `url`: The URL to navigate to.
* `timeout_s`: Maximum navigation time in seconds. Defaults to 30.
* `retries`: Number of retry attempts with exponential backoff (factor: 2). Defaults to 3.
* `wait_for_load_state`: When to consider navigation succeeded. Defaults to "load".
* `throw_on_timeout`: Whether to raise an error on navigation timeout. When False, the function returns without throwing, allowing continued execution. Defaults to False.
* `wait_for_load_using_ai`: Must be set to True to use this AI-powered overload. When True, uses AI vision to verify the page is fully loaded by checking for loading spinners, blank content, or incomplete states. Retries up to 3 times with 5-second delays. See [`is_page_loaded`](../../ai/functions/is_page_loaded) for more details on the AI loading verification.
* `model`: AI model to use for loading verification. Defaults to "gpt-5-mini-2025-08-07".
* `api_key`: Optional API key for the AI check. Defaults to None.
## Returns: `None`
Function completes when navigation is finished or fails after retries.
# process_date
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/process_date
Parses various date string formats into datetime objects, returning only the date part with time set to midnight.
This utility function provides robust date parsing capabilities for a wide range of common formats.
```python theme={null}
def process_date(
    date_string: str,
) -> datetime | None
```
## Key features
* Returns only the date part (year, month, day)
* Time is always set to 00:00:00
* Supports multiple international formats
* Handles timezones and AM/PM formats
## Supported formats
The function handles these date format categories:
### Standard date formats
* `DD/MM/YYYY`: "22/11/2024", "13/12/2024"
* `MM/DD/YYYY`: "01/17/2025", "10/25/2024"
* Single-digit variants: "8/16/2019", "9/28/2024"
### Date-time combinations
* With 24-hour time: "22/11/2024 21:19:05"
* With AM/PM: "12/09/2024 9:00 AM"
* With dash separator: "12/19/2024 - 2:00 PM"
### Timezone support
* With timezone abbreviations: "10/23/2024 12:06 PM CST"
* With timezone offset: "01/17/2025 3:00:00 PM CT"
### Text month formats
* Short month: "5 Dec 2024", "11 Sep 2024"
* With time: "5 Dec 2024 8:00 AM PST"
* Full month: "November 14, 2024", "January 31, 2025, 5:00 pm"
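To illustrate how a few of the formats above could map to a date at midnight, here is a minimal sketch built on `datetime.strptime`. It is not how `process_date` is implemented. `parse_date_only` and `CANDIDATE_FORMATS` are hypothetical, and only a small date-only subset of the listed formats is covered.

```python
from datetime import datetime
from typing import Optional

# A small, hypothetical subset of date-only formats; the real function accepts many more.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]


def parse_date_only(date_string: str) -> Optional[datetime]:
    """Try each format in order; return midnight of the parsed day, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(date_string.strip(), fmt)
        except ValueError:
            continue  # this format did not match; try the next one
        return parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return None


print(parse_date_only("22/11/2024"))         # 2024-11-22 00:00:00
print(parse_date_only("01/17/2025"))         # 2025-01-17 00:00:00
print(parse_date_only("5 Dec 2024"))         # 2024-12-05 00:00:00
print(parse_date_only("November 14, 2024"))  # 2024-11-14 00:00:00
print(parse_date_only("invalid date"))       # None
```

In this sketch the order of `CANDIDATE_FORMATS` decides ambiguous inputs such as "12/09/2024", which it reads as day/month. How the SDK resolves such cases is not stated here.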
## Examples
```python Basic Usage theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import process_date
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # Basic date string
    date1 = process_date("22/11/2024")
    print(date1)  # 2024-11-22 00:00:00
    # Date with time (time is ignored)
    date2 = process_date("5 Dec 2024 8:00 AM PST")
    print(date2)  # 2024-12-05 00:00:00
```
```python Invalid Date theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import process_date
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # An invalid date returns None
    invalid_date = process_date("invalid date")
    print(invalid_date)  # Prints None.
    if invalid_date is None:
        raise Exception("Invalid date")
    return "should not reach here"
```
## Arguments
* `date_string`: A string containing a date in one of the supported formats.
## Returns: `datetime | None`
Returns a `datetime` object with only date components preserved (year, month, day), time always set to 00:00:00, or `None` if parsing fails
# resolve_url
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/resolve_url
This function has multiple overloads.
Converts any URL source to an absolute, properly encoded URL.
```python theme={null}
async def resolve_url(
    *,
    url: str,
    base_url: str,
) -> str
```
Combines a relative URL with a base URL string. Use when you have an explicit base URL string to resolve relative paths against.
## Examples
```python Resolve from Base URL String theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import resolve_url
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    # Resolve from base URL string
    absolute_url = await resolve_url(
        url="/lists/table",
        base_url="https://sandbox.intuned.dev"
    )
    # Returns: "https://sandbox.intuned.dev/lists/table"
    print(absolute_url)
    return absolute_url
```
## Arguments
* `url`: The relative or absolute URL to resolve.
* `base_url`: Base URL string to resolve relative URLs against.
```python theme={null}
async def resolve_url(
    *,
    url: str,
    page: Page,
) -> str
```
Uses the current page's URL as the base URL. Use when resolving URLs relative to the current page.
## Examples
```python Resolve from the Current Page's URL theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import resolve_url
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://sandbox.intuned.dev/")
    # Resolve from the current page's URL
    absolute_url = await resolve_url(
        url="/lists/table",
        page=page
    )
    # Returns: "https://sandbox.intuned.dev/lists/table"
    print(absolute_url)
    return absolute_url
```
## Arguments
* `url`: The relative or absolute URL to resolve.
* `page`: Playwright Page object to extract the base URL from. The current page URL will be used as the base URL.
```python theme={null}
async def resolve_url(
    *,
    url: Locator,
) -> str
```
Extracts the href attribute from a Playwright Locator pointing to an anchor element. Use when extracting and resolving URLs from anchor (`<a>`) elements.
## Examples
```python Resolve from Anchor Element theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import resolve_url
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://sandbox.intuned.dev/")
    # Resolve from an anchor element
    absolute_url = await resolve_url(
        url=page.locator("xpath=//a[normalize-space()='Steps Form']"),
    )
    # Returns: "https://sandbox.intuned.dev/steps-form"
    print(absolute_url)
    return absolute_url
```
## Arguments
* `url`: Playwright Locator pointing to an anchor element. The href attribute will be extracted and resolved relative to the current page.
## Returns: `str`
The absolute, properly encoded URL string
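For intuition, resolving a relative URL against a base and percent-encoding the path can be approximated with the standard library. This is a sketch of the general technique using `urllib.parse`, not the SDK's implementation. The `resolve` helper is hypothetical and only encodes the path component.

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


def resolve(url: str, base_url: str) -> str:
    """Hypothetical helper: join `url` onto `base_url`, then percent-encode the path."""
    absolute = urljoin(base_url, url)
    parts = urlsplit(absolute)
    # Keep "/" and existing "%" escapes intact; encode spaces and other unsafe characters.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment))


print(resolve("/lists/table", "https://sandbox.intuned.dev"))
# https://sandbox.intuned.dev/lists/table
print(resolve("/my file.pdf", "https://example.com/docs/"))
# https://example.com/my%20file.pdf
```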
# sanitize_html
Source: https://docs.intunedhq.com/automation-sdks/intuned-sdk/python/helpers/functions/sanitize_html
Sanitizes and cleans HTML content by removing unwanted elements, attributes, and whitespace.
Provides fine-grained control over each cleaning operation through configurable options.
```python theme={null}
def sanitize_html(
html: str,
*,
remove_scripts: bool,
remove_styles: bool,
remove_svgs: bool,
remove_comments: bool,
remove_long_attributes: bool,
max_attribute_length: int,
preserve_attributes: list[str] | None,
remove_empty_tags: bool,
preserve_empty_tags: list[str] | None,
minify_whitespace: bool,
) -> str
```
## Examples
```python Basic Sanitization theme={null}
from typing import TypedDict
from playwright.async_api import Page
from intuned_browser import sanitize_html
class Params(TypedDict):
    pass


async def automation(page: Page, params: Params, **_kwargs):
    await page.goto("https://books.toscrape.com")
    first_row = page.locator("ol.row").locator("li").first
    # Get the HTML of the first row.
    html = await first_row.inner_html()
    # Sanitize the HTML.
    sanitized_html = sanitize_html(html)
    # Log the sanitized HTML.
    print(sanitized_html)
    # Return the sanitized HTML.
    return sanitized_html
```
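To give a rough feel for what options such as `remove_scripts`, `remove_styles`, `remove_comments`, and `minify_whitespace` do, the sketch below applies comparable steps with regular expressions. It is a crude illustration rather than the SDK's algorithm. `sanitize_sketch` is a hypothetical name, and regex-based cleaning is not robust for arbitrary HTML.

```python
import re


def sanitize_sketch(html: str) -> str:
    """Rough approximation: drop scripts, styles, and comments, then collapse whitespace."""
    html = re.sub(r"<script\b[^>]*>.*?</script\s*>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<style\b[^>]*>.*?</style\s*>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return re.sub(r"\s+", " ", html).strip()


raw = """
<div>
  <!-- tracking -->
  <script>console.log("hi")</script>
  <style>p { color: red; }</style>
  <p>In   stock</p>
</div>
"""
print(sanitize_sketch(raw))  # <div> <p>In stock</p> </div>
```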
## Arguments
* `html`: The HTML content to sanitize.
Remove all `