Data extraction is a fundamental task in browser automation and web scraping. In some cases, data also lives in files of different formats.

Traditionally, data extraction is unreliable, requiring custom code to parse, clean, and transform data into a usable format. The process is labor-intensive, error-prone, and time-consuming.

At Intuned, we use LLMs to make data extraction easy and reliable. We offer a suite of utilities that simplify extracting data from both websites and files.

This section focuses on these capabilities. The following is a summary of the available utilities:

Intuned Automation Projects

The Intuned SDK includes several helper methods designed for data extraction, available under the following namespaces:

@intuned/sdk/ai-extractors

  • Web Data Extraction: Utilities to extract structured data from webpages. These methods incur a cost that depends on the webpage size and the requested data schema.

  • File Data Extraction: Utilities for extracting structured data from files. These methods incur a cost that varies with the number of pages, the file contents, and the requested data schema.

  • Web Markdown Conversion: Convert webpages to markdown.

  • File to Markdown Conversion: Convert files to markdown. This uses our file processing pipeline, with costs based on the number of file pages processed.

  • Table Extraction from Files: Extract tables from files. This uses our file processing pipeline, with costs based on the number of file pages processed.

@intuned/sdk/optimized-extractors

Utilities for optimized data extraction from webpages, focusing on cost-efficiency. These utilities aim to minimize reliance on LLMs. In exchange, they support a limited set of schemas and are restricted in the type of data they can extract. Further details on these utilities are discussed here.

@intuned/sdk/playwright

Static extraction utilities that pull data from webpages using selectors. These utilities require manual configuration of selectors and incur no cost when used. Check out extractArrayFromPageUsingSelectors and extractObjectFromPageUsingSelectors for more info.

playwright

Playwright can be used directly to interact with webpages and extract data.
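As a rough illustration of extracting data with Playwright directly, the sketch below reads a list of product cards from a page. The `.product`, `.name`, and `.price` selectors, the `Product` shape, and the `extractProducts` and `parsePrice` helpers are all hypothetical and not part of the Intuned SDK. The function is typed against a small structural interface (`PageLike`, `LocatorLike`) that covers only the Playwright calls it uses (`locator`, `all`, `allTextContents`), so a real Playwright `Page` should be accepted in its place.

```typescript
// Minimal shapes covering only the Playwright methods used below.
// Assumption: Playwright's real `Page` and `Locator` satisfy these shapes.
interface LocatorLike {
  all(): Promise<LocatorLike[]>;
  locator(selector: string): LocatorLike;
  allTextContents(): Promise<string[]>;
}

interface PageLike {
  locator(selector: string): LocatorLike;
}

// Hypothetical record shape for this example.
interface Product {
  name: string;
  price: number | null;
}

// Turn text such as "$1,299.50" into a number; return null when no price is present.
function parsePrice(text: string | undefined): number | null {
  if (text === undefined) return null;
  const cleaned = text.replace(/[^0-9.]/g, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return Number.isNaN(value) ? null : value;
}

// Read one record per ".product" card (hypothetical selectors).
// `all()` does not wait for elements, so the page should already be loaded.
async function extractProducts(page: PageLike): Promise<Product[]> {
  const cards = await page.locator(".product").all();
  const products: Product[] = [];
  for (const card of cards) {
    // allTextContents() returns an empty array when nothing matches,
    // so a card without a price does not block or throw.
    const [name] = await card.locator(".name").allTextContents();
    const [price] = await card.locator(".price").allTextContents();
    if (name === undefined) continue; // skip cards that have no name element
    products.push({ name: name.trim(), price: parsePrice(price) });
  }
  return products;
}
```

In a real script, `extractProducts` would be called with a Playwright `page` after `await page.goto(url)` has completed. Selector-based code like this costs nothing per call, but it has to be updated by hand whenever the page markup changes.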

Standalone file APIs

In addition to the @intuned/sdk utilities, we offer standalone APIs for file data extraction. These APIs can be used without creating projects or writing any browser automation logic. Costs vary based on the operation used and the file size. More details are available here.