Extracting data from webpages is a core operation when writing scrapers. Normally, this involves writing custom code to parse the HTML and extract data from it, a process that is error-prone and time-consuming.

At Intuned, we make data extraction easy and reliable by leveraging LLMs. Our utilities, available in @intuned/sdk, let you extract data by providing a schema that describes the desired output.

AI extractors

extractStructuredDataFromPage and extractStructuredDataFromLocator extract structured data from a full page or from a specific section of a page (using a locator). These methods retrieve the content, pass it to the LLM for extraction, and incur costs based on the input size, schema complexity, and selected strategy.

For the complete function reference, see extractStructuredDataFromPage and extractStructuredDataFromLocator.

Optimized extractors

Optimized extractors deliver the reliability and convenience of AI extractors in a cost-optimized manner. They do this by using AI extraction only in limited scenarios and by creating and reusing selectors otherwise (more on this below).

There are four optimized extractor methods:

  • extractArrayFromPage
  • extractArrayFromLocator
  • extractObjectFromPage
  • extractObjectFromLocator

These methods can be called from your Intuned project APIs.

How do they work? How do they save cost?

Optimized extractors operate in two modes: AI Extraction and Static Extraction.

  • AI extraction: In this mode, the extractor leverages LLMs to extract data directly from the webpage. This is the initial mode used when the extractor is first invoked.

  • Static extraction: After collecting a sufficient number of examples via AI Extraction, the Intuned platform runs background workflows to automatically generate selectors. Once the selectors are correctly generated, the optimized extractors switch to Static Extraction mode, using these cached/auto-generated selectors to extract data from the page. This saves cost by avoiding the need for LLM calls on every extraction.

The platform also handles cases where the static extractors return invalid data or fail to return the expected data. This is taken as a signal that the static extractors may have become invalid or that the page structure has changed. In such cases, the extractor automatically falls back to AI Extraction mode. After collecting new examples via AI Extraction, the platform recreates the static extractors and returns to the optimized state.

This entire process is managed by the Intuned platform. You only need to provide the extractor parameters, and the platform optimizes the extraction process to save costs while maintaining accuracy.
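The switch between the two modes can be pictured with a small local simulation. This is only an illustrative sketch, not the platform's implementation: `extract`, `ExtractorState`, `Backends`, and the `EXAMPLES_NEEDED` threshold are hypothetical names and values introduced here, and selector generation is modeled inline rather than as a background workflow.

```typescript
// Hypothetical sketch: none of these names come from @intuned/sdk.
type Extracted = Record<string, string> | null;

interface ExtractorState {
  examples: Extracted[]; // results collected through AI extraction
  selectorsReady: boolean; // true once static selectors exist
}

interface Backends {
  aiExtract: () => Extracted; // stands in for an LLM call (incurs cost)
  staticExtract: () => Extracted; // stands in for cached selectors (no LLM call)
  isValid: (data: Extracted) => boolean;
}

// Assumed threshold; the number of examples the platform needs is not stated.
const EXAMPLES_NEEDED = 3;

function extract(
  state: ExtractorState,
  backends: Backends
): { data: Extracted; mode: "AI" | "STATIC" } {
  if (state.selectorsReady) {
    const staticData = backends.staticExtract();
    if (backends.isValid(staticData)) {
      return { data: staticData, mode: "STATIC" };
    }
    // Invalid static result: discard the selectors and collect fresh examples.
    state.selectorsReady = false;
    state.examples = [];
  }
  const aiData = backends.aiExtract();
  state.examples.push(aiData);
  // The platform generates selectors in a background workflow; here it is inline.
  if (state.examples.length >= EXAMPLES_NEEDED) {
    state.selectorsReady = true;
  }
  return { data: aiData, mode: "AI" };
}

// Usage: three AI runs, one static run, then a fallback after the page "changes".
const state: ExtractorState = { examples: [], selectorsReady: false };
let staticResult: Extracted = { title: "Example" };
const backends: Backends = {
  aiExtract: () => ({ title: "Example" }),
  staticExtract: () => staticResult,
  isValid: (data) => data !== null,
};
const modes = [1, 2, 3, 4].map(() => extract(state, backends).mode);
staticResult = null; // simulate selectors that no longer match the page
const afterChange = extract(state, backends);
```

In this simulation, only the calls that run in "AI" mode would correspond to billable LLM usage; the "STATIC" call reuses the cached selectors.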

What are the scenarios where optimized extractors perform AI extraction and incur cost?

Optimized extractors perform AI extraction in the following scenarios:

  • Initial extraction: When used for the first time on a new page or locator.
  • Insufficient examples: When collected examples are insufficient to generate reliable selectors.
  • Page Structure Changes/Invalid Extracted Data: When the page structure changes or expected data is not returned by static selectors.

What are the limitations of optimized extractors?

While optimized extractors offer a cost-effective solution for extracting structured data from webpages, they do have certain limitations:

  1. Limited JSONSchema support: Currently, optimized extractors have limited support for complex JSONSchema structures. They can handle basic objects (objects with string properties) using extractObjectFromPage and extractObjectFromLocator, and arrays of basic objects using extractArrayFromPage and extractArrayFromLocator. More complex schemas with nested objects or arrays, or properties with non-string types, are not yet supported.

  2. Exact string extraction: Optimized extractors rely on the ability to create static selectors for optimization. To achieve this, the data being extracted must be exact strings that exist in the webpage.

  3. In rare cases, the platform may not be able to generate reliable static selectors, even after collecting multiple examples. In such scenarios, the optimized extractor will continue to operate in AI Extraction mode, incurring costs on every extraction. To avoid unexpected costs, you can set limits on AI spend using labels. It's also worth mentioning that the Intuned team closely monitors these cases and continuously improves the selector generation algorithms.
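The first two limitations can be checked locally before choosing an optimized extractor. The sketch below is an assumption-laden illustration: `SchemaNode`, `isBasicObjectSchema`, `isBasicArraySchema`, and `appearsVerbatim` are hypothetical helpers written for this example and are not part of @intuned/sdk, and the platform's actual validation may differ.

```typescript
// Minimal JSON Schema shape used for this sketch only (hypothetical type).
interface SchemaNode {
  type: string;
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
}

// A "basic object" here means an object whose properties are all strings.
function isBasicObjectSchema(schema: SchemaNode): boolean {
  if (schema.type !== "object" || !schema.properties) return false;
  const props = schema.properties;
  return Object.keys(props).every((key) => props[key].type === "string");
}

// An array qualifies when its items are basic objects.
function isBasicArraySchema(schema: SchemaNode): boolean {
  return (
    schema.type === "array" &&
    schema.items !== undefined &&
    isBasicObjectSchema(schema.items)
  );
}

// Exact-string check: the value must occur verbatim in the page text.
function appearsVerbatim(value: string, pageText: string): boolean {
  return pageText.indexOf(value) !== -1;
}

const productSchema: SchemaNode = {
  type: "object",
  properties: { name: { type: "string" }, price: { type: "string" } },
};
const withBoolean: SchemaNode = {
  type: "object",
  properties: { name: { type: "string" }, inStock: { type: "boolean" } },
};
const productList: SchemaNode = { type: "array", items: productSchema };
const pageText = "Blue mug - $19.99 - In stock";
```

Here `productSchema` and `productList` fit the supported shapes, while `withBoolean` does not because of its non-string property. Likewise, "$19.99" occurs verbatim in `pageText`, whereas a derived value such as the boolean "true" does not.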

What is a variantKey and how should it be used?

In advanced scenarios, you may want to apply the same optimized extraction logic to different websites with varying page structures. To enable the Intuned platform to group examples effectively and create static extractors per group, we use the concept of variants.

By default, the variant is determined by the origin of the webpage on which the extraction is performed. This means that examples collected from pages with the same origin will be grouped together to generate static extractors specific to that website.

However, there may be cases where you need more fine-grained control over example grouping to make static extraction creation feasible. For instance, consider a situation where you want to extract data from two different pages that have the same origin but different structures. In such cases, you can manually provide a variantKey to differentiate between the two pages and ensure accurate example grouping.

Here’s how you can use the variantKey:

  1. Identify the webpages or sections that require different example grouping, even though they have the same origin.
  2. Assign a unique variantKey to each distinct webpage or section. The variantKey should be a string that meaningfully identifies the specific variation of the page structure.
  3. When calling the optimized extractor functions (extractObjectFromPage, extractObjectFromLocator, extractArrayFromPage, or extractArrayFromLocator), pass the corresponding variantKey as an optional parameter.
  4. The Intuned platform will use the provided variantKey to group examples separately for each variant, enabling the creation of static extractors tailored to each specific page structure.
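The grouping rule described above (origin by default, variantKey when provided) can be sketched as follows. This is a local illustration under assumptions: `CollectedExample`, `variantOf`, and `groupByVariant` are hypothetical names, and how the platform actually stores or combines the key with the origin is not specified here.

```typescript
// Hypothetical shape of a collected example; not an @intuned/sdk type.
interface CollectedExample {
  pageUrl: string;
  variantKey?: string;
  data: Record<string, string>;
}

// Default variant is the page origin; an explicit variantKey overrides it.
function variantOf(example: CollectedExample): string {
  return example.variantKey !== undefined
    ? example.variantKey
    : new URL(example.pageUrl).origin;
}

function groupByVariant(
  examples: CollectedExample[]
): Record<string, CollectedExample[]> {
  const groups: Record<string, CollectedExample[]> = {};
  examples.forEach((example) => {
    const key = variantOf(example);
    if (!groups[key]) groups[key] = [];
    groups[key].push(example);
  });
  return groups;
}

const collected: CollectedExample[] = [
  { pageUrl: "https://shop.example.com/products/1", data: { name: "Mug" } },
  { pageUrl: "https://shop.example.com/products/2", data: { name: "Cup" } },
  {
    pageUrl: "https://shop.example.com/blog/post",
    variantKey: "blog-post", // same origin, different structure
    data: { title: "News" },
  },
];
const groups = groupByVariant(collected);
const variantNames = Object.keys(groups).sort();
```

The two product pages fall into the origin-based group, while the blog page gets its own group because of its explicit key, so each group could receive selectors suited to its structure.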

Use the variantKey only when necessary. In most cases, the default behavior of grouping examples by page origin is sufficient. The variantKey is intended for complex websites, or for pages that share an origin but differ in structure, where separate example groups are needed to generate static extractors specific to each variant.

When to use optimized extractors vs AI extractors

  • Use AI extractors when:

    • Extracting non-exact strings (e.g., booleans, summaries).
    • Dealing with complex schemas.
    • Expecting a small number of executions.
  • Use optimized extractors when:

    • Cost is a significant factor.
    • Expecting a high number of runs.
    • Page structure is very similar across executions.

Choose the appropriate extractor based on your specific requirements, considering factors such as data complexity, execution frequency, and cost optimization.

What are labels and how should they be used?

Labels are used to identify and differentiate extractors for billing and monitoring purposes. Assign a unique label to each extractor to track its usage and costs effectively.

You can also use labels to set limits on AI spend per extractor. More on this later.

What is strategy and how should it be used?

The strategy parameter in the extractor functions allows you to control two key aspects of the extraction process:

  1. Web extraction method: It determines how data is extracted from the webpage before passing it to the LLM. Currently supported strategies are:
    • "HTML": This is the default option. Uses the HTML source of the page or locator for extraction. This strategy is suitable when the desired data is present within the HTML elements and is best extracted based on the DOM structure.
    • "IMAGE": Uses screenshots of the page or locator for extraction. This strategy is useful when the information you want to extract is primarily visual and not easily identifiable in the HTML structure.
  2. LLM selection: The strategy also influences the choice of the LLM to use for extraction, which directly impacts the cost. The model property within the strategy allows you to specify the desired model. Options are: "claude-3-opus", "claude-3-sonnet", "claude-3-haiku", "gpt4-turbo", or "gpt3.5-turbo". By default, the "claude-3-haiku" model is used.
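The two aspects of the strategy, and their defaults, can be modeled as a small type. This is a sketch under assumptions: the `model` property and the option values are taken from the text above, but the `type` property name, the `Strategy` interface, and the `resolveStrategy` helper are hypothetical and may not match the SDK's actual definitions.

```typescript
type ExtractionMethod = "HTML" | "IMAGE";
type ExtractionModel =
  | "claude-3-opus"
  | "claude-3-sonnet"
  | "claude-3-haiku"
  | "gpt4-turbo"
  | "gpt3.5-turbo";

// Hypothetical shape: `type` is an assumed property name for the method.
interface Strategy {
  type: ExtractionMethod;
  model: ExtractionModel;
}

// Fills in the documented defaults: "HTML" method and "claude-3-haiku" model.
function resolveStrategy(partial?: Partial<Strategy>): Strategy {
  const given = partial || {};
  return {
    type: given.type !== undefined ? given.type : "HTML",
    model: given.model !== undefined ? given.model : "claude-3-haiku",
  };
}

const defaults = resolveStrategy();
const visualPage = resolveStrategy({ type: "IMAGE" });
const strongerModel = resolveStrategy({ model: "claude-3-opus" });
```

Overriding one aspect leaves the other at its default, which matches the suggestion to start from the defaults and change only what the results call for.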

When deciding on the strategy to use, consider the following factors:

  • Nature of the page: If the information you want to extract is mainly visual or not easily accessible through the HTML structure, use the "IMAGE" strategy. If the data is well-structured within the HTML elements, the "HTML" strategy is more suitable.

  • Cost considerations: The AI model used for extraction directly affects the cost incurred.

Overall, we suggest that you start with the default strategy (method and model) and iterate based on the results.

For more details, see extractArrayFromPage, extractArrayFromLocator, extractObjectFromPage, and extractObjectFromLocator.

extractStructuredDataFromContent

extractStructuredDataFromContent extracts structured data from arbitrary content, which is useful when the input is a piece of text or an image rather than a page or locator.

For more details, see extractStructuredDataFromContent.