The extractor functions in @intuned/sdk allow you to extract data by providing a schema that describes the desired output.
extractStructuredDataFromPage and extractStructuredDataFromLocator extract structured data from a full page or from a specific section of a page (identified by a locator). These methods retrieve the content, pass it to the LLM for extraction, and incur costs based on the input size, schema complexity, and selected strategy.
extractArrayFromPage
extractArrayFromLocator
extractObjectFromPage
extractObjectFromLocator
Currently, you can extract basic objects using extractObjectFromPage and extractObjectFromLocator, and arrays of basic objects using extractArrayFromPage and extractArrayFromLocator. More complex schemas with nested objects or arrays, or properties with non-string types, are not yet supported.
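The schema restriction above can be checked before calling an extractor. The following is a minimal sketch, assuming a JSON-Schema-like object shape; the `ObjectSchema` and `PropertySchema` types and the `findUnsupportedProperties` helper are hypothetical names introduced for illustration and are not part of @intuned/sdk.

```typescript
// Hypothetical schema shapes for illustration only; not the @intuned/sdk types.
type PropertySchema = {
  type: string;
  description?: string;
  properties?: unknown; // present on nested objects
  items?: unknown; // present on arrays
};

type ObjectSchema = {
  type: "object";
  properties: Record<string, PropertySchema>;
  required?: string[];
};

// Returns the names of properties that are not plain string properties,
// i.e. nested objects, arrays, numbers, booleans, and so on.
function findUnsupportedProperties(schema: ObjectSchema): string[] {
  return Object.keys(schema.properties).filter(
    (name) => schema.properties[name].type !== "string"
  );
}

// A flat, string-only schema: the kind described as supported.
const productSchema: ObjectSchema = {
  type: "object",
  properties: {
    title: { type: "string", description: "Product title" },
    price: { type: "string", description: "Price as displayed on the page" },
  },
  required: ["title"],
};

// A schema with an array and a number property: described as not yet supported.
const nestedSchema: ObjectSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    tags: { type: "array", items: { type: "string" } },
    rating: { type: "number" },
  },
};

console.log(findUnsupportedProperties(productSchema));
console.log(findUnsupportedProperties(nestedSchema));
```

A check like this only flags properties to simplify, for example by requesting a number as a string and converting it afterwards.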
What is variantKey and how to use it?
When two pages share the same origin but have different structures, use the variantKey to differentiate between the two pages and ensure accurate example grouping.
Here’s how you can use the variantKey:
1. Assign a unique variantKey to each distinct webpage or section. The variantKey should be a string that meaningfully identifies the specific variation of the page structure.
2. When calling an extractor function (extractObjectFromPage, extractObjectFromLocator, extractArrayFromPage, or extractArrayFromLocator), pass the corresponding variantKey as an optional parameter.
3. Intuned uses the variantKey to group examples separately for each variant, enabling the creation of static extractors tailored to each specific page structure.
By using the variantKey, you can handle situations where the same optimized extraction logic needs to be applied to pages with different structures, even if they share the same origin. This allows for more precise example grouping and enables the generation of static extractors that are specific to each variant of the page structure.
It’s important to note that the variantKey should be used only when necessary. In most cases, the default behavior of grouping examples by the page origin is sufficient. However, when dealing with complex websites or when you require more granular control over example grouping, the variantKey provides a powerful mechanism to optimize the extraction process and ensure accurate results.
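To make the grouping behavior concrete, here is a minimal sketch of how examples might be bucketed by page origin, with an optional variantKey splitting one origin into several groups. This is an illustration only and not Intuned's actual implementation: the `exampleGroupKey` and `groupExamples` helpers, the `#` separator, and the example URLs are all assumptions.

```typescript
// Hypothetical record of a page that produced an extraction example.
type ExampleRef = { url: string; variantKey?: string };

// Builds the key an example would be grouped under: the page origin alone,
// or the origin plus the variantKey when one is provided.
function exampleGroupKey(pageUrl: string, variantKey?: string): string {
  const origin = new URL(pageUrl).origin;
  return variantKey ? `${origin}#${variantKey}` : origin;
}

// Groups example URLs by their group key.
function groupExamples(examples: ExampleRef[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const example of examples) {
    const key = exampleGroupKey(example.url, example.variantKey);
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(example.url);
  }
  return groups;
}

// Two pages on the same origin with different structures.
const examples: ExampleRef[] = [
  { url: "https://shop.example.com/products/1", variantKey: "product-detail" },
  { url: "https://shop.example.com/search?q=shoes", variantKey: "search-results" },
];

// Without variant keys, both pages fall into a single origin-based group.
const withoutKeys = groupExamples(examples.map((e) => ({ url: e.url })));
// With variant keys, each page structure gets its own group.
const withKeys = groupExamples(examples);

console.log(Object.keys(withoutKeys));
console.log(Object.keys(withKeys));
```

In this sketch, the variant keys keep examples from the product page and the search page apart, so that an extractor built from one group is never trained on the other page's structure.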
What is strategy and how should it be used?
The strategy parameter in the extractor functions allows you to control two key aspects of the extraction process:
1. The content used for extraction:
   - "HTML": This is the default option. Uses the HTML source of the page or locator for extraction. This strategy is suitable when the desired data is present within the HTML elements and is best extracted based on the DOM structure.
   - "IMAGE": Uses screenshots of the page or locator for extraction. This strategy is useful when the information you want to extract is primarily visual and not easily identifiable in the HTML structure.
2. The model used for extraction: the model property within the strategy allows you to specify the desired model. Options are: "claude-3-opus", "claude-3-sonnet", "claude-3-haiku", "gpt4-turbo", or "gpt3.5-turbo". By default, the "claude-3-haiku" model is used.
If the information you want to extract is primarily visual, use the "IMAGE" strategy. If the data is well-structured within the HTML elements, the "HTML" strategy is more suitable.
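The options and defaults described above can be summarized as a small type plus a helper that fills in the defaults. This is a sketch under stated assumptions: the `model` property and the listed values come from the text, but the `type` field name, the `Strategy` interface, and the `resolveStrategy` helper are hypothetical and may not match the SDK's own definitions.

```typescript
// Values quoted in the text above.
type StrategyType = "HTML" | "IMAGE";
type StrategyModel =
  | "claude-3-opus"
  | "claude-3-sonnet"
  | "claude-3-haiku"
  | "gpt4-turbo"
  | "gpt3.5-turbo";

// Hypothetical shape: `model` is named in the text; `type` is an assumed field name.
interface Strategy {
  type: StrategyType;
  model: StrategyModel;
}

// Fills in the documented defaults ("HTML" and "claude-3-haiku")
// for any field the caller leaves out.
function resolveStrategy(partial: Partial<Strategy> = {}): Strategy {
  return {
    type: partial.type ?? "HTML",
    model: partial.model ?? "claude-3-haiku",
  };
}

// No strategy given: both defaults apply.
const defaultStrategy = resolveStrategy();

// Visual content: switch to screenshots and pick a different model.
const visualStrategy = resolveStrategy({ type: "IMAGE", model: "gpt4-turbo" });

console.log(defaultStrategy, visualStrategy);
```

Keeping the strategy in one typed object makes it easy to switch a single extraction call to "IMAGE" without touching the rest of the code.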
extractStructuredDataFromContent enables extracting data from arbitrary content, which is useful when you want to extract structured data from text or an image.