Goal

It's very common in scraping workloads to need data that lives in files: contracts, financial statements, product specs, and so on. In this guide, we show how you can use Intuned to extract data from webpages and files in a reliable and scalable way.

To do that, we will use the page https://sandbox.intuned.dev/pdfs as an example. The page contains a list of products along with a specs file for each of them. Our goal is to build an API that extracts data about each product and returns it.

This guide does not go into the details of setting up a job or sending the result data to a webhook; we cover those in other guides. The focus is the API logic that extracts the data using the @intuned/sdk helpers.

1. Extract list of products from table

  • In a new or existing project, create a new API. Call it products.ts.
  • Within the created API, add the following code:

// new logic to add to the newly created API.
await page.goto("https://sandbox.intuned.dev/pdfs");

// Extract one entry per table row, shaped by the schema below.
const monitors = await page.extractArrayOptimized({
    label: "pdf demo site",
    itemEntityName: "monitor",
    itemEntitySchema: {
        type: "object",
        properties: {
            "name": {
                type: "string",
                primary: true,
            },
            "manufacturer": {
                type: "string",
            },
            "model": {
                type: "string",
            },
            "spec_href": {
                type: "string",
                "description": "href value of the spec for the monitor"
            }
        },
        required: ["name", "spec_href", "manufacturer", "model"],
    },
});

console.log(monitors);

This logic uses the extractArrayOptimized helper to extract the monitors' info from the table into a monitors array, with one entry per table row.
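
The entries in that array can be given an explicit type that mirrors the schema above. The sketch below is an illustration rather than part of the SDK: the `Monitor` interface, the `toAbsoluteSpecUrl` helper, and the sample row are all hypothetical. It also assumes that the extracted `spec_href` may be a relative path, in which case it has to be resolved against the page URL before a file can be fetched from it.

```typescript
// Hypothetical type mirroring the itemEntitySchema above (not exported by the SDK).
interface Monitor {
  name: string;
  manufacturer: string;
  model: string;
  spec_href: string;
}

const PAGE_URL = "https://sandbox.intuned.dev/pdfs";

// Resolve a possibly relative href against the page it was found on.
// Absolute hrefs pass through unchanged.
function toAbsoluteSpecUrl(monitor: Monitor, baseUrl: string = PAGE_URL): Monitor {
  return { ...monitor, spec_href: new URL(monitor.spec_href, baseUrl).toString() };
}

// Made-up sample row, for illustration only.
const sample: Monitor = {
  name: "Example Monitor",
  manufacturer: "Example Corp",
  model: "EX-100",
  spec_href: "/files/ex-100.pdf",
};

const resolved = toAbsoluteSpecUrl(sample);
console.log(resolved.spec_href);
```

If the hrefs on the page are already absolute, the helper leaves them as they are, so applying it is harmless either way.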

Run the API and make sure the extractor is reading the right data and working as expected. Create empty parameters when asked.
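
One way to check that the extractor is reading the right data is to scan the output for rows with missing or blank required fields. The following is a minimal sketch under stated assumptions: `findIncompleteRows` is a hypothetical helper, not an SDK function, and the rows are made up. The field names come from the `required` list in the schema above.

```typescript
// Field names taken from the `required` list in the schema above.
const REQUIRED_FIELDS = ["name", "spec_href", "manufacturer", "model"] as const;

type Row = Record<string, unknown>;

interface RowProblem {
  index: number;
  missing: string[];
}

// Return, for each incomplete row, the required fields that are missing or blank.
function findIncompleteRows(rows: Row[]): RowProblem[] {
  const problems: RowProblem[] = [];
  rows.forEach((row, index) => {
    const missing = REQUIRED_FIELDS.filter((field) => {
      const value = row[field];
      return typeof value !== "string" || value.trim() === "";
    });
    if (missing.length > 0) {
      problems.push({ index, missing: [...missing] });
    }
  });
  return problems;
}

// Made-up rows for illustration; real rows come from the extractor.
const extractedRows: Row[] = [
  { name: "Monitor A", manufacturer: "Acme", model: "A1", spec_href: "/a1.pdf" },
  { name: "Monitor B", manufacturer: "", model: "B2" },
];

const rowProblems = findIncompleteRows(extractedRows);
console.log(rowProblems);
```

An empty result means every row carries all four fields. Anything else points at the rows worth inspecting by hand.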

2. Extract data from the specs file

  • Add the following code to the products.ts API. Intuned provides a helper, extractStructuredDataFromFile, that extracts data from files. It takes a file URL and a JSON schema for the data you are trying to extract, and returns the extracted data as a JSON object. To learn more about file data extraction, check out File data extraction.
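    // extractStructuredDataFromFile must be imported from the @intuned/sdk package before use.
    // For each monitor, fetch its spec PDF and extract the fields described by dataSchema.
    for (const monitor of monitors) {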
        const specs = await extractStructuredDataFromFile({
            type: "pdf",
            source: {
                type: "url",
                "data": monitor.spec_href,
            },
        }, {
            label: "spec files",
            dataSchema: {
                type: "object",
                properties: {
                    "models": {
                        description: "model numbers included in this spec sheet",
                        type: "array",
                        items: {
                            type: "string"
                        }
                    },
                    "color_depth": {
                        type: "string",
                        description: "color depth of the monitor"
                    },
                    "max_resolution": {
                        type: "string",
                        description: "max resolution of the screen and at what Hz"
                    },
                    "power_source": {
                        type: "object",
                        properties: {
                            "power_rating": {
                                type: "string",
                            },
                            "power_consumption": {
                                type: "string",
                            }
                        }
                    },
                    "adaptor": {
                        type: "string",
                        "description": "AC AD adaptor specs"
                    },
                    "dimensions": {
                        type: "object",
                        properties: {
                            "with_stand": {
                                type: "string"
                            },
                            "without_stand": {
                                type: "string"
                            }
                        },
                    },
                    "weight": {
                        type: "object",
                        properties: {
                            "with_stand": {
                                type: "string"
                            },
                            "without_stand": {
                                type: "string"
                            }
                        },
                    }
                },
                required: ["models", "color_depth", "max_resolution", "power_source", "adaptor", "dimensions", "weight"],
            }
        });

        monitor.specs = specs;
    }

    return monitors;
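
In the loop above, a single unreadable file would abort the whole run. A possible variation, sketched here under stated assumptions, records the failure on the affected item and carries on. The `MonitorRow` and `SpecExtractor` types, the `attachSpecs` function, the `specsError` field, and the stub extractor are all hypothetical; in the real API, the extractor role would be played by a call to extractStructuredDataFromFile.

```typescript
// Hypothetical shapes for this sketch; they are not SDK types.
interface MonitorRow {
  name: string;
  spec_href: string;
  specs?: unknown;
  specsError?: string;
}

// Any async function that turns a file URL into extracted data.
type SpecExtractor = (url: string) => Promise<unknown>;

// Extract specs for every row, recording failures instead of throwing.
async function attachSpecs(
  rows: MonitorRow[],
  extract: SpecExtractor
): Promise<MonitorRow[]> {
  const results: MonitorRow[] = [];
  for (const row of rows) {
    try {
      results.push({ ...row, specs: await extract(row.spec_href) });
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      results.push({ ...row, specsError: message });
    }
  }
  return results;
}

// Stub extractor standing in for the real helper, for demonstration only.
const stubExtractor: SpecExtractor = async (url) => {
  if (url.endsWith("broken.pdf")) {
    throw new Error("could not read file");
  }
  return { source: url };
};

// Made-up rows for illustration.
const demoRows: MonitorRow[] = [
  { name: "Monitor A", spec_href: "https://example.com/a.pdf" },
  { name: "Monitor B", spec_href: "https://example.com/broken.pdf" },
];

const demoResult = attachSpecs(demoRows, stubExtractor);
demoResult.then((rows) => console.log(rows));
```

Passing the extractor in as a parameter keeps the loop testable without network access. Whether a failed file should be skipped or should fail the whole run depends on how the results are consumed.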

Summary

In this guide, we went over how to extract data from a list of items and then extract data from files. We used the extractArrayOptimized helper to extract the list of items and extractStructuredDataFromFile to extract data from files.