How to scrape files - persist and extract data

Goal

It's very common in scraping workloads to need to extract data from files: contracts, financial statements, product specs, and so on. In this guide, we will show you how to use Intuned to extract data from webpages and files in a reliable and scalable way.

To do that, we will use the page https://sandbox.intuned.dev/pdfs as an example. The page contains a list of products along with a specs file for each of them. Our goal is to build an API that extracts data about each product, including the data in its specs file, and returns it.

This guide will not go into the details of setting up a job or sending the result data to a webhook; we cover those in other guides. The focus here is the API logic that extracts the data using the @intuned/sdk helpers.

1. Extract list of products from table

In a new or existing project, create a new API. Call it products.ts.
Within the created API, add the following code:

// Logic to add to the newly created API: open the page that lists the products.
await page.goto("https://sandbox.intuned.dev/pdfs");

// Extract one item per table row, shaped by the schema below.
const monitors = await page.extractArrayOptimized({
    label: "pdf demo site",
    itemEntityName: "monitor",
    itemEntitySchema: {
        type: "object",
        properties: {
            "name": {
                type: "string",
                primary: true,
            },
            "manufacturer": {
                type: "string",
            },
            "model": {
                type: "string",
            },
            "spec_href": {
                type: "string",
                "description": "href value of the spec for the monitor"
            }
        },
        required: ["name", "spec_href", "manufacturer", "model"],
    },
});

console.log(monitors);

This logic uses the extractArrayOptimized helper to extract the monitor info from the table into the monitors array, with one item per table row.

Run the API and make sure the extractor is reading the right data and working as expected. Create empty parameters when asked.
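Beyond reading the console output, the extracted rows can also be checked in code. The following is a minimal sketch, assuming each row carries the four string fields from the schema above. The names MonitorRow, RowCheckResult, and checkMonitorRows are hypothetical and not part of @intuned/sdk, and the idea that spec_href might be a relative link that needs resolving against the page URL is an assumption.

```typescript
// Hypothetical row shape, mirroring the itemEntitySchema above.
interface MonitorRow {
  name: string;
  manufacturer: string;
  model: string;
  spec_href: string;
}

interface RowCheckResult {
  valid: MonitorRow[]; // rows with all fields present and an absolute spec URL
  problems: string[];  // human-readable descriptions of rejected rows
}

const REQUIRED_FIELDS = ["name", "manufacturer", "model", "spec_href"] as const;

// Checks that every required field is a non-empty string and resolves
// spec_href against the page URL (assumption: hrefs may be relative).
function checkMonitorRows(
  rows: Partial<MonitorRow>[],
  pageUrl: string
): RowCheckResult {
  const result: RowCheckResult = { valid: [], problems: [] };

  rows.forEach((row, index) => {
    const missing = REQUIRED_FIELDS.filter((field) => {
      const value = row[field];
      return typeof value !== "string" || value.trim() === "";
    });
    if (missing.length > 0) {
      result.problems.push(`row ${index}: missing ${missing.join(", ")}`);
      return;
    }

    try {
      // new URL(relative, base) leaves absolute URLs untouched.
      const absolute = new URL(row.spec_href as string, pageUrl).toString();
      result.valid.push({ ...(row as MonitorRow), spec_href: absolute });
    } catch {
      result.problems.push(`row ${index}: invalid spec_href "${row.spec_href}"`);
    }
  });

  return result;
}
```

In the API, this could be called as checkMonitorRows(monitors, page.url()) right after the extraction, logging problems before continuing with the valid rows.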

2. Extract data from the specs file

Add the following code to the products.ts API. Intuned provides the extractStructuredDataFromFile helper for extracting data from files. The helper takes a file descriptor (here, a PDF referenced by its URL) and a JSON schema describing the data you are trying to extract, and returns the extracted data as a JSON object. To learn more about file data extraction, check out File data extraction.

    for (const monitor of monitors) {
        const specs = await extractStructuredDataFromFile({
            type: "pdf",
            source: {
                type: "url",
                "data": monitor.spec_href,
            },
        }, {
            label: "spec files",
            dataSchema: {
                type: "object",
                properties: {
                    "models": {
                        description: "model numbers included in this spec sheet",
                        type: "array",
                        items: {
                            type: "string"
                        }
                    },
                    "color_depth": {
                        type: "string",
                        description: "color depth of the monitor"
                    },
                    "max_resolution": {
                        type: "string",
                        description: "max resolution of the screen and at what Hz"
                    },
                    "power_source": {
                        type: "object",
                        properties: {
                            "power_rating": {
                                type: "string",
                            },
                            "power_consumption": {
                                type: "string",
                            }
                        }
                    },
                    "adaptor": {
                        type: "string",
                        "description": "AC AD adaptor specs"
                    },
                    "dimensions": {
                        type: "object",
                        properties: {
                            "with_stand": {
                                type: "string"
                            },
                            "without_stand": {
                                type: "string"
                            }
                        },
                    },
                    "weight": {
                        type: "object",
                        properties: {
                            "with_stand": {
                                type: "string"
                            },
                            "without_stand": {
                                type: "string"
                            }
                        },
                    }
                },
                required: ["models", "color_depth", "max_resolution", "power_source", "adaptor", "dimensions", "weight"],
            }
        });

        monitor.specs = specs;
    }

    return monitors;
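
Because the table row and the spec file both mention model numbers, the two sources can be cross-checked before returning the result. The sketch below is one possible way to do that, assuming the shapes produced by the two schemas above (a model string from the table, and a models array inside specs). MonitorWithSpecs, normalizeModel, and findModelMismatches are hypothetical names, and the normalization rule (ignore case, spaces, and hyphens) is an assumption about how formatting might differ between the page and the PDF.

```typescript
// Hypothetical shape: only the fields this check needs.
interface MonitorWithSpecs {
  name: string;
  model: string;
  specs?: { models?: string[] };
}

// Normalizes a model number for comparison: case-insensitive, ignoring
// spaces and hyphens (an assumption, adjust to the real data).
function normalizeModel(model: string): string {
  return model.toLowerCase().replace(/[\s-]+/g, "");
}

// Returns the names of monitors whose table model is not listed
// among the model numbers extracted from the spec sheet.
function findModelMismatches(monitors: MonitorWithSpecs[]): string[] {
  return monitors
    .filter((monitor) => {
      const listed = (monitor.specs?.models ?? []).map(normalizeModel);
      return listed.indexOf(normalizeModel(monitor.model)) === -1;
    })
    .map((monitor) => monitor.name);
}
```

Calling findModelMismatches(monitors) just before the return statement and logging the result gives a quick signal that a spec file was matched to the wrong product or that the file extraction missed the model list.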

Summary

In this guide, we went over how to extract data from a list of items and then extract data from files. We used the extractArrayOptimized helper to extract the list of items and extractStructuredDataFromFile to extract data from files.

Related

How to scrape data and integrate with webhooks

Jobs

For more info on Jobs and how to use them.
