How to scrape files - persist and extract data
Goal
It's very common in scraping workloads to need to scrape data from files. Think contracts, financial statements, product specs, etc. In this guide, we will show you how you can use Intuned to extract data from webpages and files in a reliable and scalable way.
To do that, we will use the page https://sandbox.intuned.dev/pdfs as an example. The page contains a list of products along with a specs file for each of them. Our goal is to build an API that extracts data about each product and returns it.
This guide will not go into details about setting up a job or sending the result data to a webhook; we cover those in other guides. The focus will be the API logic to extract the data using the @intuned/sdk helpers.
1. Extract list of products from table
- In a new or existing project, create a new API. Call it products.ts.
- Within the created API, add the following code:
This logic uses the extractArrayOptimized helper to extract the monitors info from the table into a monitors object.
Run the API and make sure the extractor is reading the right data and working as expected. Create empty parameters when asked.
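One way to check that the extractor is reading the right data is to run a few sanity checks over the extracted rows before moving on. The sketch below is only an illustration: the MonitorRow shape, its field names, and the checkRows and toAbsoluteUrl helpers are hypothetical and not part of @intuned/sdk; adjust them to whatever your extractor actually returns.

```typescript
// Hypothetical row shape. The real field names depend on the
// schema you give the extractor; these are assumptions.
interface MonitorRow {
  name: string;
  specsHref: string; // link to the specs file, possibly relative
}

const PAGE_URL = "https://sandbox.intuned.dev/pdfs";

// Returns a list of human-readable problems.
// An empty list means the rows passed these basic checks.
function checkRows(rows: MonitorRow[]): string[] {
  const problems: string[] = [];
  if (rows.length === 0) {
    problems.push("no rows extracted");
  }
  rows.forEach((row, i) => {
    if (row.name.trim() === "") {
      problems.push(`row ${i}: empty name`);
    }
    if (row.specsHref.trim() === "") {
      problems.push(`row ${i}: missing specs link`);
    }
  });
  return problems;
}

// Resolves a possibly relative link against the page URL so the
// file can be fetched later regardless of how the href was written.
function toAbsoluteUrl(href: string, base: string = PAGE_URL): string {
  return new URL(href, base).toString();
}
```

Logging the result of checkRows during a test run makes it easier to spot a table column that was mapped to the wrong field before any file extraction is attempted.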
2. Extract data from the specs file
- Add the following code to the products.ts API. Intuned has a helper (extractStructuredDataFromFile) that extracts data from files. The extractStructuredDataFromFile helper takes a file URL and a JSON schema for the data you are trying to extract, and returns the extracted data as a JSON object. To learn more about file data extraction, check out File data extraction.
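The overall shape of this step (a JSON schema, one file extraction per product, results merged back into the product list) can be sketched as follows. This is not the guide's code and does not call the SDK: the FileExtractor type is a stand-in with an assumed signature, the schema fields are illustrative guesses rather than fields taken from the sandbox PDFs, and attachSpecs is a hypothetical helper. In a real API you would wrap extractStructuredDataFromFile, using its documented signature, to satisfy FileExtractor.

```typescript
// Illustrative JSON schema for the data we hope to find in a specs file.
// Field names are assumptions, not taken from the sandbox PDFs.
const specsSchema = {
  type: "object",
  properties: {
    model: { type: "string" },
    screenSize: { type: "string" },
    resolution: { type: "string" },
  },
  required: ["model"],
} as const;

type JsonObject = Record<string, unknown>;

// Stand-in for a file extraction helper: takes a file URL and a
// JSON schema, resolves to a JSON object. Assumed signature, NOT
// the actual @intuned/sdk one.
type FileExtractor = (fileUrl: string, schema: object) => Promise<JsonObject>;

interface Product {
  name: string;
  specsUrl: string;
}

interface ProductWithSpecs extends Product {
  specs: JsonObject | null;
  error?: string;
}

// Extracts specs for each product in turn. A failure on one file is
// recorded on that product instead of aborting the whole run.
async function attachSpecs(
  products: Product[],
  extractFromFile: FileExtractor
): Promise<ProductWithSpecs[]> {
  const results: ProductWithSpecs[] = [];
  for (const product of products) {
    try {
      const specs = await extractFromFile(product.specsUrl, specsSchema);
      results.push({ ...product, specs });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ ...product, specs: null, error: message });
    }
  }
  return results;
}
```

Passing the extractor in as a parameter keeps the merging logic testable with a fake extractor, and recording per-file errors means one unreadable PDF does not discard the data already extracted for the other products.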
Summary
In this guide, we went over how to extract data from a list of items and then extract data from files. We used the extractArrayOptimized helper to extract the list of items and extractStructuredDataFromFile to extract data from files.