In this how-to guide, we will go over how to scrape data from a website and send it to an S3 bucket. For this example, we will use https://books.toscrape.com/ as the data source. The goal is to scrape the books in the Poetry and Travel categories and send the results to an S3 bucket, either on demand or on a schedule.
Create a new API (books) with the code shown below. This API receives a category param and scrapes the books in that category. Notice that we are using Optimized extractors here, and that the API returns an array of the scraped book data.
Run it and make sure everything is working well.
Deploy the project.
```typescript
import { BrowserContext, Page } from "playwright-core";
import { extendPlaywrightPage } from "@intuned/sdk/playwright";

interface Params {
  // Add your params here
  category: string;
}

export default async function handler(
  params: Params,
  _playwrightPage: Page,
  context: BrowserContext
) {
  const page = extendPlaywrightPage(_playwrightPage);
  await page.goto("https://books.toscrape.com/");

  // playwright logic
  await page.getByRole("link", { name: params.category }).click();

  // @intuned/sdk helper!
  const result = await page.extractArrayOptimized({
    itemEntityName: "book",
    label: "books-scraper",
    itemEntitySchema: {
      type: "object",
      properties: {
        name: {
          type: "string",
          description: "name of the book",
          primary: true,
        },
        price: {
          type: "string",
          description: "price of the book. An example is £26.80",
        },
      },
      required: ["name", "price"],
    },
  });
  return result;
}
```
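For reference, each item the extractor returns matches the itemEntitySchema above. The Book interface below is purely illustrative (it is not part of the SDK); the sample values are taken from the Travel category results shown later in this guide.

```typescript
// Illustrative shape of each item returned by extractArrayOptimized for
// the schema above. "Book" is our own name, not an SDK type.
interface Book {
  name: string;  // e.g. "It's Only the Himalayas"
  price: string; // e.g. "£45.17"
}

// A run with { category: "Travel" } resolves to an array like:
const sample: Book[] = [
  { name: "It's Only the Himalayas", price: "£45.17" },
  { name: "Under the Tuscan Sun", price: "£37.33" },
];
```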
Jobs are a way to schedule recurring or batched/grouped executions. For more info, check out Jobs.
In this example, we need to call the books API twice (once with category Poetry and once with category Travel) and send the results to an S3 bucket. We will create a job that does this on demand.
Jobs can be created via the UI or the API. For this example, we will use the UI. Check out the Jobs API overview for more info.
Get an S3 bucket configuration, including the bucket name, region, and AWS access keys with write access to the bucket.
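If you still need to create those access keys, a minimal IAM policy for the bucket might look like the sketch below. At minimum the sink needs to write objects (s3:PutObject); your setup may require more, so treat this as a starting point and replace <YOUR_BUCKET_NAME> with your bucket name.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
    }
  ]
}
```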
Go to the Jobs tab in the UI and create a new job with the following config. Don't forget to replace <YOUR_ACCESS_KEY_ID> and <YOUR_SECRET_ACCESS_KEY> with your AWS access keys. Check out the create job reference for more info.
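As a sketch, the config pairs the two books API calls with an S3 sink. The field names below are illustrative assumptions rather than the authoritative schema (the job id books-s3 matches the payload shown later); check the create job reference for the exact fields.

```json
{
  "id": "books-s3",
  "payload": [
    { "apiName": "books", "parameters": { "category": "Poetry" } },
    { "apiName": "books", "parameters": { "category": "Travel" } }
  ],
  "sink": {
    "type": "s3",
    "bucket": "<YOUR_BUCKET_NAME>",
    "region": "<YOUR_BUCKET_REGION>",
    "accessKeyId": "<YOUR_ACCESS_KEY_ID>",
    "secretAccessKey": "<YOUR_SECRET_ACCESS_KEY>"
  }
}
```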
Now that you have created a job, you can trigger it manually. This will run the job immediately and send the resulting API data to the S3 bucket.
You will have an S3 record for each API call in the payload. You can see an example S3 payload below; it includes the API name, parameters, runId, result, and more.
{ "apiInfo": { "name": "books", "parameters": { "category": "Travel" }, "runId": "-cCPG1rLxQV-3Dl", "result": { "status": "completed", "result": [ { "name": "It's Only the Himalayas", "price": "£45.17" }, { "name": "Full Moon over Noah's Ark: An Odyssey to Mount Ararat and Beyond", "price": "£49.43" }, { "name": "See America: A Celebration of Our National Parks & Treasured Sites", "price": "£48.87" }, { "name": "Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel", "price": "£36.94" }, { "name": "Under the Tuscan Sun", "price": "£37.33" }, { "name": "A Summer In Europe", "price": "£44.34" }, { "name": "The Great Railway Bazaar", "price": "£30.54" }, { "name": "A Year in Provence (Provence #1)", "price": "£56.88" }, { "name": "The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2)", "price": "£23.21" }, { "name": "Neither Here nor There: Travels in Europe", "price": "£38.95" }, { "name": "1,000 Places to See Before You Die", "price": "£26.08" } ], "statusCode": 200 } }, "workspaceId": "8ee20714-1d06-4a49-9d2d-d033aaed8031", "project": { "name": "new-project", "id": "6b4930d6-90aa-4f9f-8d0c-3914d432ba45" }, "projectJob": { "id": "books-s3" }, "projectJobRun": { "id": "2cc04948-c913-4623-a520-6b44702f3599" }}