How to scrape data and save to S3
Goal
In this how-to guide, we will go over how to scrape data from a website and send it to your S3 bucket. For this example, we will use https://books.toscrape.com/ as the data source. The goal is to scrape the books in the Poetry and Travel categories and send them to an S3 bucket. We want to be able to do this on demand or on a schedule.
Step by step
1. Create a project and API
- Create a new project.
- Create a new API named books with code along the lines of the sketch after this list. This API receives a category parameter and scrapes the books in that category. Notice that we are using Optimized extractors here. Notice also that the API returns an array of the scraped book records.
- Run it and make sure everything is working well.
- Deploy the project.
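Here is a minimal sketch of what the books API might look like. The handler signature and the Optimized-extractor helpers vary by project template, so plain Playwright selectors stand in for them here; the Params and Book types are illustrative.

```typescript
import { Page } from "playwright";

// Illustrative types; adapt to your project's API template.
interface Params {
  category: string; // "Poetry" or "Travel" in this guide
}

interface Book {
  title: string;
  price: string;
  availability: string;
}

export default async function handler(params: Params, page: Page): Promise<Book[]> {
  // Start from the home page and follow the sidebar link for the category,
  // since category URLs on books.toscrape.com embed numeric ids.
  await page.goto("https://books.toscrape.com/");
  await page.getByRole("link", { name: params.category, exact: true }).click();

  // Extract one record per book card on the category page.
  // Swap these selectors for the platform's Optimized extractors where available.
  const books: Book[] = await page.$$eval("article.product_pod", (cards) =>
    cards.map((card) => ({
      title: card.querySelector("h3 a")?.getAttribute("title") ?? "",
      price: card.querySelector(".price_color")?.textContent ?? "",
      availability: card.querySelector(".instock.availability")?.textContent?.trim() ?? "",
    }))
  );

  // The API returns an array of the scraped book records.
  return books;
}
```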
2. Create a job with S3 sink
Jobs are a way to schedule recurring or batched/grouped executions. For more info about Jobs, check out Jobs.
In this example, we know that we need to call the books API twice (once with category Poetry and once with category Travel) and send the results to an S3 bucket. We will create a job that does this on demand.
Jobs can be created via the UI or the API. For this example, we will use the UI. Check out the Jobs API overview for more info.
- Get your S3 bucket configuration, including the bucket name, region, and AWS access keys with write access to the bucket.
- Go to the Jobs tab in the UI and create a new job with a config along the lines of the example after this list. Don't forget to replace <YOUR_ACCESS_KEY_ID> and <YOUR_SECRET_ACCESS_KEY> with your AWS access keys. Check out the create job reference for more info.
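The exact job-config schema is platform-specific, but a config along these lines captures what this guide needs: two calls to the books API and an S3 sink. Apart from the access-key placeholders mentioned above, the field names here are assumptions; the create job reference has the authoritative schema.

```json
{
  "id": "books-to-s3",
  "payload": [
    { "apiName": "books", "parameters": { "category": "Poetry" } },
    { "apiName": "books", "parameters": { "category": "Travel" } }
  ],
  "sink": {
    "type": "s3",
    "bucket": "<YOUR_BUCKET_NAME>",
    "region": "<YOUR_BUCKET_REGION>",
    "accessKeyId": "<YOUR_ACCESS_KEY_ID>",
    "secretAccessKey": "<YOUR_SECRET_ACCESS_KEY>"
  }
}
```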
3. Trigger the job
- Now that you have created a job, you can trigger it manually. This will run the job immediately and send the resulting API data to the S3 bucket.
- You will get one S3 record for each API call in the payload.
- You can see an example S3 record below. It includes the API name, parameters, runId, result, and more.
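A sketch of one such record, built from the fields the guide names (API name, parameters, runId, result); the exact shape and the sample book values are illustrative, not the platform's definitive output format.

```json
{
  "apiName": "books",
  "parameters": { "category": "Poetry" },
  "runId": "<RUN_ID>",
  "result": [
    {
      "title": "A Light in the Attic",
      "price": "£51.77",
      "availability": "In stock"
    }
  ]
}
```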
4. Create a scheduled job
Let's assume you need the job to run every day so you can keep your internal store updated with the latest books in these categories.
To do this, create a new job with a schedule configuration. Here is an example job config:
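A sketch, assuming the same schema as the on-demand job above plus a schedule block; the cron expression 0 0 * * * runs the job daily at midnight UTC. As before, the field names are assumptions, so consult the Jobs reference for the exact schema.

```json
{
  "id": "books-to-s3-daily",
  "schedule": { "type": "cron", "expression": "0 0 * * *" },
  "payload": [
    { "apiName": "books", "parameters": { "category": "Poetry" } },
    { "apiName": "books", "parameters": { "category": "Travel" } }
  ],
  "sink": {
    "type": "s3",
    "bucket": "<YOUR_BUCKET_NAME>",
    "region": "<YOUR_BUCKET_REGION>",
    "accessKeyId": "<YOUR_ACCESS_KEY_ID>",
    "secretAccessKey": "<YOUR_SECRET_ACCESS_KEY>"
  }
}
```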