How to scrape a big list - nested scheduling
Goal
Let's assume that we want to scrape book data from https://books.toscrape.com/, but we are interested in more than the book name: for each book, we want the name, the UPC, and the number of reviews. If you go to the website, you will see that the main page doesn't contain this extra info (the UPC, for example), so we need to navigate to each book's details page to get it.
To scrape this data, we need to build a job that goes to the main list, gets the list of items to scrape, and then scrapes them one by one. Since this is a recurring scrape, we also want to show how to run the process every day and send the data to a webhook.
This guide explains how to do the above in Intuned, using nested scheduling within a job.
Step by step
1. Create a project and required APIs
- Create a new project.
- Create a `book-details.ts` API. This API takes a `bookFullUrl` parameter, navigates to that URL, scrapes the needed data (name, UPC, number of reviews) from the details page, and returns it (see the first sketch after this list).
- Create a `books-all.ts` API. This API scrapes the name and URL of every book on https://books.toscrape.com/. For each URL, it calls the `extendPayload` function, which extends the payload of the job run by adding a new payload item to it. Each new payload item runs as part of the same job run (see the second sketch after this list).
- Deploy the project.
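Here is a minimal sketch of what `book-details.ts` could look like. It assumes the API file is a default-exported handler that receives `params` and a Playwright `BrowserContext` — match the exact signature to your Intuned project template. The selectors target the title `<h1>` and the "Product Information" table on a book's details page.

```typescript
// book-details.ts -- sketch; the handler shape (params + BrowserContext)
// is assumed here, match it to your Intuned project template.
import { BrowserContext } from "playwright-core";

interface BookDetailsParams {
  bookFullUrl: string;
}

export default async function handler(
  params: BookDetailsParams,
  context: BrowserContext
) {
  const page = await context.newPage();
  await page.goto(params.bookFullUrl);

  // The title is the page's <h1>; UPC and the review count live in the
  // "Product Information" table, keyed by each row's <th> text.
  const name = await page.locator(".product_main h1").innerText();
  const upc = await page.locator('th:has-text("UPC") + td').innerText();
  const numberOfReviews = Number(
    await page.locator('th:has-text("Number of reviews") + td').innerText()
  );

  return { name, upc, numberOfReviews };
}
```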
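And a sketch of `books-all.ts`. The `extendPayload` import path and the payload item shape (`apiName` plus `parameters`) are assumptions for illustration — take the exact ones from the Intuned SDK docs for your project.

```typescript
// books-all.ts -- sketch; the extendPayload import path and the payload
// item shape (apiName + parameters) are assumptions, check the SDK docs.
import { BrowserContext } from "playwright-core";
import { extendPayload } from "@intuned/sdk/runtime"; // assumed path

export default async function handler(
  _params: unknown,
  context: BrowserContext
) {
  const page = await context.newPage();
  await page.goto("https://books.toscrape.com/");

  // Each book on the main page is an <article class="product_pod"> whose
  // <h3><a> holds the title and a relative link to the details page.
  const links = page.locator("article.product_pod h3 a");
  const count = await links.count();

  const books: { name: string; bookFullUrl: string }[] = [];
  for (let i = 0; i < count; i++) {
    const link = links.nth(i);
    const name = (await link.getAttribute("title")) ?? (await link.innerText());
    const href = await link.getAttribute("href");
    if (!href) continue;
    books.push({
      name,
      bookFullUrl: new URL(href, "https://books.toscrape.com/").toString(),
    });
  }

  // Queue one book-details payload item per book; these run as part of
  // the same job run.
  await extendPayload(
    books.map((book) => ({
      apiName: "book-details",
      parameters: { bookFullUrl: book.bookFullUrl },
    }))
  );

  return books;
}
```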
2. Create a job
Jobs are a way to schedule recurring or batched/grouped executions. For more info about jobs, check out Jobs.
In this scenario, we will schedule a job that runs daily with one payload item (api: `books-all`). When this API runs, it extends the job run payload with one payload item (api: `book-details`) for each book on the main page.
Jobs can be created via the UI or the API. For this example, we will use the UI. Check out the Jobs API overview for more info.
- Get a webhook URL. For testing, you can use https://webhook.site/ to get a temporary URL. In a real scenario, you would use your own webhook URL and persist the data to a store.
- Go to the Jobs tab in the UI and create a new job with a config like the example below. Don't forget to replace `<YOUR_WEBHOOK_URL>` with your webhook URL.
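The shape below (payload items, a daily cron schedule, and a webhook sink) is illustrative; the exact field names come from the job creation form/API in your Intuned workspace.

```json
{
  "id": "scrape-books-daily",
  "payload": [{ "apiName": "books-all", "parameters": {} }],
  "schedule": { "cron": "0 0 * * *" },
  "sink": { "type": "webhook", "url": "<YOUR_WEBHOOK_URL>" }
}
```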
3. Trigger the job
- Now that you have created the job, you can trigger it manually. This runs the job immediately and sends the resulting API data to the webhook.
- You will get one webhook call for each API run in the payload. In this case, even though the job payload config has a single payload item, the job run will have 20 more - one for each payload item added by the `books-all` API.
- You can look at a sample webhook payload below. It includes the API name, parameters, runId, result, and more.
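An illustrative payload for one `book-details` run is shown here; the exact key names and any extra fields depend on what your sink sends.

```json
{
  "apiName": "book-details",
  "parameters": {
    "bookFullUrl": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
  },
  "runId": "<RUN_ID>",
  "result": {
    "name": "A Light in the Attic",
    "upc": "a897fe39b1053632",
    "numberOfReviews": 0
  }
}
```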
Summary
In this guide, we went over how to scrape a big list of items using nested scheduling. We created a job that runs daily and scrapes the main page to get the list of items to scrape. For each item, we extended the job payload with a new payload item. Each new payload item is executed as part of the same run, and all the results are sent to the sink - a webhook in this case.