Goal

In this how-to guide, we will go over how to scrape data from a website and send it to a webhook. For this example, we will use https://books.toscrape.com/ as the data source. The goal is to scrape the books in the Poetry and Travel categories and send them to a webhook, either on demand or on a schedule.

Step by step

1. Create a project and API

  • Create a new project.

  • Create a new API (books) with the code listed at the end of this step. This API receives a category param and scrapes the books in that category. Notice that we are using Optimized extractors here. Notice also that the API returns an array of the scraped books data.

  • Run it and make sure everything is working well. If you need a test param, you can use:

{
  "category": "Novels"
}
  • Deploy the project.

Code for the books API:

import { BrowserContext, Page } from "@intuned/playwright-core";
import { extendPlaywrightPage } from "@intuned/sdk/playwright";

interface Params {
  // Add your params here
  category: string;
}

export default async function handler(
  params: Params,
  _playwrightPage: Page,
  context: BrowserContext
) {
  const page = extendPlaywrightPage(_playwrightPage);

  await page.goto("https://books.toscrape.com/");

  // playwright logic
  await page.getByRole("link", { name: params.category }).click();

  // @intuned/sdk helper!
  const result = await page.extractArrayOptimized({
    itemEntityName: "book",
    label: "books-scraper",
    itemEntitySchema: {
      type: "object",
      properties: {
        name: {
          type: "string",
          description: "name of the book",
          primary: true,
        },
        price: {
          type: "string",
          description: "price of the book. An example is £26.80",
        },
      },
      required: ["name", "price"],
    },
  });

  return result;
}
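
The schema above returns price as a string such as £26.80. If the consumer of the data needs numeric prices, a small post-processing step can convert them. The following is a minimal sketch, not part of the Intuned SDK: the ScrapedBook shape is assumed from the schema, and parsePrice and normalizeBooks are hypothetical helper names.

```typescript
// Assumed shape of one scraped item, mirroring the schema in the books API.
interface ScrapedBook {
  name: string;
  price: string;
}

interface PricedBook {
  name: string;
  price: number | null; // null when no number could be found
}

// Hypothetical helper: pull the first decimal number out of a price string.
function parsePrice(raw: string): number | null {
  const match = raw.replace(/,/g, "").match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Hypothetical helper: trim names and convert prices for a whole result array.
function normalizeBooks(books: ScrapedBook[]): PricedBook[] {
  return books.map((book) => ({
    name: book.name.trim(),
    price: parsePrice(book.price),
  }));
}
```

Keeping price as a string in the extractor and converting afterwards means the currency symbol is still available if you need it later.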

2. Create a job with webhook sink

Jobs are a way to schedule recurring or batched/grouped executions. For more info about Jobs, check out Jobs.

In this example, we need to call the books API twice (once with category Poetry and once with category Travel) and send the results to a webhook. We will create a job that does this on demand.

Jobs can be created via the UI or the API. For this example, we will use the UI. Check out the Jobs API overview for more info.

  • Get a webhook URL. For testing, you can use https://webhook.site/ to get a temporary URL. In a real scenario, you would use your own webhook URL and persist the data to your own store.

  • Go to the Jobs tab in the UI and create a new job with the following config. Don’t forget to replace <YOUR_WEBHOOK_URL> with your webhook URL.

{
  "id": "books-webhook",
  "configuration": {
    "runMode": "Order-Irrelevant",
    "maxConcurrentRequests": 5,
    "retry": {
      "maximumAttempts": 5
    }
  },
  "sink": {
    "type": "webhook",
    "url": "<YOUR_WEBHOOK_URL>"
  },
  "payload": [
    {
      "apiName": "books",
      "parameters": {
        "category": "Poetry" 
      }
    },
    {
      "apiName": "books",
      "parameters": {
        "category": "Travel"
      }
    }
  ]
}
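
If the list of categories grows, writing one payload entry per category by hand gets repetitive. The sketch below generates the same job config from a list of categories. The JobConfig type only mirrors the fields shown in the JSON above and is not an official SDK type, and buildBooksJob is a hypothetical helper name.

```typescript
// Types mirroring the job config JSON shown above (an assumption, not an SDK type).
interface JobPayloadItem {
  apiName: string;
  parameters: { category: string };
}

interface JobConfig {
  id: string;
  configuration: {
    runMode: string;
    maxConcurrentRequests: number;
    retry: { maximumAttempts: number };
  };
  sink: { type: "webhook"; url: string };
  payload: JobPayloadItem[];
}

// Hypothetical helper: one payload entry per category, all calling the books API.
function buildBooksJob(id: string, webhookUrl: string, categories: string[]): JobConfig {
  return {
    id,
    configuration: {
      runMode: "Order-Irrelevant",
      maxConcurrentRequests: 5,
      retry: { maximumAttempts: 5 },
    },
    sink: { type: "webhook", url: webhookUrl },
    payload: categories.map((category) => ({
      apiName: "books",
      parameters: { category },
    })),
  };
}

const job = buildBooksJob("books-webhook", "<YOUR_WEBHOOK_URL>", ["Poetry", "Travel"]);
// JSON text that can be pasted into the job creation form.
const jobJson = JSON.stringify(job, null, 2);
```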

3. Trigger the job

  • Now that you have created a job, you can trigger it manually. This runs the job immediately and sends the API results to the webhook.

  • You will get one webhook call for each API run in the payload, so two calls in this example.

  • The webhook payload includes the API name, parameters, runId, result and more.
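
On the receiving side, it helps to validate each webhook body before persisting it. The sketch below is a minimal example under stated assumptions: the field names apiName, parameters, runId and result are guessed from the description above and may differ from the real payload, so check an actual delivery (for example on webhook.site) before relying on them. parseBooksWebhook and summarize are hypothetical helper names.

```typescript
// Assumed payload shape; the real webhook body may use different field names
// and will likely contain more fields than these.
interface BooksWebhookPayload {
  apiName: string;
  parameters: { category: string };
  runId: string;
  result: { name: string; price: string }[];
}

// Parse a raw request body and check the fields this sketch relies on.
function parseBooksWebhook(rawBody: string): BooksWebhookPayload {
  const data = JSON.parse(rawBody);
  if (typeof data !== "object" || data === null) {
    throw new Error("webhook body is not a JSON object");
  }
  if (typeof data.apiName !== "string" || typeof data.runId !== "string") {
    throw new Error("webhook body is missing apiName or runId");
  }
  if (typeof data.parameters !== "object" || data.parameters === null) {
    throw new Error("webhook body is missing parameters");
  }
  if (!Array.isArray(data.result)) {
    throw new Error("webhook result is not an array");
  }
  return data as BooksWebhookPayload;
}

// One-line summary, useful for logging each delivery.
function summarize(payload: BooksWebhookPayload): string {
  return `${payload.apiName} (${payload.parameters.category}), run ${payload.runId}: ${payload.result.length} book(s)`;
}
```

Because each API run arrives as a separate call, the handler only ever deals with one category at a time and can key stored rows by runId.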

4. Create a scheduled job

Let's assume that you need a job to run every day so you can keep your internal store updated with the latest books in some categories.

To do this, you will need to create a new job with a schedule configuration. Here is an example job config:

{
  "id": "books-webhook",
  "configuration": {
    "runMode": "Order-Irrelevant",
    "maxConcurrentRequests": 5,
    "retry": {
      "maximumAttempts": 5
    }
  },
  "schedule": {
    "intervals": [
      {
        "every": "1d"
      }
    ]
  },
  "sink": {
    "type": "webhook",
    "url": "<YOUR_WEBHOOK_URL>"
  },
  "payload": [
    {
      "apiName": "books",
      "parameters": {
        "category": "Poetry" 
      }
    },
    {
      "apiName": "books",
      "parameters": {
        "category": "Travel"
      }
    }
  ]
}
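
The scheduled config differs from the on-demand one only by the schedule key. A minimal sketch of deriving one from the other is shown below. withSchedule is a hypothetical helper, and "1d" is the only interval string confirmed by this guide, so treat any other value as an assumption to verify against the Jobs documentation.

```typescript
// Shape of the schedule block as shown in the config above.
interface JobSchedule {
  intervals: { every: string }[];
}

// Hypothetical helper: return a copy of a job config with a schedule added.
// The original object is left untouched.
function withSchedule<T extends object>(job: T, every: string): T & { schedule: JobSchedule } {
  return { ...job, schedule: { intervals: [{ every }] } };
}

const onDemandJob = {
  id: "books-webhook",
  sink: { type: "webhook", url: "<YOUR_WEBHOOK_URL>" },
  payload: [
    { apiName: "books", parameters: { category: "Poetry" } },
    { apiName: "books", parameters: { category: "Travel" } },
  ],
};

// Same job, now set to run once a day.
const dailyJob = withSchedule(onDemandJob, "1d");
```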