New puppeteer examples #1355

Merged
merged 6 commits into from
Sep 25, 2024
Changes from 3 commits
31 changes: 30 additions & 1 deletion docs/config/config-file.mdx
@@ -4,6 +4,7 @@ sidebarTitle: "Configuration"
description: "This file is used to configure your project and how it's built."
---

import ScrapingWarning from "/snippets/web-scraping-warning.mdx";
import BundlePackages from "/snippets/bundle-packages.mdx";

The `trigger.config.ts` file is used to configure your Trigger.dev project. It is a TypeScript file at the root of your project that exports a default configuration object. Here's an example:
@@ -473,6 +474,34 @@ export default defineConfig({
});
```

#### puppeteer

<ScrapingWarning />

To use Puppeteer in your project, add these build settings to your `trigger.config.ts` file:

```ts trigger.config.ts
import { defineConfig } from "@trigger.dev/sdk/v3";
import { puppeteer } from "@trigger.dev/build/extensions/puppeteer";

export default defineConfig({
  project: "<project ref>",
  // Your other config settings...
  build: {
    extensions: [puppeteer()],
  },
});
```

Then add the following environment variable on the Environment Variables page of your Trigger.dev dashboard:

```bash
PUPPETEER_EXECUTABLE_PATH="/usr/bin/google-chrome-stable"
```

<Note>
Ensure you use `puppeteer` not `puppeteer-core` in your build configuration.
</Note>

#### ffmpeg

You can add the `ffmpeg` build extension to your build process:
@@ -482,7 +511,7 @@ import { defineConfig } from "@trigger.dev/sdk/v3";
import { ffmpeg } from "@trigger.dev/build/extensions/core";

export default defineConfig({
-  //..other stuff
+  // Your other config settings...
build: {
extensions: [ffmpeg()],
},
1 change: 1 addition & 0 deletions docs/examples/intro.mdx
@@ -11,6 +11,7 @@ description: "Learn how to use Trigger.dev with these practical task examples."
| [OpenAI with retrying](/examples/open-ai-with-retrying) | Create a reusable OpenAI task with custom retry options. |
| [PDF to image](/examples/pdf-to-image) | Use `MuPDF` to turn a PDF into images and save them to Cloudflare R2. |
| [React to PDF](/examples/react-pdf) | Use `react-pdf` to generate a PDF and save it to Cloudflare R2. |
| [Puppeteer](/examples/puppeteer) | Use Puppeteer to generate a PDF or scrape a webpage. |
| [Resend email sequence](/examples/resend-email-sequence) | Send a sequence of emails over several days using Resend with Trigger.dev. |
| [Sharp image processing](/examples/sharp-image-processing) | Use Sharp to process an image and save it to Cloudflare R2. |
| [Vercel AI SDK](/examples/vercel-ai-sdk) | Use Vercel AI SDK to generate text using OpenAI. |
213 changes: 213 additions & 0 deletions docs/examples/puppeteer.mdx
@@ -0,0 +1,213 @@
---
title: "Puppeteer"
sidebarTitle: "Puppeteer"
description: "These examples demonstrate how to use Puppeteer with Trigger.dev."
---

import LocalDevelopment from "/snippets/local-development-extensions.mdx";
import ScrapingWarning from "/snippets/web-scraping-warning.mdx";

## Overview

There are three example tasks on this page:

1. [Basic example](/examples/puppeteer#basic-example)
2. [Generate a PDF from a web page](/examples/puppeteer#generate-a-pdf-from-a-web-page)
3. [Scrape content from a web page](/examples/puppeteer#scrape-content-from-a-web-page)

<ScrapingWarning />

## Build configurations

To use all examples on this page, you'll first need to add these build settings to your `trigger.config.ts` file:

```ts trigger.config.ts
import { defineConfig } from "@trigger.dev/sdk/v3";
import { puppeteer } from "@trigger.dev/build/extensions/puppeteer";

export default defineConfig({
  project: "<project ref>",
  // Your other config settings...
  build: {
    // This is required to use the Puppeteer library
    extensions: [puppeteer()],
  },
});
```

## Set an environment variable

Add the following environment variable in your Trigger.dev dashboard on the Environment Variables page:

```bash
PUPPETEER_EXECUTABLE_PATH="/usr/bin/google-chrome-stable"
```

## Basic example

### Overview

In this example, we use Puppeteer to log the title of a web page (in this case, Google).

### Task code

```ts trigger/puppeteer-basic-example.ts
import { logger, task } from "@trigger.dev/sdk/v3";
import puppeteer from "puppeteer";

export const puppeteerTask = task({
  id: "puppeteer-log-title",
  run: async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto("https://google.com");

    const content = await page.title();
    logger.info("Content", { content });

    await browser.close();
  },
});
```
🛠️ Refactor suggestion

Consider adding launch options for better compatibility

The basic example demonstrates Puppeteer usage correctly. However, to ensure better compatibility across different environments, consider adding launch options to the puppeteer.launch() call.

Modify the puppeteer.launch() call as follows:

-    const browser = await puppeteer.launch();
+    const browser = await puppeteer.launch({
+      headless: "new",
+      args: ['--no-sandbox', '--disable-setuid-sandbox']
+    });

This change ensures compatibility with newer versions of Puppeteer and improves stability in various environments, including containerized ones.
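
The launch options suggested above can be collected in one small helper so that every task launches the browser the same way. The following is a minimal sketch and not part of the PR: `buildLaunchOptions` and the `LaunchSettings` type are hypothetical names. It uses `headless: true` instead of `"new"` because the accepted `headless` values differ between Puppeteer versions, so check the version you have installed. The sandbox flags are the ones from the suggestion; whether you need them depends on the environment the task runs in.

```typescript
// Hypothetical helper (not from the PR): builds the object passed to puppeteer.launch().
type LaunchSettings = {
  headless: boolean;
  args: string[];
  executablePath?: string;
};

// `env` is expected to be process.env; it is a parameter so the helper can be tested in isolation.
function buildLaunchOptions(
  env: Record<string, string | undefined>
): LaunchSettings {
  const options: LaunchSettings = {
    headless: true,
    // Flags taken from the review suggestion.
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  };

  // Only set executablePath when the variable is present and non-empty.
  const executablePath = (env["PUPPETEER_EXECUTABLE_PATH"] ?? "").trim();
  if (executablePath !== "") {
    options.executablePath = executablePath;
  }
  return options;
}
```

Inside a task, this would be used as `puppeteer.launch(buildLaunchOptions(process.env))`.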

🧰 Tools
LanguageTool

[typographical] ~49-~49: It appears that a comma is missing.
Context: ...## Basic example ### Overview In this example we use Puppeteer to log out the title o...

(DURING_THAT_TIME_COMMA)


### Testing your task

There's no payload required for this task, so you can just click "Run test" from the Test page in the dashboard.

## Generate a PDF from a web page

### Overview

In this example, we use Puppeteer to generate a PDF from a web page and upload it to Cloudflare R2.

### Task code

```ts trigger/puppeteer-generate-pdf.ts
import { logger, task } from "@trigger.dev/sdk/v3";
import puppeteer from "puppeteer";
import { PutObjectCommand, S3Client } from "@aws-sdk/client-s3";

// Initialize S3 client
const s3Client = new S3Client({
  region: "auto",
  endpoint: process.env.S3_ENDPOINT,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID ?? "",
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY ?? "",
  },
});

export const puppeteerWebpageToPDF = task({
  id: "puppeteer-webpage-to-pdf",
  run: async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const response = await page.goto("https://google.com");
    const url = response?.url() ?? "No URL found";

    // Generate PDF from the web page
    const generatePdf = await page.pdf();

    logger.info("PDF generated from URL", { url });

    await browser.close();

    // Upload to R2
    const s3Key = `pdfs/test.pdf`;
    const uploadParams = {
      Bucket: process.env.S3_BUCKET,
      Key: s3Key,
      Body: generatePdf,
      ContentType: "application/pdf",
    };

    logger.log("Uploading to R2 with params", uploadParams);

    // Upload the PDF to R2 and return the URL.
    await s3Client.send(new PutObjectCommand(uploadParams));
    const s3Url = `https://${process.env.S3_BUCKET}.s3.amazonaws.com/${s3Key}`;
    logger.log("PDF uploaded to R2", { url: s3Url });
    return { pdfUrl: s3Url };
  },
});
```

🛠️ Refactor suggestion

Enhance error handling and resource management

The PDF generation example is well-structured but could benefit from improved error handling and resource management.

Consider the following improvements:

  1. Wrap the main logic in a try-catch block to handle potential errors.
  2. Ensure the browser is closed even if an error occurs.
  3. Use environment variables more safely with fallback values.

Here's a suggested refactor:

 export const puppeteerWebpageToPDF = task({
   id: "puppeteer-webpage-to-pdf",
   run: async () => {
+    let browser;
+    try {
-      const browser = await puppeteer.launch();
+      browser = await puppeteer.launch({
+        headless: "new",
+        args: ['--no-sandbox', '--disable-setuid-sandbox']
+      });
       const page = await browser.newPage();
       const response = await page.goto("https://google.com");
       const url = response?.url() ?? "No URL found";

       // Generate PDF from the web page
       const generatePdf = await page.pdf();

       logger.info("PDF generated from URL", { url });

-      await browser.close();

       // Upload to R2
       const s3Key = `pdfs/test.pdf`;
       const uploadParams = {
-        Bucket: process.env.S3_BUCKET,
+        Bucket: process.env.S3_BUCKET ?? '',
         Key: s3Key,
         Body: generatePdf,
         ContentType: "application/pdf",
       };

       logger.log("Uploading to R2 with params", uploadParams);

       // Upload the PDF to R2 and return the URL.
       await s3Client.send(new PutObjectCommand(uploadParams));
-      const s3Url = `https://${process.env.S3_BUCKET}.s3.amazonaws.com/${s3Key}`;
+      const s3Url = `https://${process.env.S3_BUCKET ?? ''}.s3.amazonaws.com/${s3Key}`;
       logger.log("PDF uploaded to R2", { url: s3Url });
       return { pdfUrl: s3Url };
+    } catch (error) {
+      logger.error("Error in puppeteerWebpageToPDF", { error });
+      throw error;
+    } finally {
+      if (browser) {
+        await browser.close();
+      }
+    }
   },
 });

These changes improve error handling, ensure proper resource cleanup, and make the code more robust against potential issues with environment variables.

🧰 Tools
LanguageTool

[typographical] ~81-~81: It appears that a comma is missing.
Context: ... from a web page ### Overview In this example we use Puppeteer to generate a PDF from...

(DURING_THAT_TIME_COMMA)
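
The PDF task reads its bucket name and R2 credentials from environment variables and falls back to an empty string when one is missing, which delays the failure until the upload call. As an alternative, a small helper can fail fast with a clear message. This is a sketch under assumptions: `requireEnv`, `loadR2Settings`, and the `R2Settings` type are hypothetical names, while the variable names are the ones used in the task code.

```typescript
// Hypothetical settings shape for the R2 upload (not from the PR).
type R2Settings = {
  endpoint: string;
  bucket: string;
  accessKeyId: string;
  secretAccessKey: string;
};

// Returns the variable's value, or throws when it is missing or blank.
function requireEnv(
  env: Record<string, string | undefined>,
  name: string
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// `env` is expected to be process.env; it is a parameter so the helper can be tested in isolation.
function loadR2Settings(env: Record<string, string | undefined>): R2Settings {
  return {
    endpoint: requireEnv(env, "S3_ENDPOINT"),
    bucket: requireEnv(env, "S3_BUCKET"),
    accessKeyId: requireEnv(env, "R2_ACCESS_KEY_ID"),
    secretAccessKey: requireEnv(env, "R2_SECRET_ACCESS_KEY"),
  };
}
```

Calling `loadR2Settings(process.env)` at the start of `run` would surface a missing variable before the browser is launched.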

### Testing your task

There's no payload required for this task, so you can just click "Run test" from the Test page in the dashboard.

## Scrape content from a web page

### Overview

In this example, we use Puppeteer with a Browserbase proxy to scrape the GitHub star count from the [Trigger.dev](https://trigger.dev) landing page and log it.

<ScrapingWarning />

### Task code

```ts trigger/scrape-website.ts
import { logger, task } from "@trigger.dev/sdk/v3";
import puppeteer from "puppeteer-core";

export const puppeteerScrapeWithProxy = task({
  id: "puppeteer-scrape-with-proxy",
  run: async () => {
    const browser = await puppeteer.connect({
      browserWSEndpoint: `wss://connect.browserbase.com?apiKey=${process.env.BROWSERBASE_API_KEY}`,
    });

    const page = await browser.newPage();

    // Set up BrowserBase proxy authentication
    await page.authenticate({
      username: "api",
      password: process.env.BROWSERBASE_API_KEY || "",
    });

    try {
      // Navigate to the target website
      await page.goto("https://trigger.dev", { waitUntil: "networkidle0" });

      // Scrape the GitHub stars count
      const starCount = await page.evaluate(() => {
        const starElement = document.querySelector(".github-star-count");
        const text = starElement?.textContent ?? "0";
        const numberText = text.replace(/[^0-9]/g, "");
        return parseInt(numberText);
      });

      logger.info("GitHub star count", { starCount });

      return { starCount };
    } catch (error) {
      logger.error("Error during scraping", {
        error: error instanceof Error ? error.message : String(error),
      });
      throw error;
    } finally {
      await browser.close();
    }
  },
});
```
🛠️ Refactor suggestion

Improve star count extraction logic

The web scraping example is well-structured with good error handling. However, the star count extraction logic could be more robust.

Consider modifying the star count extraction logic to handle potential formatting variations:

       const starCount = await page.evaluate(() => {
         const starElement = document.querySelector(".github-star-count");
-        const text = starElement?.textContent ?? "0";
-        const numberText = text.replace(/[^0-9]/g, "");
-        return parseInt(numberText);
+        const text = starElement?.textContent?.trim() ?? "0";
+        const match = text.match(/^([\d,]+)/);
+        return match ? parseInt(match[1].replace(/,/g, '')) : 0;
       });

This change improves the extraction logic by:

  1. Trimming whitespace from the text content.
  2. Using a regex to match the first group of digits (including commas).
  3. Removing commas before parsing the integer.
  4. Returning 0 if no match is found.

These modifications make the extraction more resilient to different formatting styles of the star count.
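
The extraction steps listed above can also be pulled out of `page.evaluate` into a plain function, which makes them testable without a browser. This is a minimal sketch, assuming the element text looks like `8,123` or `8,123 stars`; `parseStarCount` is a hypothetical name. Abbreviated counts such as `8.1k` are not handled (this sketch would return 8 for that input).

```typescript
// Hypothetical helper (not from the PR): parses a leading, comma-grouped integer from element text.
function parseStarCount(text: string | null | undefined): number {
  // Step 1: trim whitespace, treating a missing element as empty text.
  const trimmed = (text ?? "").trim();

  // Step 2: match the leading run of digits and commas.
  const match = trimmed.match(/^([\d,]+)/);
  if (!match) {
    // Step 4: no leading number found.
    return 0;
  }

  // Step 3: remove commas before parsing.
  const digits = match[1].replace(/,/g, "");
  const parsed = Number.parseInt(digits, 10);

  // A match made only of commas parses to NaN; report it as 0 as well.
  return Number.isNaN(parsed) ? 0 : parsed;
}
```

Because `page.evaluate` runs its callback in the browser, one option is to return the raw `textContent` from the page and call `parseStarCount` in the task.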

🧰 Tools
LanguageTool

[typographical] ~144-~144: It appears that a comma is missing.
Context: ... from a web page ### Overview In this example we use Puppeteer with a BrowserBase pro...

(DURING_THAT_TIME_COMMA)


### Testing your task

There's no payload required for this task, so you can just click "Run test" from the Test page in the dashboard.

<LocalDevelopment packages={"the Puppeteer library."} />

## Proxying

If you're using Trigger.dev Cloud with Puppeteer or any other tool to scrape content from websites you don't own, you must proxy your requests. **If you don't, you risk getting our IP address blocked, and we will ban you from our service.**

Here is a list of proxy services we recommend:

- [Browserbase](https://www.browserbase.com/)
- [Bright Data](https://brightdata.com/)
- [Browserless](https://browserless.io/)
- [Oxylabs](https://oxylabs.io/)
- [ScrapingBee](https://scrapingbee.com/)
- [Smartproxy](https://smartproxy.com/)
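
The scraping task on this page builds the Browserbase WebSocket URL by interpolating `BROWSERBASE_API_KEY` directly, so a missing variable produces a URL that contains the string `undefined`. The following is a small sketch of a guard: `browserbaseEndpoint` is a hypothetical helper, and the base URL is the one used in the task code.

```typescript
// Hypothetical helper (not from the PR): builds the endpoint passed to puppeteer.connect().
function browserbaseEndpoint(apiKey: string | undefined): string {
  const key = (apiKey ?? "").trim();
  if (key === "") {
    throw new Error("BROWSERBASE_API_KEY is not set");
  }
  // Encode the key so characters such as "&" or spaces cannot break the query string.
  return `wss://connect.browserbase.com?apiKey=${encodeURIComponent(key)}`;
}
```

In the task, this would replace the template string: `puppeteer.connect({ browserWSEndpoint: browserbaseEndpoint(process.env.BROWSERBASE_API_KEY) })`.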
1 change: 1 addition & 0 deletions docs/mint.json
@@ -282,6 +282,7 @@
"examples/ffmpeg-video-processing",
"examples/open-ai-with-retrying",
"examples/pdf-to-image",
"examples/puppeteer",
"examples/sharp-image-processing",
"examples/react-pdf",
"examples/resend-email-sequence",
3 changes: 3 additions & 0 deletions docs/snippets/web-scraping-warning.mdx
@@ -0,0 +1,3 @@
<Warning>
**WEB SCRAPING WARNING:** Direct scraping of third-party websites without explicit permission using Trigger.dev Cloud is strictly prohibited and will result in immediate account suspension. If web scraping is necessary for your project, you MUST use a proxy service to comply with our terms of service.
</Warning>