Lambda Scraper is an AWS Lambda function that scrapes any number of web pages you define, searches each page for new links, and (optionally) filters the results by keyword. If it finds results, it sends you an email via AWS SES.
Use it to add notification functionality to sites that don't natively have notifications. Use it to be notified of new product listings, cheap flights, job listings, or whatever you dream up.
The initial use case was as a Careers page scraper for my 👫, because it's hard out there for fashion students, and a lot of Careers pages offer no way of being notified of new postings. I'm not sure what else you might use this for, but if you come up with something good, let me know.
Experience working with AWS will be super handy if you want to set this up for yourself. If my instructions are unclear and you'd like to get this set up, ping me and I might put together a video walkthrough if you pressure me enough.
- Download this repo and create a `config.js` file (see `config.example.js` for the initial format of this file). Go ahead and set the `email_to` (your email here), `email_subject`, and AWS API key and secret. You'll set the other values in the next steps. (A sketch of a filled-in config appears after this list.)
- Create an S3 bucket, then enter the bucket name in the `config.js` from step 1.
- Create and verify an SES sending email or domain, then enter the sending email address (`email_from`) in `config.js`. Set the `aws_region` in your `config.js` to the region you used for your SES email.
- Define the pages and their selectors in `pages.js` (see below).
- Locally, run `npm install`.
- Zip your project folder and upload it to Lambda (see below).
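For reference, a filled-in `config.js` might look something like the sketch below. The `email_to`, `email_subject`, `email_from`, and `aws_region` keys come straight from the steps above; the key names used here for the S3 bucket and the AWS credentials are assumptions, so copy `config.example.js` and keep whatever names it actually uses.

```js
// config.js — hypothetical filled-in example.
// Bucket and credential key names below are guesses; mirror config.example.js exactly.
module.exports = {
  email_to: "you@example.com",             // where notification emails are sent
  email_subject: "New listings found",
  email_from: "notify@yourdomain.com",     // SES-verified sender address
  aws_region: "us-east-1",                 // region of your verified SES identity
  bucket: "my-lambda-scraper-bucket",      // assumed key name for the S3 bucket
  aws_key: "YOUR_AWS_ACCESS_KEY_ID",       // assumed key names for the API credentials
  aws_secret: "YOUR_AWS_SECRET_ACCESS_KEY"
};
```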
As an example, if the HTML you're trying to scrape from a page (e.g. `https://example.com/listing`) looks like this:
```html
<ul>
  <li class="posting">
    <h1>Hello world</h1>
    <a class="more" href="http://example.com">Read more</a>
    <span class="location">NYC</span>
  </li>
  <li class="posting">
    <h1>Hello world</h1>
    <a class="more" href="http://example.com">Read more</a>
    <span class="location">NYC</span>
  </li>
</ul>
```
In `pages.js`, you'd enter:
```js
{
  url: "https://example.com/listing", // required
  keywords: ["sales", "marketing", "fashion", "jewelry"], // optional
  parent: ".posting", // required
  selectors: {
    title: "h1", // required
    url: ".more", // required
    location: ".location" // optional key/value (title/selector)
  }
}
```
Quick note on images: if you enter a key of `image` or `thumbnail`, the selector must point to an `img` tag.
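To illustrate, a page definition that also captures a thumbnail might look like the sketch below; the `.posting img` selector is only an assumption about where the image lives on your target page.

```js
{
  url: "https://example.com/listing",
  parent: ".posting",
  selectors: {
    title: "h1",
    url: ".more",
    thumbnail: ".posting img" // image/thumbnail keys must resolve to an <img> element
  }
}
```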
For JSON endpoints, use `$` as the top-level object within the `parent` array. For example, given a JSON endpoint (e.g. `https://example.com/listing.json`) with a response of:
```json
{
  "postings": [{
    "title": "Hello world",
    "link": "http://example.com",
    "location": { "title": "NYC" }
  }]
}
```
In `pages.js`, you'd enter:
```js
{
  url: "https://example.com/listing.json",
  json: true, // required for JSON
  keywords: ["sales", "marketing", "fashion", "jewelry"],
  parent: "postings",
  selectors: {
    title: "$.title",
    url: "$.link",
    center: "$.location.title"
  }
}
```
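Putting it together, `pages.js` holds one entry per page you want to watch. The sketch below assumes the file exports an array of these definitions via `module.exports`; check the `pages.js` shipped with the repo for the exact export shape.

```js
// pages.js — hypothetical example watching one HTML page and one JSON endpoint.
// The module.exports-an-array shape is an assumption; mirror the file in the repo.
module.exports = [
  {
    url: "https://example.com/listing",
    keywords: ["sales", "marketing"], // optional keyword filter
    parent: ".posting",
    selectors: {
      title: "h1",
      url: ".more",
      location: ".location"
    }
  },
  {
    url: "https://example.com/listing.json",
    json: true,
    parent: "postings",
    selectors: {
      title: "$.title",
      url: "$.link"
    }
  }
];
```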
Create the zip package
You can zip the package as you normally would, or you can run the `zip` npm script:

```sh
$ npm install
$ npm run zip
```

(You can ignore the `.env` file this creates in your root directory.)
Create the Lambda function
- Start with the "canary" blueprint.
- Create a CloudWatch event and set the rate (e.g. the cron expression `20 2 * * ? *` to run once a day at 2:20 UTC).
- In the final review step of creating the Lambda function, make sure to enable the event source.
Lambda settings:
- Runtime: Node.js 4.3
- Handler: `index.handler`
- Memory: 128 MB should be enough
- Timeout: 2 minutes should be plenty (mine hasn't gone beyond 15 seconds)
In `index.js`, set `var debug` to `true`. In your terminal, you can then run the following (from the root of the project) to see what results are found for the pages you've defined:
```sh
$ node
> require('./index').handler()
Found results for...
```