(2/3) Create landing metadata JSON trigger - invoke lambda - fetch stories #24

rivernews · 2022-08-21T22:21:49Z

S3 Notification

In #23 we already set up S3 notifications using EventBridge. We want to do the same now for the metadata JSON, but where we store this JSON matters a lot.

Recall that landing pages are stored at paths like s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html. We set prefix filtering, and the prefix stops at daily-headlines.

If we store the JSON in the same directory as the landing page, like s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing_meta.json, it makes sense for a human, but no fixed prefix can uniquely select it: the timestamp directory sits between daily-headlines and the file name. Filtering by suffix is not possible either.

A workaround is to share the landing page filter's prefix, ...daily-headlines, and do the more advanced filtering in the lambda. So yes, you can't have separate rules for the landing page and the metadata JSON, but the purpose is served: in either case, a notification is generated. You just need to do some routing in the lambda, and resources are still not wasted.
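The routing described above could look roughly like the sketch below. It is written in Python for brevity (the project's lambdas are in Go) and is not the code from the repository. It assumes the EventBridge "Object Created" event shape, where the object key sits at `detail.object.key`; the function name `route_s3_event` and the return labels are hypothetical, and the two file names are taken from the example paths above.

```python
from typing import Optional

# File names taken from the example S3 paths in this issue.
LANDING_FILE = "landing.html"
METADATA_FILE = "landing_meta.json"


def route_s3_event(event: dict) -> Optional[str]:
    """Decide which handler should process an S3 event delivered by EventBridge.

    Both files share the same prefix filter, so the lambda tells them apart
    by the last path segment of the object key.
    """
    key = event.get("detail", {}).get("object", {}).get("key", "")
    filename = key.rsplit("/", 1)[-1]
    if filename == LANDING_FILE:
        return "landing"
    if filename == METADATA_FILE:
        return "metadata"
    # Matched the shared prefix but is not a file we care about: ignore it.
    return None
```

A lambda handler would call `route_s3_event(event)` first and return early on `None`, so unrelated objects under the same prefix cost only one short invocation.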

Filtering stories

Some stories are not meaningful and we want to exclude them. We considered filtering at the metadata.json generation phase; this is now part of the metadata.json logic, commit b6851c5.

Storing stories

We decided to store stories in their own "store", like s3://media-literacy-archives/{redacted}/stories/story-IDorTitle/title.html. Under that directory you can also store the story's parsing metadata. It can include the story's update history, the date it first showed up on the landing page, etc.
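As an illustration of this layout, here is a minimal Python sketch that builds the object keys for one story. Everything in it is an assumption made for the example: the function name `story_keys`, the `site_prefix` parameter (standing in for the redacted part of the path), the `metadata.json` file name, and the choice to percent-encode the ID or title so it stays a single path segment.

```python
from urllib.parse import quote


def story_keys(site_prefix: str, story_id_or_title: str) -> dict:
    """Build the S3 keys for one story under the separate stories "store".

    site_prefix stands in for the redacted part of the real path.
    The ID or title is percent-encoded so that characters such as "/"
    cannot split it into several path segments.
    """
    segment = quote(story_id_or_title, safe="")
    if not segment:
        raise ValueError("story_id_or_title must not be empty")
    story_dir = f"{site_prefix.strip('/')}/stories/{segment}"
    return {
        "page": f"{story_dir}/title.html",
        # Hypothetical file name for the story parsing metadata.
        "metadata": f"{story_dir}/metadata.json",
    }
```

For example, `story_keys("newssite", "Some Title/1")["page"]` gives `newssite/stories/Some%20Title%2F1/title.html`.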

- Metadata.json triggers skeleton computing env: completed by commit a62c33d
- Fetch story: Fetch stories (S3 event driven) #29 - e29fdc8
- Archive story in appropriate S3 dir
- Scraping courtesy
  - Parallelism. We may do IP inspection to make sure we're using different IPs, but parallelism probably already means unique IPs, so there is no need to confirm. Completed by d2b025e.
  - Randomize time duration. Completed by ba830cd.
    - A note: we may consider a pre-computed request distribution, perhaps with a step before the map in sfn that computes it and assigns each request a wait time. Of course, we may have to factor in the batch size.
    - This will summarize PR Fetch stories (S3 event driven) #29
- (optional) Add a final step in sfn for closure, reporting that all randomized work is done. It could provide some stats too.
- (optional) Clean up: remove the slack command for batch fetching stories
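The pre-computed request distribution mentioned in the note could be sketched as follows. This is not what commit ba830cd does; it is a hypothetical Python illustration of a step that runs before the map and assigns each story a random wait, grouped by batch. The function name `assign_wait_times` and all of its parameters are assumptions.

```python
import random
from typing import List, Optional


def assign_wait_times(
    story_count: int,
    batch_size: int,
    max_wait_seconds: float,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Pre-compute a random wait (in seconds) for every story, grouped by batch.

    Stories in one batch run in parallel, so each one gets its own
    independent wait drawn uniformly from [0, max_wait_seconds].
    """
    if story_count < 0 or batch_size <= 0 or max_wait_seconds < 0:
        raise ValueError("story_count and max_wait_seconds must be >= 0, batch_size > 0")
    rng = random.Random(seed)  # A seed makes the schedule reproducible for testing.
    waits = [round(rng.uniform(0, max_wait_seconds), 2) for _ in range(story_count)]
    # Slice the flat list into batches of at most batch_size items.
    return [waits[i:i + batch_size] for i in range(0, story_count, batch_size)]
```

Each map iteration would then wait its assigned duration before fetching, and a larger batch size means more requests share the same `[0, max_wait_seconds]` window.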

rivernews · 2022-09-17T08:58:43Z

We decided to use Sfn map for scraping courtesy. It is a bit more expensive, but we'll have more IP options.

rivernews · 2022-09-18T01:45:46Z

ba830cd marks the last of the requirements. The enhancements remain undone, but can be revisited later.

* temp store all
* remove go_poc
* upgrade so project runs on M1
* Try S3 notification
* Fix prefix to include newssite alias
* Fix aws lambda PathError issue
* Save to metadata.json complete
* add untitled stories in metadata.json
* rename stories function to landing_metadata
* rename batch stories fetch tf to metadata
* Improved metadata access s3 event
* Metadata.json trigger computing env
* read parse metadata.json
* fetch a story POC #24
* Sfn map parallism POC #24
* randomize requests
* Refactor to allow individual tf modules address #25 (comment)
* scaffold table
* draft table design
* create table
* Draining mechanism draft - identify all TODOs #25 (comment)
* Draft for put landing page; identified TODOs Issue: #25
* Complete tf surgery; Identify all TODOs in golang For #25
* fix compile error; progress in metadata cronjob add query
* Ready to test
* Fix db field first char not lowercase Tracked by #25 (comment)
* Fix permission of db index, S3 pull Tracked by #25 (comment)
* All tests complete Tracked by #25 (comment)
* Move landing PutItem out to s3 trigger lambda; ready for S3 batch move
* create reusable lambda module; optimize package size #25 (comment)
* Fix golang build path
* Refactor to use our custom lambda module
* add landing s3 trigger
* rm golang module stories that are renamed
* Fix env var
* Fix permission for PutItem move from landing to s3 trigger
* Fix metadata s3 trigger not fired
* Fix s3 trigger not working - S3 notification can only have one resource
* Make it easier to test
* prod grade setting enabled
* In Sfn pin lambda version, so rolling deploy works better for lambda
* Display sfn map result / target stories count info in finalizer
* stop landing s3 trigger from sending slack logs Fixes #40
* Let Sfn pin lambda version Fixes #39
* improve log for metadata trigger
* improve cronjob log
* log cronjob event for better understanding of how it get triggered
* Disable cronjob to better debug Fixes #43
* workaround to scale up our Sfn pipeline Fix #44
* improve log for landing S3 trigger
* re-enable prod config plus cronjob

rivernews mentioned this issue Aug 21, 2022

Fetch individual story pages #15

Closed

19 tasks

rivernews changed the title ~~Create landing metadata JSON trigger - invoke lambda - fetch stories~~ (2/3) Create landing metadata JSON trigger - invoke lambda - fetch stories Aug 21, 2022

rivernews added a commit that referenced this issue Sep 17, 2022

fetch a story POC

e29fdc8

#24

rivernews added a commit that referenced this issue Sep 17, 2022

Sfn map parallism POC

d2b025e

#24

rivernews closed this as completed Sep 18, 2022
