
deployer doesn't handle deleted documents or redirects #2224

Closed
escattone opened this issue Dec 19, 2020 · 15 comments · Fixed by #3117
Labels
p1 We will address this soon and will provide capacity from our team for it in the next few releases.

Comments

@escattone
Contributor

Sometimes, the content team will simply delete a document page without adding a redirect. We don't handle this case with our current deployer. We need to compute the difference between what's currently in S3 and what's incoming, and if something is in S3 but isn't in the incoming build, we need to delete it.
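
For illustration, a minimal sketch of that kind of diff (the bucket name, prefix, and build-directory layout are placeholder assumptions, not the deployer's actual code):

import boto3
from pathlib import Path

def compute_stale_keys(bucket, prefix, build_root):
    """Return the S3 keys under `prefix` that have no counterpart in the local build."""
    s3 = boto3.client("s3")

    # Everything currently in the bucket under the deployment prefix.
    existing = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            existing.add(obj["Key"])

    # Everything we're about to upload, derived from the freshly built files on disk.
    built = {
        prefix + path.relative_to(build_root).as_posix()
        for path in Path(build_root).rglob("*")
        if path.is_file()
    }

    # Whatever is in S3 but not in this build is a candidate for deletion.
    return existing - built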

@peterbe
Contributor

peterbe commented Jan 8, 2021

When working on the Elasticsearch publishing, I have a flag so I can either pass --update or omit it.
If you omit it, it will delete the index and re-create it. This is "bad" for those unlucky few who might be doing searches at that very moment. It might actually be left "broken" for a matter of (single-digit) minutes.
If you use --update instead, it just "cakes on", and anything that has since been deleted would still be in the index :(

One idea I had was to do something like this:

from datetime import date

if date.today().day != 1:
    command += ' --update'  # note the leading space when appending to the command string

so that once a month it omits the --update and thus reduces the probability of searching an empty, mid-update index by ~30x. :)

Another option would be to use the _redirects.txt files.
You parse all the URLs from the _redirects.txt files, and any URL in there that wasn't found as an index.json should now be deleted. So perhaps you could trigger a bulk deletion based on that subset. But that feels overly complicated.
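
A rough sketch of what that could look like, assuming each non-comment line of a _redirects.txt file is a tab-separated "from-URL to-URL" pair (the file format and function name here are my assumptions, not Yari's actual code):

from pathlib import Path

def redirect_source_urls(content_root):
    """Yield the "from" URL of every redirect found under content_root."""
    for redirects_file in Path(content_root).rglob("_redirects.txt"):
        for line in redirects_file.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            from_url, _, _to_url = line.partition("\t")
            yield from_url.strip().lower()

# Any URL yielded here that no longer has a built index.json would be a deletion candidate.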

@peterbe
Contributor

peterbe commented Jan 8, 2021

OK. I did a bit more research into this.
I opened two terminals. In one of them I was ready to run the poetry run deployer search-index command, which starts by deleting (and recreating) the index.
In the other terminal I execute a query. I start the re-indexing and quickly move over to the other terminal to do a search as soon as possible. It works! You get a lot fewer results, but you don't get errors.

It works because when you bulk index in Elasticsearch, there is still a refresh interval (1 second by default), which means you can basically start searching 1 second into the indexing.

One thing we could do is to make sure, right after having deleted and recreated the index, that we bulk index the most important documents first, i.e. the en-us/web folder. That means that things like sv-se/mozilla get reindexed last.
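
For the sake of illustration, prioritizing could be as simple as sorting the document paths before bulk indexing (the priority rules below are made up to match the example above, not the actual deployer logic):

def index_priority(doc_path):
    """Lower numbers get indexed first."""
    path = doc_path.lower()
    if path.startswith("en-us/web"):
        return 0  # the most important documents
    if path.startswith("en-us/"):
        return 1  # the rest of en-US
    return 2      # translated content, e.g. sv-se/mozilla, goes last

def sorted_for_indexing(doc_paths):
    return sorted(doc_paths, key=index_priority)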

@nschonni
Collaborator

Was looking for an "S3 rsync" and came across https://rclone.org/. It might be possible to use it to check and/or sync with the bucket to prune out the deleted files.

@peterbe
Contributor

peterbe commented Jan 13, 2021

mdn/content#1240 is the first time this has bitten us. That, plus the surprising effect of the _samples_/index.html files that stopped being generated.
Perhaps it's time to deal with this. There are two possible solutions:

  1. Use S3 lifecycle rules to have S3 automatically delete things that haven't been written/updated recently. This is risky because things like index.json files might not need to change for a looong time, and that looong time might be long enough for people to notice "Hey! This should have been deleted by now!"
  2. Do some sort of diff in the deployer that compares the files that existed before vs. those built now. The big challenge there is how you teach the deployer to recognize that "en-us/docs/mozilla/archive/xul/legacy" wasn't created this time because it wasn't supposed to be.

Both scare me.

Another potentially interesting solution is to move the onus for deleting onto the Yari CLI tool. Running yarn content delete ... could trigger two events: 1) a git rm ... and 2) an edit to a files/en-us/_deleted.txt file.

@escattone escattone added the p1 We will address this soon and will provide capacity from our team for it in the next few releases. label Jan 14, 2021
peterbe added a commit to peterbe/yari that referenced this issue Jan 19, 2021
@peterbe
Contributor

peterbe commented Jan 19, 2021

One challenge is that we currently don't build the archive every day. We build it "once in a blue moon". Technically, I don't think we've built it since the day of the migration.
So, if we ignore redirects for a moment, there are about 60,000 S3 keys: 10k that came from mdn/archived-content, 40k that come from mdn/translated-content, and 10k that came from mdn/content.
If the deployer first downloads a complete list of all known keys, it'll be a list of 60k keys. Then it uploads 10k en-US and 40k translated documents. Then it notices that there are 10k keys that were uploaded in the past but not this time!
One solution (which @fiji-flo suggested) is that we make an exception for archived content. We'd have to have a complete list of archived files for it to work, since there's some (active) content that has the same prefix (e.g. /Mozilla/Developer_guide/) as the archived content (e.g. /Mozilla/Bugzilla/).

Another solution is to always build all of the archive every day :)

@ghost

ghost commented Feb 14, 2021

FTR, is this the status code?
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/410

@peterbe
Contributor

peterbe commented Feb 22, 2021

By the way, I think it was discussed offline, but the blocker was archived content. If you download a list of all existing S3 uploads (roughly 60k of them) and then compare that with all the ones built this time (roughly 50k), how do you know which ones weren't built because they're considered archived? The solution is to read in and parse the archived.txt file.
E.g.

for file in get_all_files_in_s3():
    if is_built_now(file):
        # Keep because it got built
        continue
    if is_in_archived_txt(file):
        # Keep because it's earmarked as archived content
        continue
    s3.delete(file)
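
For illustration, is_in_archived_txt could be backed by parsing content/archived.txt once (assuming one file path per line; the actual format and implementation may differ):

from pathlib import Path

def parse_archived_txt_file(path):
    """Yield normalized file paths listed in archived.txt, one per non-empty line."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line.lower()

# Build the set once so the membership check in the prune loop above is O(1).
archived = set(parse_archived_txt_file("content/archived.txt"))

def is_in_archived_txt(file):
    # Assumes `file` is (or stringifies to) the S3 key / file path.
    return str(file).lower() in archived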

@peterbe peterbe assigned peterbe and unassigned escattone Mar 2, 2021
peterbe added a commit to peterbe/yari that referenced this issue Mar 3, 2021
escattone added a commit that referenced this issue Mar 11, 2021
* optionally delete from S3 what was NOT uploaded

Part of #2224

* Apply suggestions from code review

Co-authored-by: Ryan Johnson <[email protected]>

* more feedbacked

* rename the option properly

* python lint

* Revert "python lint"

This reverts commit 4bfd17c.

Co-authored-by: Ryan Johnson <[email protected]>
@peterbe
Contributor

peterbe commented Mar 16, 2021

We forgot to respect the archived files.
E.g. https://developer.allizom.org/en-US/docs/Mozilla/Projects/Rhino/Documentation should NOT have been deleted on Stage.

@peterbe peterbe reopened this Mar 16, 2021
@peterbe
Contributor

peterbe commented Mar 16, 2021

We forgot to respect the archived files.
E.g. https://developer.allizom.org/en-US/docs/Mozilla/Projects/Rhino/Documentation should NOT have been deleted on Stage.

Note: en-us/mozilla/projects/rhino/documentation/index.html is found in content/archived.txt, and we should take that into account.

@peterbe
Contributor

peterbe commented Mar 16, 2021

@escattone Can you add a headless test that checks that some archived URLs are still returning 200 OK?
It should have caught this for us on developer.allizom.org.
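
A minimal sketch of such a check (the URL list and the use of requests/pytest conventions are illustrative assumptions, not the actual headless test suite):

import requests

ARCHIVED_URLS = [
    "https://developer.allizom.org/en-US/docs/Mozilla/Projects/Rhino/Documentation",
]

def test_archived_urls_still_respond_200():
    for url in ARCHIVED_URLS:
        response = requests.get(url, timeout=30)
        assert response.status_code == 200, f"{url} returned {response.status_code}"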

peterbe added a commit to peterbe/yari that referenced this issue Mar 17, 2021
escattone added a commit that referenced this issue Mar 18, 2021
* --prune should leave archived S3 keys

Part of #2224

* Update deployer/src/deployer/upload.py

Co-authored-by: Ryan Johnson <[email protected]>

* Update deployer/src/deployer/upload.py

Co-authored-by: Ryan Johnson <[email protected]>

* parse_archived_txt_file as a generator

* only one 1 set

* add a nice warning about --prune without --archived-files

Co-authored-by: Ryan Johnson <[email protected]>
@peterbe
Contributor

peterbe commented Mar 22, 2021

> I deleted quite a few pages but they still show up

The only thing we haven't done yet is to actually enable it on Prod.

For the record, I copied your list of URLs, replaced mozilla.org with allizom.org, and checked each URL. They all 404 now.

@peterbe
Contributor

peterbe commented Mar 22, 2021

@escattone Would you mind doing the honors of adding --prune to prod-build.yml and kicking off a build? Once Florian's list is all 404'ing and the archived pages are still working, we can close this. ...finally :)

@escattone
Contributor Author

@peterbe Will do.

@escattone
Contributor Author

escattone commented Mar 22, 2021

@Elchi3 @peterbe All of the URLs @Elchi3 listed in #2224 (comment) now return 404 (after I invalidated the CDN cache for each of them). The CDN cache may still have unexpired content for other URLs that have been deleted, but all cached entries in the CDN expire after 24 hours, so at worst we'll have to wait 24 hours for those to return 404 as well. Wow, 45,985 keys were deleted (not all were docs), so some long-overdue housekeeping! (Total deleted keys: 45,985 -- https://github.com/mdn/yari/runs/2169933920?check_suite_focus=true).

@peterbe FYI, the archived content survived as expected (e.g., https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino/Documentation still works even after invalidating its cached entry -- and I also checked that its file, and others, still exist in S3).

peterbe added a commit to peterbe/yari that referenced this issue Jun 1, 2021
peterbe added a commit to peterbe/yari that referenced this issue Jun 1, 2021