
deployer doesn't handle deleted documents or redirects #2224

Closed
escattone opened this issue Dec 19, 2020 · 15 comments · Fixed by #3117
Labels
p1 We will address this soon and will provide capacity from our team for it in the next few releases.

Comments

@escattone
Contributor

Sometimes, the content team will simply delete a document page without adding a redirect. We don't handle this case with our current deployer. We need to compute the difference between what's currently in S3 and what's incoming, and if something is in S3 but isn't in the incoming build, we need to delete it.
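
For illustration, a minimal sketch of that kind of diff (the bucket name, prefix, and build-directory layout are placeholder assumptions, not the deployer's actual code):

import boto3
from pathlib import Path

def compute_stale_keys(bucket, prefix, build_root):
    """Return the S3 keys under `prefix` that have no counterpart in the local build."""
    s3 = boto3.client("s3")

    # Everything currently in the bucket under the deployment prefix.
    existing = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            existing.add(obj["Key"])

    # Everything we're about to upload, derived from the freshly built files on disk.
    built = {
        prefix + path.relative_to(build_root).as_posix()
        for path in Path(build_root).rglob("*")
        if path.is_file()
    }

    # Whatever is in S3 but not in this build is a candidate for deletion.
    return existing - built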

@peterbe
Contributor

peterbe commented Jan 8, 2021

When working on the Elasticsearch publishing, I have a flag so I can either pass --update or omit it.
If you omit it, it will delete the index and re-create it. This is "bad" for those unlucky few who might be doing searches at that very moment. It might actually be left "broken" for a matter of (single-digit) minutes.
If you use --update instead, it just "cakes on", and anything that has since been deleted would still be in the index :(

One idea I had was to do something like this:

from datetime import date

if date.today().day != 1:
    command += ' --update'  # note the leading space when appending to the command string

so that once a month it omits the --update and thus reduces the probability of searching an empty, mid-update index by ~30x. :)

Another option would be to use the _redirects.txt files.
You parse all the URLs from the _redirects.txt files, and any URL in there that wasn't found as an index.json should now be deleted. So perhaps you could trigger a bulk deletion based on that subset. But that feels overly complicated.
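
A rough sketch of what that could look like, assuming each non-comment line of a _redirects.txt file is a tab-separated "from-URL to-URL" pair (the file format and function name here are my assumptions, not Yari's actual code):

from pathlib import Path

def redirect_source_urls(content_root):
    """Yield the "from" URL of every redirect found under content_root."""
    for redirects_file in Path(content_root).rglob("_redirects.txt"):
        for line in redirects_file.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            from_url, _, _to_url = line.partition("\t")
            yield from_url.strip().lower()

# Any URL yielded here that no longer has a built index.json would be a deletion candidate.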

@peterbe
Contributor

peterbe commented Jan 8, 2021

OK. I did a bit more research into this.
I opened two terminals. In one of them I was ready to run the poetry run deployer search-index command, which starts by deleting (and recreating) the index.
In the other terminal I execute a query. I start the re-indexing and quickly move over to the other terminal to do a search as soon as possible. It works! You get a lot fewer results, but you don't get errors.

It works because when you bulk index in Elasticsearch, there is still a refresh interval (1 second by default), which means you can basically start searching 1 second into the indexing.

One thing we could do is to make sure, right after having deleted and recreated the index, that we bulk index the most important documents first, i.e. the en-us/web folder. That means that things like sv-se/mozilla get reindexed last.
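
For the sake of illustration, prioritizing could be as simple as sorting the document paths before bulk indexing (the priority rules below are made up to match the example above, not the actual deployer logic):

def index_priority(doc_path):
    """Lower numbers get indexed first."""
    path = doc_path.lower()
    if path.startswith("en-us/web"):
        return 0  # the most important documents
    if path.startswith("en-us/"):
        return 1  # the rest of en-US
    return 2      # translated content, e.g. sv-se/mozilla, goes last

def sorted_for_indexing(doc_paths):
    return sorted(doc_paths, key=index_priority)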

@nschonni
Collaborator

Was looking for an "S3 rsync" and came across https://rclone.org/. It might be possible to use it to check and/or sync with the bucket to prune out the deleted files.

@peterbe
Contributor

peterbe commented Jan 13, 2021

mdn/content#1240 is the first time this has bitten us. That, plus the surprising effect of the _samples_/index.html files that stopped being generated.
Perhaps it's time to deal with this. There are two possible solutions:

  1. Use S3 lifecycle rules to have S3 automatically delete things that haven't been written/updated recently. This is risky because things like index.json files might not need to change for a looong time, and that looong time might be long enough for people to notice "Hey! This should have been deleted by now!"
  2. Do some sort of diff in the deployer that compares the files that existed before vs. those built now. The big challenge there is how you teach the deployer to recognize that "en-us/docs/mozilla/archive/xul/legacy" wasn't created this time because it wasn't supposed to be.

Both scare me.

Another potentially interesting solution is to move the onus for deleting onto the Yari CLI tool. Running yarn content delete ... could trigger two events: 1) a git rm ... and 2) an edit to a files/en-us/_deleted.txt file.

@escattone escattone added the p1 We will address this soon and will provide capacity from our team for it in the next few releases. label Jan 14, 2021
peterbe added a commit to peterbe/yari that referenced this issue Jan 19, 2021
@peterbe
Contributor

peterbe commented Jan 19, 2021

One challenge is that we currently don't build the archive every day. We build it "once in a blue moon". Technically, I don't think we've built it since the day of the migration.
So, if we ignore redirects for a moment, there are about 60,000 S3 keys: 10k that came from mdn/archived-content, 40k that come from mdn/translated-content, and 10k that came from mdn/content.
If the deployer first downloads a complete list of all known keys, it'll be a list of 60k keys. Then it uploads 10k en-US and 40k translated documents. Then it notices that there are 10k keys that were uploaded in the past but not this time!
One solution (which @fiji-flo suggested) is that we make an exception for archived content. We'd have to have a complete list of archived files for it to work, since there's some (active) content that has the same prefix (e.g. /Mozilla/Developer_guide/) as the archived content (e.g. /Mozilla/Bugzilla/).

Another solution is to always build all of the archive every day :)

@ghost

ghost commented Feb 14, 2021

FTR, is this the status code?
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/410

@peterbe
Contributor

peterbe commented Feb 22, 2021

By the way, I think it was discussed offline, but the blocker was archived content. If you download a list of all existing S3 uploads (roughly 60k of them) and then compare that with all the ones built this time (roughly 50k), how do you know which ones weren't built because they're considered archived? The solution is to read in and parse the archived.txt file.
E.g.

for file in get_all_files_in_s3():
    if is_built_now(file):
        # Keep because it got built
        continue
    if is_in_archived_txt(file):
        # Keep because it's earmarked as archived content
        continue
    s3.delete(file)
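
For illustration, is_in_archived_txt could be backed by parsing content/archived.txt once (assuming one file path per line; the actual format and implementation may differ):

from pathlib import Path

def parse_archived_txt_file(path):
    """Yield normalized file paths listed in archived.txt, one per non-empty line."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line.lower()

# Build the set once so the membership check in the prune loop above is O(1).
archived = set(parse_archived_txt_file("content/archived.txt"))

def is_in_archived_txt(file):
    # Assumes `file` is (or stringifies to) the S3 key / file path.
    return str(file).lower() in archived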

@peterbe peterbe assigned peterbe and unassigned escattone Mar 2, 2021
peterbe added a commit to peterbe/yari that referenced this issue Mar 3, 2021
escattone added a commit that referenced this issue Mar 11, 2021
* optionally delete from S3 what was NOT uploaded

Part of #2224

* Apply suggestions from code review

Co-authored-by: Ryan Johnson <[email protected]>

* more feedbacked

* rename the option properly

* python lint

* Revert "python lint"

This reverts commit 4bfd17c.

Co-authored-by: Ryan Johnson <[email protected]>
@peterbe
Contributor

peterbe commented Mar 16, 2021

We forgot to respect the archived files.
E.g. https://developer.allizom.org/en-US/docs/Mozilla/Projects/Rhino/Documentation should NOT have been deleted on Stage.

@peterbe peterbe reopened this Mar 16, 2021
@peterbe
Contributor

peterbe commented Mar 16, 2021

We forgot to respect the archived files.
E.g. https://developer.allizom.org/en-US/docs/Mozilla/Projects/Rhino/Documentation should NOT have been deleted on Stage.

Note: en-us/mozilla/projects/rhino/documentation/index.html is found in content/archived.txt, and we should take that into account.

@peterbe
Contributor

peterbe commented Mar 16, 2021

@escattone Can you add a headless test that checks that some archived URLs are still returning 200 OK?
It should have caught this for us on developer.allizom.org.
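
A minimal sketch of such a check (the URL list and the use of requests/pytest conventions are illustrative assumptions, not the actual headless test suite):

import requests

ARCHIVED_URLS = [
    "https://developer.allizom.org/en-US/docs/Mozilla/Projects/Rhino/Documentation",
]

def test_archived_urls_still_respond_200():
    for url in ARCHIVED_URLS:
        response = requests.get(url, timeout=30)
        assert response.status_code == 200, f"{url} returned {response.status_code}"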

peterbe added a commit to peterbe/yari that referenced this issue Mar 17, 2021
escattone added a commit that referenced this issue Mar 18, 2021
* --prune should leave archived S3 keys

Part of #2224

* Update deployer/src/deployer/upload.py

Co-authored-by: Ryan Johnson <[email protected]>

* Update deployer/src/deployer/upload.py

Co-authored-by: Ryan Johnson <[email protected]>

* parse_archived_txt_file as a generator

* only one 1 set

* add a nice warning about --prune without --archived-files

Co-authored-by: Ryan Johnson <[email protected]>
@peterbe
Contributor

peterbe commented Mar 22, 2021

> I deleted quite a few pages but they still show up

The only thing we haven't done yet is to actually enable it on Prod.

For the record, I copied your list of URLs, replaced mozilla.org with allizom.org, and checked each URL. They all 404 now.

@peterbe
Contributor

peterbe commented Mar 22, 2021

@escattone Would you mind doing the honors of adding --prune to prod-build.yml and kicking off a build? Once Florian's list is all 404'ing and the archived pages are still working, we can close this. ...finally :)

@escattone
Contributor Author

@peterbe Will do.

@escattone
Contributor Author

escattone commented Mar 22, 2021

@Elchi3 @peterbe All of the URLs @Elchi3 listed in #2224 (comment) now return 404 (after I invalidated the CDN cache for each of them). The CDN cache may still have unexpired content for other URLs that have been deleted, but all cached entries in the CDN expire after 24 hours, so at worst we'll have to wait 24 hours for those to return 404 as well. Wow, 45,985 keys were deleted (not all were docs), so some long-overdue housekeeping! (Total deleted keys: 45,985 -- https://github.com/mdn/yari/runs/2169933920?check_suite_focus=true).

@peterbe FYI, the archived content survived as expected (e.g., https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino/Documentation still works even after invalidating its cached entry -- and I also checked that its file, and others, still exist in S3).

peterbe added a commit to peterbe/yari that referenced this issue Jun 1, 2021
peterbe added a commit to peterbe/yari that referenced this issue Jun 1, 2021