deployer doesn't handle deleted documents or redirects #2224
Comments
When working on the Elasticsearch publishing, I have a flag so I can either update the existing index in place or delete and recreate it from scratch. One idea I had was to schedule things so that once a month it omits the update flag and does a full rebuild. Another option would be to compare what's already indexed with what was just built and delete the leftovers.
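A minimal sketch of that once-a-month idea (the `--update` flag name and the helper itself are assumptions, not necessarily what the deployer actually exposes):

```python
# Hypothetical sketch of "once a month, do a full rebuild instead of an update".
# The flag name (--update) is an assumption.
import datetime


def search_index_flags(today=None):
    today = today or datetime.date.today()
    if today.day == 1:
        # Omit the update flag: delete and recreate the index from scratch.
        return []
    # Every other day: update the existing index in place.
    return ["--update"]
```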
OK, I did a bit more research into this. It works because when you bulk index into Elasticsearch, the index still has a refresh interval (1 second by default), which means you can basically run searches about 1 second after indexing. One thing we could do, right after having deleted and recreated the index, is to bulk index the most important documents first (e.g. the most popular ones).
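Roughly something like this, using the `elasticsearch` Python client; the index name, document shape, and the popularity field used to decide "most important" are all assumptions:

```python
# Rough sketch: after deleting and recreating the index, bulk-index the most
# important documents first so they become searchable (after the ~1 second
# refresh interval) before the long tail finishes indexing.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

INDEX = "mdn_docs"  # assumed index name
client = Elasticsearch("http://localhost:9200")  # assumed connection


def recreate_index():
    client.indices.delete(index=INDEX, ignore_unavailable=True)
    client.indices.create(index=INDEX)


def index_documents(docs):
    # docs: iterable of dicts with at least "slug", "title" and a
    # "popularity" score (the scoring field is an assumption).
    ordered = sorted(docs, key=lambda d: d.get("popularity", 0.0), reverse=True)
    actions = (
        {"_index": INDEX, "_id": doc["slug"], "_source": doc} for doc in ordered
    )
    for ok, item in streaming_bulk(client, actions):
        if not ok:
            print("Failed to index:", item)
```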
I was looking for an "S3 rsync" and came across https://rclone.org/. It might be possible to check against and/or sync with the bucket to prune out the deleted files.
mdn/content#1240 is the first time this has actually bitten us. That, plus the surprising side effects it can have, both scare me.

Another potentially interesting solution is to move the onus onto the Yari CLI tool for deleting: it could add a row to a file that records each deletion, which the deployer could then act on.
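Purely as a hypothetical illustration of that manifest idea (the file name and record format are made up, not anything Yari actually uses):

```python
# Hypothetical illustration only: a made-up manifest that a CLI "delete"
# command could append to, so the deployer later knows which documents
# were intentionally removed.
import csv
import datetime
from pathlib import Path

MANIFEST = Path("_deleted-documents.csv")  # made-up file name


def record_deletion(locale, slug):
    new_file = not MANIFEST.exists()
    with MANIFEST.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "locale", "slug"])
        writer.writerow([datetime.date.today().isoformat(), locale, slug])
```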
One challenge is that we currently don't build the archive every day. We build it "once in a blue moon"; technically, I don't think we've built it since the day of the migration. Another solution is to always build all of the archive every day :)
By the way, I think it was discussed offline, but the blocker was archived content. If you download a list of all existing S3 uploads (roughly 60k of them) and compare that with all the ones built this time (roughly 50k), how do you know which ones weren't built because they're considered archived? The solution is to read in and parse the archived .txt file.
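Something along these lines, assuming the archived list is a plain-text file of URIs (one per line) and that archived documents can be matched against S3 keys by prefix; both are assumptions about the exact formats:

```python
# Sketch: parse the archived list, then decide which existing S3 keys can be
# pruned. Keys that were just uploaded, or that belong to archived documents,
# are kept.
def parse_archived_txt_file(path):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line.lower()


def keys_to_prune(existing_s3_keys, uploaded_keys, archived_uris):
    archived = tuple(archived_uris)
    for key in existing_s3_keys:
        if key in uploaded_keys:
            continue  # it was (re)built and uploaded this time
        if key.lower().startswith(archived):
            continue  # archived: never built, but must not be deleted
        yield key
```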
* optionally delete from S3 what was NOT uploaded (Part of #2224)
* Apply suggestions from code review (Co-authored-by: Ryan Johnson <[email protected]>)
* more feedbacked
* rename the option properly
* python lint
* Revert "python lint" (this reverts commit 4bfd17c)

Co-authored-by: Ryan Johnson <[email protected]>
We forgot to respect the archived files.
@escattone Can you add a headless test that checks that some archived URLs are still returning 200 OK?
* --prune should leave archived S3 keys (Part of #2224)
* Update deployer/src/deployer/upload.py (Co-authored-by: Ryan Johnson <[email protected]>)
* Update deployer/src/deployer/upload.py (Co-authored-by: Ryan Johnson <[email protected]>)
* parse_archived_txt_file as a generator
* only one 1 set
* add a nice warning about --prune without --archived-files

Co-authored-by: Ryan Johnson <[email protected]>
The only thing we haven't done yet is to actually enable it on Prod. For the record, I copied your list of URLs, replaced the hostname, and checked them.
@escattone Would you mind doing the honors of adding --prune to the Prod deployment?
@peterbe Will do.
@Elchi3 @peterbe All of the URLs @Elchi3 listed in #2224 (comment) now return 404 (after I invalidated the CDN cache for each of them). The CDN cache may still have unexpired content for other URLs that have been deleted, but all cached entries in the CDN expire after 24 hours, so at worst we'll have to wait 24 hours for those to start returning 404.

Wow, 45,985 keys were deleted (not all were docs), so this was some long overdue housekeeping!

@peterbe FYI, the archived content survived as expected: for example, https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino/Documentation still works even after invalidating its cached entry, and I also checked that its file, and others, still exist in S3.
Sometimes, the content team will simply delete a document page without adding a redirect. Our current deployer doesn't handle this case. We need to compute the difference between what's currently in S3 and what's incoming, and if something is in S3 but isn't incoming, we need to delete it.
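A minimal sketch of that diff with boto3; the bucket, prefix, and the idea that the deployer collects the keys it just uploaded into a set are placeholders, not the deployer's actual implementation:

```python
# Minimal sketch: list every key currently in the bucket, compare with the
# set of keys that were just uploaded, and delete the leftovers.
import boto3


def prune_deleted_documents(bucket, uploaded_keys, prefix=""):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    to_delete = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"] not in uploaded_keys:
                to_delete.append({"Key": obj["Key"]})

    # delete_objects accepts at most 1000 keys per request.
    for i in range(0, len(to_delete), 1000):
        s3.delete_objects(
            Bucket=bucket, Delete={"Objects": to_delete[i : i + 1000]}
        )
    return len(to_delete)
```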