fix: lower 404 ttl to decrease end user failures #1344
Merged
TLDR
Amazon CloudFront HTTP caching of false-negative CID lookups is DoS-ing all Saturn L1s that use Lassie.
Context for users trying to access content via ipfs.io gateway
Any hiccup in content routing for a CID is cached for 5 minutes. During that window no L1 can retrieve the CID, so the gateway can't return it.
Context for users (developers) running their own IPFS node / trying IPFS for the first time
Proposed Changes
Lowering the CloudFront cache TTL for 404 errors to 5s will fix false-negative content routing errors for end users.
It should still protect against unwanted load spikes, but end users will be able to refresh the page without waiting 5 minutes to see their content.
Happy to discuss other values, but 5 minutes is way too high:
The majority of users won't wait 5 minutes and retry; they will just give up on IPFS.
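To make the effect of the 404 TTL concrete, here is a toy model of an edge cache that stores 404 responses for a fixed TTL. It is a sketch under stated assumptions, not CloudFront's actual implementation: the class and function names are hypothetical, 200 responses are passed through uncached so only negative caching is visible, the origin has a 1-second routing hiccup at t=0, and the client retries once per second.

```python
from dataclasses import dataclass, field


@dataclass
class NegativeCachingCDN:
    """Hypothetical toy edge cache that stores 404 responses for ttl_s seconds.

    200 responses are passed through uncached so that only the
    negative-caching behaviour is visible.
    """

    ttl_s: float
    # CID -> time at which the cached 404 expires
    _cached_404_until: dict[str, float] = field(default_factory=dict)

    def get(self, cid: str, now: float, origin_has_it: bool) -> int:
        expires = self._cached_404_until.get(cid)
        if expires is not None and now < expires:
            return 404  # answered from cache; the origin is never asked
        if origin_has_it:
            self._cached_404_until.pop(cid, None)
            return 200
        # Origin said "not found" (possibly a false negative): cache it for the TTL.
        self._cached_404_until[cid] = now + self.ttl_s
        return 404


def seconds_until_first_200(ttl_s: float, origin_recovers_at: float,
                            retry_every_s: float = 1.0) -> float:
    """Time at which a client retrying every retry_every_s seconds first sees a 200."""
    cdn = NegativeCachingCDN(ttl_s)
    now = 0.0
    while True:
        status = cdn.get("example-cid", now, origin_has_it=now >= origin_recovers_at)
        if status == 200:
            return now
        now += retry_every_s


if __name__ == "__main__":
    # The origin returns a false-negative 404 at t=0 and recovers at t=1.
    for ttl in (300, 5):
        first_ok = seconds_until_first_200(ttl, origin_recovers_at=1.0)
        print(f"404 TTL {ttl:>3}s -> first 200 at t={first_ok:.0f}s")
```

In this model the user's wait after a transient false negative is bounded by the 404 TTL rather than by how quickly content routing recovers (300s versus 5s here), while the origin still sees at most one request per CID per TTL window as long as it keeps answering 404.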
Tests
Announcing a single block on the DHT and then asking the indexer for it sometimes produces a 404. That 404 is then cached for 5 minutes, artificially breaking content routing resolution for that CID on Rhea.
Revert Strategy
You can always undo this one-line change 🤷