NPE-475: Fix AWS instance credential failures leaving Chore in an unrecoverable state #71
Via @obrie:
https://jira.unity3d.com/browse/NPE-475
Recently we've been receiving the following error occasionally in production:
Aws::Errors::MissingCredentialsError: unable to sign request without credentials set
This happens when we fail to receive instance profile credentials. The problem that we're experiencing is that this is unrecoverable. The AWS client has gotten into a state where it will never attempt to retrieve proper instance profile credentials.
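For illustration only (this snippet is not from the PR), the failure mode looks roughly like this: once an SQS client has been built without having obtained instance profile credentials, every subsequent call on that client fails the same way.

```ruby
# Rough illustration of the failure mode described above (not code from this PR).
require 'aws-sdk-sqs' # or the monolithic aws-sdk v2 gem, depending on the app

# Built on an instance where the metadata service couldn't be reached,
# so no instance profile credentials were resolved.
sqs = Aws::SQS::Client.new(region: 'us-east-1')

begin
  sqs.get_queue_url(queue_name: 'my-queue')
rescue Aws::Errors::MissingCredentialsError => e
  # The client never re-attempts credential resolution, so retrying this call
  # on the same client raises the same error until the process is restarted.
  warn "unable to sign request: #{e.message}"
end
```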
Solution
The solution being proposed here involves 2 changes:

1. We've changed the defaults for retrieving instance profile credentials to 5-second intervals with up to 5 attempts. This gives us about 30 seconds of retries before chore will shut down (see the sketch after this list).
2. Previously, chore would only shut down when we encountered an unrecoverable error such as Aws::SQS::Errors::NonExistentQueue. In this proposal, chore also shuts down if we're unable to access the queue after all instance profile credential retries and standard HTTP retries have been exhausted. In other words, if we fail to get credentials or fail to look up the queue on startup, chore will shut down. While we could theoretically have chore reset the connection and keep retrying, the proposal is to keep that retry logic out of chore and rely on the operating system to restart chore via the standard application/pid monitoring that's already in place (upstart/systemd/etc.).
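A minimal sketch of that behavior (not the actual diff from this PR): verify_connection! and the error classes are named in this description, while lookup_queue! is a hypothetical stand-in for whatever the consumer calls to resolve credentials and the queue.

```ruby
CREDENTIAL_RETRY_ATTEMPTS = 5
CREDENTIAL_RETRY_INTERVAL = 5 # seconds

def verify_connection!
  attempts = 0
  begin
    attempts += 1
    lookup_queue! # hypothetical: resolves credentials and the SQS queue URL
  rescue Aws::SQS::Errors::NonExistentQueue
    raise # unrecoverable, exactly as before this change: shut down immediately
  rescue Aws::Errors::MissingCredentialsError, Aws::Errors::ServiceError
    if attempts < CREDENTIAL_RETRY_ATTEMPTS
      sleep(CREDENTIAL_RETRY_INTERVAL) # roughly 30 seconds of retries in total
      retry
    end
    # Retries exhausted: re-raise so chore shuts down and upstart/systemd
    # restarts the process, instead of retrying forever inside chore.
    raise
  end
end
```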
Testing
It's a little challenging to reproduce the errors we see in production. I've attempted to get an instance throttled by spinning up thousands of threads that attempt to grab instance profile credentials and look up the queue, but I haven't been able to get AWS to throttle me.
The only testing we can do is to kill a network connection and observe the impact on chore's ability to start up.
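For what it's worth, a rough spec-level approximation (assuming WebMock, which is not mentioned in this PR) would be to make the EC2 metadata endpoint unreachable and assert that startup fails once the retries are exhausted. `consumer` here is a stand-in for however the SQS consumer gets built in specs.

```ruby
require 'webmock/rspec'

RSpec.describe 'consumer startup without instance credentials' do
  before do
    # Simulate the network being down for the EC2 metadata service.
    stub_request(:any, %r{\Ahttp://169\.254\.169\.254}).to_timeout
  end

  it 'fails to verify the connection once credential retries are exhausted' do
    expect { consumer.verify_connection! }
      .to raise_error(Aws::Errors::MissingCredentialsError)
  end
end
```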
Error tracing
Note that, with this change, exceptions that cause chore to shut down will still show up in NewRelic. This is because we've kept the connection verification logic inside the handle_messages method, which is what chore-new_relic wraps and tracks exceptions for.

Previously, we had implemented and played around with verify_connection! being invoked either from #initialize or from the object that creates the Consumer. However, this would hide errors from NewRelic and additionally produced complexity in specs. As a result, I felt it was safest to keep the logic where it was and just add better control over how errors should behave based on what we were doing within #handle_messages.
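The resulting shape is roughly the following (assumed for illustration, not the actual code), which is why the NewRelic wrapping still sees the failure:

```ruby
# chore-new_relic wraps handle_messages, so anything raised here -- including
# a verification failure after all retries -- is still reported to NewRelic
# before chore shuts down.
def handle_messages(&block)
  unless @connection_verified
    verify_connection! # raises once credential/queue retries are exhausted
    @connection_verified = true
  end
  # ... fetch messages from SQS and yield them to the block as usual ...
end
```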
Links
Relevant links: