NPE-475: Fix AWS instance credential failures leaving Chore in an unrecoverable state #71
Via @obrie:
https://jira.unity3d.com/browse/NPE-475
Recently we've been receiving the following error occasionally in production:
Aws::Errors::MissingCredentialsError: unable to sign request without credentials set
This happens when we fail to receive instance profile credentials. The problem that we're experiencing is that this is unrecoverable. The AWS client has gotten into a state where it will never attempt to retrieve proper instance profile credentials.
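For illustration only (this snippet is not from the PR), the failure mode looks roughly like this: once an SQS client has been built without having obtained instance profile credentials, every subsequent call on that client fails the same way.

```ruby
# Rough illustration of the failure mode described above (not code from this PR).
require 'aws-sdk-sqs' # or the monolithic aws-sdk v2 gem, depending on the app

# Built on an instance where the metadata service couldn't be reached,
# so no instance profile credentials were resolved.
sqs = Aws::SQS::Client.new(region: 'us-east-1')

begin
  sqs.get_queue_url(queue_name: 'my-queue')
rescue Aws::Errors::MissingCredentialsError => e
  # The client never re-attempts credential resolution, so retrying this call
  # on the same client raises the same error until the process is restarted.
  warn "unable to sign request: #{e.message}"
end
```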
Solution
The solution being proposed here involves 2 changes:

1. We've changed the defaults for retrieving instance profile credentials to 5-second intervals with up to 5 attempts. This gives us about 30 seconds of retries before chore will shut down (see the sketch after this list).
2. Previously, chore would only shut down when we encountered an unrecoverable error such as Aws::SQS::Errors::NonExistentQueue. In this proposal, chore also shuts down if we're unable to access the queue after all instance profile credential retries and standard HTTP retries have been exhausted. In other words, if we fail to get credentials or fail to look up the queue on startup, chore will shut down. While we could theoretically have chore reset the connection and keep retrying, the proposal is to keep that retry logic out of chore and rely on the operating system to restart chore via the standard application/pid monitoring that's already in place (upstart/systemd/etc.).
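A minimal sketch of that behavior (not the actual diff from this PR): verify_connection! and the error classes are named in this description, while lookup_queue! is a hypothetical stand-in for whatever the consumer calls to resolve credentials and the queue.

```ruby
CREDENTIAL_RETRY_ATTEMPTS = 5
CREDENTIAL_RETRY_INTERVAL = 5 # seconds

def verify_connection!
  attempts = 0
  begin
    attempts += 1
    lookup_queue! # hypothetical: resolves credentials and the SQS queue URL
  rescue Aws::SQS::Errors::NonExistentQueue
    raise # unrecoverable, exactly as before this change: shut down immediately
  rescue Aws::Errors::MissingCredentialsError, Aws::Errors::ServiceError
    if attempts < CREDENTIAL_RETRY_ATTEMPTS
      sleep(CREDENTIAL_RETRY_INTERVAL) # roughly 30 seconds of retries in total
      retry
    end
    # Retries exhausted: re-raise so chore shuts down and upstart/systemd
    # restarts the process, instead of retrying forever inside chore.
    raise
  end
end
```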
Testing
It's a little challenging to reproduce the errors we see in production. I've attempted to get an instance throttled by spinning up thousands of threads that attempt to grab instance profile credentials and look up the queue, but I haven't been able to get AWS to throttle me.
The only testing we can do is to kill a network connection and observe the impact on chore's ability to start up.
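For what it's worth, a rough spec-level approximation (assuming WebMock, which is not mentioned in this PR) would be to make the EC2 metadata endpoint unreachable and assert that startup fails once the retries are exhausted. `consumer` here is a stand-in for however the SQS consumer gets built in specs.

```ruby
require 'webmock/rspec'

RSpec.describe 'consumer startup without instance credentials' do
  before do
    # Simulate the network being down for the EC2 metadata service.
    stub_request(:any, %r{\Ahttp://169\.254\.169\.254}).to_timeout
  end

  it 'fails to verify the connection once credential retries are exhausted' do
    expect { consumer.verify_connection! }
      .to raise_error(Aws::Errors::MissingCredentialsError)
  end
end
```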
Error tracing
Note that, with this change, exceptions that cause chore to shut down will still show up in NewRelic. This is because we've kept the connection verification logic inside the handle_messages method, which is what chore-new_relic wraps and tracks exceptions for.

Previously, we had implemented and played around with verify_connection! being invoked either from #initialize or from the object that creates the Consumer. However, this would hide errors from NewRelic and additionally produced complexity in specs. As a result, I felt it was safest to keep the logic where it was and just add better control over how errors should behave based on what we were doing within #handle_messages.
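The resulting shape is roughly the following (assumed for illustration, not the actual code), which is why the NewRelic wrapping still sees the failure:

```ruby
# chore-new_relic wraps handle_messages, so anything raised here -- including
# a verification failure after all retries -- is still reported to NewRelic
# before chore shuts down.
def handle_messages(&block)
  unless @connection_verified
    verify_connection! # raises once credential/queue retries are exhausted
    @connection_verified = true
  end
  # ... fetch messages from SQS and yield them to the block as usual ...
end
```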
Links
Relevant links: