NPE-475: Fix AWS instance credential failures leaving Chore in an unrecoverable state #71

Merged: 3 commits, Oct 29, 2024

@alanbrent alanbrent commented Oct 24, 2024

Via @obrie :

https://jira.unity3d.com/browse/NPE-475

Recently we've occasionally been receiving the following error in production: Aws::Errors::MissingCredentialsError: unable to sign request without credentials set

This happens when we fail to receive instance profile credentials. The problem that we're experiencing is that this is unrecoverable. The AWS client has gotten into a state where it will never attempt to retrieve proper instance profile credentials.

Solution

The solution being proposed here involves 2 changes:

  1. Allow instance profile credential requests to be retried. The default is no retries. This means that if the service is temporarily unavailable or we've hit some form of throttling error, the client becomes unrecoverable.

We've changed the defaults to 5-second intervals with up to 5 attempts. This gives us about 30 seconds of retries before chore will shut down.

  2. Any issue with attempting to access the queue metadata on startup is considered a permanent failure and will shut down chore.

Previously, chore would only shut down when we encountered an error like Aws::SQS::Errors::NonExistentQueue (an unrecoverable error). In this current proposal, we're suggesting that chore should shut down if we're unable to access the queue after all instance profile credential retries and standard http retries have been exhausted. This means that if we fail to get credentials or fail to look up the queue on startup, chore will shut down.

While we theoretically could have chore reset the connection and keep retrying, the proposal is that we keep that retry logic out of chore and rely on the operating system to restart chore based on the standard application/pid monitoring logic that's in place via upstart/systemd/etc.
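To illustrate the two changes together, here is a minimal sketch in plain Ruby of a bounded, fixed-interval retry followed by a fail-fast startup check. It is not the code from this PR and does not use the AWS SDK: the names `fetch_with_retries`, `verify_on_startup`, and `CredentialsUnavailable` are hypothetical, and the injectable `sleeper` exists only so the sketch can run without real delays.

```ruby
# Hypothetical error raised once the retry budget is exhausted.
class CredentialsUnavailable < StandardError; end

# Runs the block up to `attempts` times, pausing `interval` seconds
# between failed attempts. Raises CredentialsUnavailable when every
# attempt has failed.
def fetch_with_retries(attempts: 5, interval: 5, sleeper: ->(s) { sleep(s) })
  failures = 0
  begin
    yield
  rescue StandardError => e
    failures += 1
    if failures >= attempts
      raise CredentialsUnavailable, "gave up after #{failures} attempts: #{e.message}"
    end
    sleeper.call(interval)
    retry
  end
end

# Startup check: any failure left over after the retries is treated as
# permanent. A real process would exit non-zero at the :shutdown point
# and leave the restart to upstart/systemd; the sketch returns a symbol
# so it can be exercised in isolation.
def verify_on_startup(**opts, &fetch)
  fetch_with_retries(**opts, &fetch)
  :ready
rescue CredentialsUnavailable => e
  warn "shutting down: #{e.message}"
  :shutdown
end
```

The design choice mirrored here is that the process itself never loops forever: it spends a fixed retry budget and then hands recovery to the process supervisor.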

Testing

It's a little challenging to reproduce the errors in production. I've attempted to get an instance throttled by spinning up thousands of threads that attempt to grab instance profile credentials and look up the queue, but I haven't been able to get AWS to throttle me.

The only testing we can do is to kill a network connection and observe the impact on chore's ability to start up.
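As a rough local stand-in for killing the network connection, the sketch below points an HTTP request at a port with nothing listening and reports which error class surfaces. It uses only Ruby's standard library; the helper names `unused_local_port` and `probe_endpoint` are hypothetical, and a refused local connection is only an approximation of an unreachable instance metadata service.

```ruby
require 'socket'
require 'net/http'

# Asks the OS for a free port, then releases it so that nothing is
# listening there when we try to connect.
def unused_local_port
  server = TCPServer.new('127.0.0.1', 0)
  port = server.addr[1]
  server.close
  port
end

# Attempts a GET against host:port with short timeouts. Returns
# :reachable on success, or the class of the network error raised.
def probe_endpoint(host, port, timeout: 1)
  http = Net::HTTP.new(host, port, nil) # nil disables any proxy from ENV
  http.open_timeout = timeout
  http.read_timeout = timeout
  http.get('/')
  :reachable
rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH,
       Net::OpenTimeout, Net::ReadTimeout, SocketError => e
  e.class
end
```

Calling `probe_endpoint('127.0.0.1', unused_local_port)` shows the kind of low-level error a startup check would have to treat as fatal once retries are exhausted.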

Error tracing

Note that, with this change, exceptions that cause chore to shut down will still show up in NewRelic. This is because we've kept the connection verification logic inside the handle_messages method, which is what chore-new_relic wraps and tracks exceptions for.

Previously, we had experimented with invoking verify_connection! either from #initialize or from the object that creates the Consumer. However, this hid errors from NewRelic and added complexity to the specs. As a result, I felt it was safest to keep the logic where it was and just add better control over how errors should behave based on what we were doing within #handle_messages.
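The reasoning above can be sketched in plain Ruby: a wrapper prepended onto handle_messages only sees exceptions raised inside that method, so verifying the connection there keeps failures visible to it. This is an illustration only, not chore's or chore-new_relic's actual code; `ErrorTracking`, `SketchConsumer`, and `SketchConnectionError` are hypothetical names.

```ruby
# Hypothetical stand-in for an instrumentation layer that wraps
# handle_messages and records whatever it raises.
module ErrorTracking
  def self.reported
    @reported ||= []
  end

  def handle_messages(*args, &block)
    super
  rescue StandardError => e
    ErrorTracking.reported << e
    raise
  end
end

class SketchConnectionError < StandardError; end

# Hypothetical consumer that verifies its connection lazily, on the
# first call to handle_messages, rather than in #initialize.
class SketchConsumer
  prepend ErrorTracking

  def initialize(connection_ok:)
    @connection_ok = connection_ok
    @verified = false
  end

  def handle_messages
    verify_connection! unless @verified
    :polling
  end

  private

  def verify_connection!
    raise SketchConnectionError, 'unable to access queue' unless @connection_ok
    @verified = true
  end
end
```

Had `verify_connection!` been called from `#initialize`, the error would be raised before the wrapped method ever ran, and the tracking module would record nothing.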

Links

Relevant links:

@obrie obrie force-pushed the brent/creds-errors branch from 7c2c181 to 8680f00 Compare October 29, 2024 15:33
@obrie obrie changed the title Verify connection on startup NPE-495: Fix AWS instance credential failures leaving Chore in an unrecoverable state Oct 29, 2024
@obrie obrie marked this pull request as ready for review October 29, 2024 15:35
@obrie obrie requested a review from a team as a code owner October 29, 2024 15:35
@obrie obrie force-pushed the brent/creds-errors branch from 8680f00 to f3cbfc5 Compare October 29, 2024 15:54
@obrie obrie changed the title NPE-495: Fix AWS instance credential failures leaving Chore in an unrecoverable state NPE-475: Fix AWS instance credential failures leaving Chore in an unrecoverable state Oct 29, 2024
@obrie obrie merged commit 41447c3 into master Oct 29, 2024
8 checks passed
@obrie obrie deleted the brent/creds-errors branch October 29, 2024 17:38