-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make failed job acquisitions return a specific exit code (27) #2762
Conversation
7a099ea
to
c907663
Compare
// specific exit code so that the caller can know that this job can't be acquired. | ||
|
||
const acquisitionFailedExitCode = 27 // chosen by fair dice roll | ||
return cli.NewExitError(err, acquisitionFailedExitCode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish I could find a better way to do this, It feels like the job of looking up an error and translating it to an exit code should be simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mean, if the return value of a cli.Command
is an ExitCoder
(ie, fulfils interface { ExitCode() int }
, this would happen automatically, but i actually think that that would be worse than what we have here - it would mean pulling a concern of the CLI into the agent
package. i think this is fine as is for now, and if we end up checking a whole bunch of error types, we can refactor as necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agreed, that's the conclusion I came to, but it was niggling at me
c907663
to
de3e8d1
Compare
// If the agent tried to acquire a job, but it couldn't because the job was already taken, we should exit with a | ||
// specific exit code so that the caller can know that this job can't be acquired. | ||
|
||
const acquisitionFailedExitCode = 27 // chosen by fair dice roll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a list somewhere of the other exit codes we have? i.e. aside from 0 and 1 — would 27 be the first "other one" we introduce?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, AFAICT, we only ever return 0 or 1. the reason i didn't go with 2
is that it implies that this is the second most important exit code. because it's one of two non-zero exit codes, it is the second-most important exit code by definition, but were we to add a third, this would likely not be the case.
regardless, potentially it's a good idea to hoist this const definition out from this function and put it somewhere central-ish in the clicommand
package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
Description
When the agent is running in acquire mode - that is, by specifying a job UUID for the agent to pick up, "inverting" the normal dispatch process - sometimes, the job it's been instructed to pick up in unavailable. Perhaps it's already running on another agent, or maybe the job was cancelled, but made its way to the agent anyway.
In these situations, the agent prints the error to stdout, and returns an exit code of
1
, indicating failure.However, acquisition can fail for both recoverable- and non-recoverable reasons, and the agent doesn't differentiate between these in its status code. Users can grep through the logs for things unrecoverable-sounding things, but this is way more work than it needs to be.
This PR makes it so that if acquiring the job fails in such a way that the error is unrecoverable, the agent will return a status code of
27
(chosen arbitrarily, because it's a nice number). This allows consumers of the agent to not try to pick up that job again.Context
COMP-332
https://docs.google.com/document/d/1qjIXw2gm88iiDggQAKVMQx0qwKntPjtlA4pW84om3xw/edit
Slack convo w/ Namespace
Testing
go test ./...
). Buildkite employees may check this if the pipeline has run automatically.go fmt ./...
)