
Scaling is possible during deployment with bad results #23444

Closed
lattwood opened this issue Jun 26, 2024 · 5 comments · Fixed by #23457
Labels: theme/autoscaling, theme/deployments, type/bug

@lattwood
Contributor

lattwood commented Jun 26, 2024

Super quick tl;dr

Evaluations have a different sort order when sorted by CreateIndex vs CreateTime, and this impacts deployments of jobs that are also being scaled.

Nomad version

~# nomad version
Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

Relevant: we're using a shell script to query Nomad and then run various nomad job scale commands. The script also checks whether a deployment is already in progress before scaling, but as you'll soon see, given the timeframes involved that check wouldn't have helped.
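
For illustration, a minimal sketch of the kind of thing the script does (the job and group names are hypothetical, and the deployment check shown is only an approximation of our actual logic):

#!/usr/bin/env bash
set -euo pipefail

JOB="JOB-NAME"      # hypothetical job name
GROUP="workers"     # hypothetical task group
COUNT="$1"          # desired count, computed by our autoscaling logic

# Skip scaling if the job's latest deployment still appears to be running.
if nomad job status "$JOB" | grep -A 3 'Latest Deployment' | grep -q 'running'; then
  echo "deployment in progress, skipping scale" >&2
  exit 0
fi

nomad job scale "$JOB" "$GROUP" "$COUNT"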

Issue

When performing a deployment of a job that is constantly being scaled, instead of the scale attempts failing with Nomad's job scaling blocked due to active deployment error, spooky things happen.

This is the spooky thing.

==> 2024-06-26T15:07:49Z: Monitoring evaluation "46e24c94"
    2024-06-26T15:07:49Z: Evaluation triggered by job "JOB-NAME"
    2024-06-26T15:07:50Z: Evaluation status changed: "pending" -> "canceled"
==> 2024-06-26T15:07:50Z: Evaluation "46e24c94" finished with status "canceled"

I go and take a look at the evaluation it created.

> nomad eval status -json 46e24c94
{
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 87987021,
    "CreateTime": 1719414468345035507,
    "DeploymentID": "",
    "EscapedComputedClass": false,
    "FailedTGAllocs": null,
    "ID": "46e24c94-b015-0f3e-4374-ae5ffe59d202",
    "JobID": "JOB-NAME",
    "JobModifyIndex": 87987021,
    "ModifyIndex": 87987127,
    "ModifyTime": 1719414469680946947,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": null,
    "QuotaLimitReached": "",
    "RelatedEvals": null,
    "SnapshotIndex": 0,
    "Status": "canceled",
    "StatusDescription": "canceled after more recent eval was processed",
    "TriggeredBy": "job-register",
    "Type": "service",
    "Wait": 0,
    "WaitUntil": null
}

Nothing out of the ordinary, save for the NextEval field being empty (it would be nice if that were populated in this case). DeploymentID is empty too, but that might be expected with job-register. Anyway, armed with the evaluation's creation timestamp (CreateTime), I go into the Nomad UI, find the evaluation directly above it in the list, and look at that.

> nomad eval status -json 42842e2e
{
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 87987065,
    "CreateTime": 1719414468130257405,
    "DeploymentID": "7ccedd16-93b1-ac73-dd43-54fd218e0e58",
    "EscapedComputedClass": false,
    "FailedTGAllocs": null,
    "ID": "42842e2e-f6c5-79e9-f2c2-bbb9fb867443",
    "JobID": "JOB-NAME",
    "JobModifyIndex": 0,
    "ModifyIndex": 87987127,
    "ModifyTime": 1719414469680944024,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": null,
    "QuotaLimitReached": "",
    "RelatedEvals": null,
    "SnapshotIndex": 0,
    "Status": "canceled",
    "StatusDescription": "canceled after more recent eval was processed",
    "TriggeredBy": "deployment-watcher",
    "Type": "service",
    "Wait": 0,
    "WaitUntil": null
}

Again, nothing out of the ordinary, right? Sadly, no.

Key            46e24c94              42842e2e              Delta
CreateIndex    87987021              87987065              +44
CreateTime     1719414468345035507   1719414468130257405   -214778102 ns
Calendar Time  3:07:48.345           3:07:48.130           ≈ -215 ms

An evaluation created at T+0 has a greater Raft index than another evaluation created at T+215ms.
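
(For anyone following along, the calendar times above come from the seconds portion of the nanosecond CreateTime values; with GNU date, for example:)

date -u -d @1719414468
# Wed Jun 26 15:07:48 UTC 2024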

Reproduction steps

  • constantly scale a task group in a job
  • keep doing deployments; you'll be unable to monitor them

Expected Result

  • monitoring a job's deployment should end with either a successful deployment or a non-zero exit code.
  • scaling a job that has an ongoing deployment should always return the job scaling blocked due to active deployment error.

Actual Result

  • we get Evaluation "46e24c94" finished with status "canceled" and a zero exit code
  • we're able to run nomad job scale commands against a job with an active deployment

Workaround Thoughts

Right now the only thing I can come up with is taking a lock on a Nomad variable prior to any deployment or scaling activity, but erm, I kinda thought that was Nomad's job. 😅
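
For what it's worth, a rough sketch of that idea using Nomad variable locks (the variable path and wrapped script are made up, and I haven't verified how airtight this would be):

# Hold a lock on a Nomad variable for the duration of the wrapped command;
# other invocations can't acquire the lock until it's released.
nomad var lock deploy-locks/JOB-NAME ./deploy-or-scale.sh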

@tgross
Member

tgross commented Jun 26, 2024

Hi @lattwood!

tl;dr for this scenario, you'll probably want to use nomad job deployments -latest :jobid, rather than trying to follow the deployments by chasing the chain of evaluations.


First, I want to provide a little context on why evaluation CreateTime and CreateIndex might not be ordered consistently between evaluations. Each Raft write has to be replicated to followers, and each follower applies it to its in-memory state store; only then is it applied on the leader. The implication is that the "FSM apply" step must be deterministic, so timestamps have to be created before we try writing to Raft. So whatever RPC handler is dealing with the event that creates an evaluation (e.g. job submission, scaling, node update, etc.) is responsible for setting a CreateTime, and those RPC handlers run concurrently. The CreateIndex, on the other hand, is set by the FSM apply step, which runs in a single thread per Raft node. If you're interested, there's some internal developer documentation on the state store that you might find helpful.

What's more, the order of evaluations for a given job basically doesn't matter at all! An evaluation is just a signal to the scheduler that it has work to do for that job. The broker on the leader ensures that only a single evaluation is being processed for a given job at a time (in parallel across all jobs, of course), and in recent versions of Nomad we load-shed evaluations we don't need, so that at most one evaluation is being processed and one is waiting (or blocked) per job. You can see that happened to both of your example evals, 46e24c94 and 42842e2e:

"StatusDescription": "canceled after more recent eval was processed"

Unfortunately that's going to make it very hard to monitor a job deployment via the CLI if you're concurrently and "constantly" pushing new evals into the queue for scaling.

Adding to the trouble here, deployments aren't created atomically with job registration. We don't create a deployment until the evaluation hits the scheduler (because otherwise we'd create expensive no-op deployments for in-place updates). So a legal interleaving of operations could be:

  • request A sends a Job.Register RPC to update the job
  • job version 1 + eval abc123 written
  • request B sends a Job.Scale RPC to scale the job; that RPC looks for a deployment and doesn't see one, so it writes job version 2 + eval def456
  • scheduler dequeues eval abc123 (or def456, doesn't matter!)
  • scheduler creates deployment for version 2

Or if the scheduler dequeues the eval before the Job.Scale RPC but doesn't finish scheduling before that RPC checks for a deployment, the plan will get rejected because it'll be on an older version of the job.

This will definitely allow a job change for scaling to slip in between when a job registration is made and when the actual deployment is created.

The right way for us to improve this in Nomad would be to add support for a -check-index flag to job scale, like the one the job run command already supports. This lets you assert that you're making a change against the exact version of the job you think you are.
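
For reference, the existing job run flow looks roughly like this (the file name and index are example values, and the plan output is abbreviated):

nomad job plan example.nomad.hcl
# ...
# Job Modify Index: 87987021
# To submit the job with version verification run:
#
#   nomad job run -check-index 87987021 example.nomad.hcl

# The run is rejected if the job was modified since the plan was made.
nomad job run -check-index 87987021 example.nomad.hcl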

In the meantime, you can probably improve your shell script by checking nomad job deployments -latest :jobid, rather than trying to follow the deployments by chasing the chain of evaluations.
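
As a rough sketch (grepping the human-readable Status column is a blunt heuristic; parsing the -json output would be more robust), that could look like:

JOB="JOB-NAME"   # hypothetical

# Poll the most recent deployment until it reaches a terminal state.
until nomad job deployments -latest "$JOB" | grep -Eq 'successful|failed|cancelled'; do
  sleep 5
done

nomad job deployments -latest "$JOB"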

@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jun 26, 2024
@tgross tgross self-assigned this Jun 26, 2024
@tgross tgross added the theme/deployments and theme/autoscaling labels and removed the theme/autoscaling label Jun 26, 2024
@lattwood
Contributor Author

wow, thanks for the detailed response.

Thinking about how to work around this in our CI/CD: we should add -detach to the nomad job run command when doing container updates, then use the -latest flag to check the deployment's status, and for CI/CD purposes consider it "green" once some percentage threshold is hit.
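
Something along these lines is what I have in mind (the 90% threshold is arbitrary, and the jq expression assumes the deployment JSON exposes TaskGroups with DesiredTotal/HealthyAllocs fields, which I'd want to double-check):

# CI/CD: push the new job version without blocking on the deployment.
nomad job run -detach ./job.nomad.hcl

# Later: fetch the latest deployment and check the healthy percentage.
nomad job deployments -latest -json JOB-NAME \
  | jq -e 'if type == "array" then .[0] else . end
           | [.TaskGroups[] | .HealthyAllocs / .DesiredTotal] | min >= 0.9' \
  && echo "deployment is green enough for CI/CD"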

tgross added a commit that referenced this issue Jun 27, 2024
The RPC handler for scaling a job passes flags to enforce the job modify index
is unchanged when it makes the write to Raft. But it's only checking against the
existing job modify index at the time the RPC handler snapshots the state store,
so it can only enforce consistency for its own validation.

In clusters with automated scaling, it would be useful to expose the enforce
index options to the API, so that cluster admins can enforce that scaling only
happens when the job state is consistent with a state they've previously seen in
other API calls. Add this option to the CLI and API and have the RPC handler
check them if asked.

Fixes: #23444
@tgross
Member

tgross commented Jun 27, 2024

PR for the -check-index flag is up here: #23457
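
Once that lands, usage should look roughly like this (the index, group, and count are just example values):

# Only scale if the job is still at the modify index we last observed;
# otherwise the request is rejected and the script can re-check state.
nomad job scale -check-index 87987021 JOB-NAME GROUP-NAME 5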

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Jun 27, 2024
@tgross tgross added this to the 1.8.2 milestone Jun 27, 2024
@tgross tgross added the theme/autoscaling Issues related to supporting autoscaling label Jun 27, 2024
tgross added a commit that referenced this issue Jun 27, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Jun 27, 2024
@lattwood
Contributor Author

thanks @tgross!


I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2024