
Scaling is possible during deployment with bad results #23444

Closed
lattwood opened this issue Jun 26, 2024 · 5 comments · Fixed by #23457
Labels: theme/autoscaling, theme/deployments, type/bug

@lattwood
Contributor

lattwood commented Jun 26, 2024

Super quick tl;dr

Evaluations have a different sort order when sorted by CreateIndex vs CreateTime, and this impacts deployments of jobs that are also being scaled.

Nomad version

~# nomad version
Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

Relevant: we're using a shell script to query Nomad and then run various nomad job scale commands. The script also checks whether a deployment is already in progress before scaling, but as you'll soon see, given the timeframes involved that check wouldn't have helped.
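
For illustration, a minimal sketch of the kind of thing the script does (the job and group names are hypothetical, and the deployment check shown is only an approximation of our actual logic):

#!/usr/bin/env bash
set -euo pipefail

JOB="JOB-NAME"      # hypothetical job name
GROUP="workers"     # hypothetical task group
COUNT="$1"          # desired count, computed by our autoscaling logic

# Skip scaling if the job's latest deployment still appears to be running.
if nomad job status "$JOB" | grep -A 3 'Latest Deployment' | grep -q 'running'; then
  echo "deployment in progress, skipping scale" >&2
  exit 0
fi

nomad job scale "$JOB" "$GROUP" "$COUNT"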

Issue

When performing a deployment of a job that is constantly being scaled, instead of the scale attempts failing with Nomad's job scaling blocked due to active deployment error, spooky things happen.

This is the spooky thing.

==> 2024-06-26T15:07:49Z: Monitoring evaluation "46e24c94"
    2024-06-26T15:07:49Z: Evaluation triggered by job "JOB-NAME"
    2024-06-26T15:07:50Z: Evaluation status changed: "pending" -> "canceled"
==> 2024-06-26T15:07:50Z: Evaluation "46e24c94" finished with status "canceled"

I go and take a look at the evaluation it created.

> nomad eval status -json 46e24c94
{
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 87987021,
    "CreateTime": 1719414468345035507,
    "DeploymentID": "",
    "EscapedComputedClass": false,
    "FailedTGAllocs": null,
    "ID": "46e24c94-b015-0f3e-4374-ae5ffe59d202",
    "JobID": "JOB-NAME",
    "JobModifyIndex": 87987021,
    "ModifyIndex": 87987127,
    "ModifyTime": 1719414469680946947,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": null,
    "QuotaLimitReached": "",
    "RelatedEvals": null,
    "SnapshotIndex": 0,
    "Status": "canceled",
    "StatusDescription": "canceled after more recent eval was processed",
    "TriggeredBy": "job-register",
    "Type": "service",
    "Wait": 0,
    "WaitUntil": null
}

Nothing out of the ordinary, save for the NextEval field being empty (it would be nice if that were populated in this case). DeploymentID is empty too, but that might be expected with job-register. Anyway, armed with the evaluation's creation timestamp (CreateTime), I go into the Nomad UI, find the evaluation directly above it in the list, and look at that.

> nomad eval status -json 42842e2e
{
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 87987065,
    "CreateTime": 1719414468130257405,
    "DeploymentID": "7ccedd16-93b1-ac73-dd43-54fd218e0e58",
    "EscapedComputedClass": false,
    "FailedTGAllocs": null,
    "ID": "42842e2e-f6c5-79e9-f2c2-bbb9fb867443",
    "JobID": "JOB-NAME",
    "JobModifyIndex": 0,
    "ModifyIndex": 87987127,
    "ModifyTime": 1719414469680944024,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": null,
    "QuotaLimitReached": "",
    "RelatedEvals": null,
    "SnapshotIndex": 0,
    "Status": "canceled",
    "StatusDescription": "canceled after more recent eval was processed",
    "TriggeredBy": "deployment-watcher",
    "Type": "service",
    "Wait": 0,
    "WaitUntil": null
}

Again, nothing out of the ordinary, right? Sadly, no.

Key            46e24c94              42842e2e              Delta
CreateIndex    87987021              87987065              +44
CreateTime     1719414468345035507   1719414468130257405   -214778102 ns
Calendar Time  3:07:48.345           3:07:48.130           ≈ -215 ms

An evaluation created at T+0 has a greater Raft index than another evaluation created at T+215ms.
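
(For anyone following along, the calendar times above come from the seconds portion of the nanosecond CreateTime values; with GNU date, for example:)

date -u -d @1719414468
# Wed Jun 26 15:07:48 UTC 2024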

Reproduction steps

  • constantly scale a task group in a job
  • keep doing deployments; you'll be unable to monitor them

Expected Result

  • monitoring a job's deployment should end with either a successful deployment or a non-zero exit code.
  • scaling a job that has an ongoing deployment should always return the job scaling blocked due to active deployment error.

Actual Result

  • we get Evaluation "46e24c94" finished with status "canceled" and a zero exit code
  • we're able to run nomad job scale commands against a job with an active deployment

Workaround Thoughts

Right now the only thing I can come up with is taking a lock on a Nomad variable prior to any deployment or scaling activity, but erm, I kinda thought that was Nomad's job. 😅
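
For what it's worth, a rough sketch of that idea using Nomad variable locks (the variable path and wrapped script are made up, and I haven't verified how airtight this would be):

# Hold a lock on a Nomad variable for the duration of the wrapped command;
# other invocations can't acquire the lock until it's released.
nomad var lock deploy-locks/JOB-NAME ./deploy-or-scale.sh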

@tgross
Member

tgross commented Jun 26, 2024

Hi @lattwood!

tl;dr for this scenario, you'll probably want to use nomad job deployments -latest :jobid, rather than trying to follow the deployments by chasing the chain of evaluations.


First, I want to provide a little context on why evaluation CreateTime and CreateIndex might not be ordered consistently between evaluations. Each Raft write has to be replicated to followers, and each follower applies it to its in-memory state store; only then is it applied on the leader. The implication is that the "FSM apply" step must be deterministic, so timestamps have to be created before we try writing to Raft. So whatever RPC handler is dealing with the event that creates an evaluation (e.g. job submission, scaling, node update, etc.) is responsible for setting a CreateTime, and those RPC handlers run concurrently. The CreateIndex, on the other hand, is set by the FSM apply step, which runs in a single thread per Raft node. If you're interested, there's some internal developer documentation on the state store that you might find helpful.

What's more, the order of evaluations for a given job basically doesn't matter at all! An evaluation is just a signal to the scheduler that it has work to do for that job. The broker on the leader ensures that only a single evaluation is being processed for a given job at a time (in parallel across all jobs, of course), and in recent versions of Nomad we load-shed evaluations we don't need, so that at most one evaluation is being processed and one is waiting (or blocked) per job. You can see that happened to both of your example evals, 46e24c94 and 42842e2e:

"StatusDescription": "canceled after more recent eval was processed"

Unfortunately that's going to make it very hard to monitor a job deployment via the CLI if you're concurrently and "constantly" pushing new evals into the queue for scaling.

Adding to the trouble here, deployments aren't created atomically with job registration. We don't create a deployment until the evaluation hits the scheduler (because otherwise we'd create expensive no-op deployments for in-place updates). So a legal interleaving of operations could be:

  • request A sends a Job.Register RPC to update the job
  • job version 1 + eval abc123 written
  • request B sends a Job.Scale RPC to scale the job; that RPC looks for a deployment and doesn't see one, so it writes job version 2 + eval def456
  • scheduler dequeues eval abc123 (or def456, doesn't matter!)
  • scheduler creates deployment for version 2

Or if the scheduler dequeues the eval before the Job.Scale RPC but doesn't finish scheduling before that RPC checks for a deployment, the plan will get rejected because it'll be on an older version of the job.

This will definitely allow a job change for scaling to slip in between when a job registration is made and when the actual deployment is created.

The right way for us to improve this in Nomad would be to add support for a -check-index flag to job scale, like the one the job run command already supports. This lets you assert that you're making a change against the exact version of the job you think you are.
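
For reference, the existing job run flow looks roughly like this (the file name and index are example values, and the plan output is abbreviated):

nomad job plan example.nomad.hcl
# ...
# Job Modify Index: 87987021
# To submit the job with version verification run:
#
#   nomad job run -check-index 87987021 example.nomad.hcl

# The run is rejected if the job was modified since the plan was made.
nomad job run -check-index 87987021 example.nomad.hcl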

In the meantime, you can probably improve your shell script by checking nomad job deployments -latest :jobid, rather than trying to follow the deployments by chasing the chain of evaluations.
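
As a rough sketch (grepping the human-readable Status column is a blunt heuristic; parsing the -json output would be more robust), that could look like:

JOB="JOB-NAME"   # hypothetical

# Poll the most recent deployment until it reaches a terminal state.
until nomad job deployments -latest "$JOB" | grep -Eq 'successful|failed|cancelled'; do
  sleep 5
done

nomad job deployments -latest "$JOB"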

@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jun 26, 2024
@tgross tgross self-assigned this Jun 26, 2024
@tgross tgross added the theme/deployments and theme/autoscaling labels and removed the theme/autoscaling label Jun 26, 2024
@lattwood
Contributor Author

wow, thanks for the detailed response.

Thinking about how to work around this in our CI/CD: we should add -detach to the nomad job run command when doing container updates, then use the -latest flag to check the deployment's status, and for CI/CD purposes consider it "green" once some percentage threshold is hit.
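
Something along these lines is what I have in mind (the 90% threshold is arbitrary, and the jq expression assumes the deployment JSON exposes TaskGroups with DesiredTotal/HealthyAllocs fields, which I'd want to double-check):

# CI/CD: push the new job version without blocking on the deployment.
nomad job run -detach ./job.nomad.hcl

# Later: fetch the latest deployment and check the healthy percentage.
nomad job deployments -latest -json JOB-NAME \
  | jq -e 'if type == "array" then .[0] else . end
           | [.TaskGroups[] | .HealthyAllocs / .DesiredTotal] | min >= 0.9' \
  && echo "deployment is green enough for CI/CD"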

tgross added a commit that referenced this issue Jun 27, 2024
The RPC handler for scaling a job passes flags to enforce the job modify index
is unchanged when it makes the write to Raft. But it's only checking against the
existing job modify index at the time the RPC handler snapshots the state store,
so it can only enforce consistency for its own validation.

In clusters with automated scaling, it would be useful to expose the enforce
index options to the API, so that cluster admins can enforce that scaling only
happens when the job state is consistent with a state they've previously seen in
other API calls. Add this option to the CLI and API and have the RPC handler
check them if asked.

Fixes: #23444
@tgross
Member

tgross commented Jun 27, 2024

PR for the -check-index flag is up here: #23457
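
Once that lands, usage should look roughly like this (the index, group, and count are just example values):

# Only scale if the job is still at the modify index we last observed;
# otherwise the request is rejected and the script can re-check state.
nomad job scale -check-index 87987021 JOB-NAME GROUP-NAME 5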

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Jun 27, 2024
@tgross tgross added this to the 1.8.2 milestone Jun 27, 2024
@tgross tgross added the theme/autoscaling Issues related to supporting autoscaling label Jun 27, 2024
tgross added a commit that referenced this issue Jun 27, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Jun 27, 2024
@lattwood
Contributor Author

thanks @tgross!


I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2024