Unable to normally stop and purge system job with csi plugin #11758

Closed
ygersie opened this issue Jan 3, 2022 · 7 comments · Fixed by #12114
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/storage type/bug

ygersie (Contributor) commented Jan 3, 2022

Nomad version

v1.2.3

Operating system and Environment details

macOS, running nomad agent -dev

Issue

A failed system job that has a csi_plugin stanza cannot be stopped and purged, and passing -purge unexpectedly starts the job again.

Reproduction steps

Run the example job below.

job "example" {
  datacenters = ["dc1"]
  type        = "system"

  group "example" {
    task "example" {
      driver = "docker"
      config {
        image = "alpine"
        args  = ["/bin/sh", "-c", "exit 1"]
      }

      restart {
        attempts = 1
        interval = "10s"
        delay    = "5s"
        mode     = "fail"
      }

      csi_plugin {
        id        = "example"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

Wait until the job transitions to the failed state, then stop and purge it:

nomad job stop -purge example

Now check the status of the job:

$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2022-01-03T09:33:10+01:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
example     0       0         0        1       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created    Modified
86f91d4c  73681f8d  example     0        run      failed  2m52s ago  2m42s ago

This should have returned a not-found error, but the job is still there and the Desired column still shows run. Re-running nomad job stop -purge example doesn't change the outcome until a GC has run. After triggering a GC with nomad system gc and rerunning stop -purge, the result becomes:

$ nomad job stop -purge example
==> 2022-01-03T09:44:23+01:00: Monitoring evaluation "ed1f4546"
    2022-01-03T09:44:23+01:00: Evaluation triggered by job "example"
    2022-01-03T09:44:23+01:00: Allocation "c8c61822" created: node "73681f8d", group "example"
==> 2022-01-03T09:44:24+01:00: Monitoring evaluation "ed1f4546"
    2022-01-03T09:44:24+01:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-03T09:44:24+01:00: Evaluation "ed1f4546" finished with status "complete"

Instead of stopping the job, this recreates the allocation.

jrasell (Member) commented Jan 3, 2022

Hi @ygersie and thanks for providing such a detailed reproduction. I ran through this locally and got the same results as you detailed. These results are very unexpected.

jrasell added the stage/accepted and theme/storage labels Jan 3, 2022

tgross (Member) commented Jan 3, 2022

I suspect this is related to another CSI plugin counts issue; I'm taking a pass through our open CSI issues over the next few weeks and will look at this as part of that work.

tgross (Member) commented Feb 3, 2022

Noting here that I've marked #11114 as a duplicate of this one. #10073 may also ultimately be a duplicate but I'll leave that open for the time being as the cause is subtly different.

It looks like there are two parts to this:

  • The plugin counts don't accurately match the state of allocations. I'm working up a patch that resolves plugin counts in a similar fashion to what we did for volume claims in #11890 (CSI: resolve invalid claim states).
  • There may also be a race between how we trigger allocation stops from a job purge, which in turn triggers the plugin GC, and how we purge the job. We can't GC the plugin until the allocation is terminal, but perhaps (incorrectly) can't GC the job until the plugin is GC'd. I'll be looking into this as well.
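
To make the first point concrete, here is a minimal Go sketch of deriving plugin counts from allocation state rather than from a separately maintained counter that can drift. The Alloc type, its status strings, and the liveNodePlugins and canGCPlugin helpers are hypothetical names invented for illustration; this is not Nomad's actual state store code or the patch itself.

```go
package main

import "fmt"

// Alloc is a hypothetical, pared-down stand-in for an allocation that
// runs a CSI node plugin. It is not Nomad's real allocation type.
type Alloc struct {
	ID           string
	PluginID     string
	ClientStatus string // assumed values: "pending", "running", "complete", "failed", "lost"
}

// terminal reports whether the allocation has finished for good.
func (a Alloc) terminal() bool {
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// liveNodePlugins recomputes per-plugin counts from allocation state
// instead of trusting a counter that is adjusted elsewhere.
func liveNodePlugins(allocs []Alloc) map[string]int {
	counts := make(map[string]int)
	for _, a := range allocs {
		if !a.terminal() {
			counts[a.PluginID]++
		}
	}
	return counts
}

// canGCPlugin treats a plugin as collectable once no live allocation serves it.
func canGCPlugin(pluginID string, allocs []Alloc) bool {
	return liveNodePlugins(allocs)[pluginID] == 0
}

func main() {
	// One failed allocation, as in the reproduction.
	allocs := []Alloc{{ID: "86f91d4c", PluginID: "example", ClientStatus: "failed"}}
	fmt.Println(liveNodePlugins(allocs)["example"], canGCPlugin("example", allocs))
}
```

With only the failed allocation from the reproduction, the recomputed count is 0 and the plugin is collectable, so main prints 0 true.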

tgross (Member) commented Feb 8, 2022

I've opened #12027 which is targeting #9810 and #10073 but may be a partial fix for this issue. Once I've got that merged I'll be digging into this.

tgross added this to the 1.3.0 milestone Feb 17, 2022

tgross (Member) commented Feb 23, 2022

OK, so following #12027, #10073, and #12078 we've almost got this one resolved. There's just one bug left: we can't deregister the job because deregistration tries to remove the job from a plugin that doesn't exist:

2022-02-23T20:51:35.777Z [ERROR] nomad.fsm: deregistering job failed: error="DeleteJob failed: deleting job from plugin: plugin missing: example " job=badplugin namespace=default

tgross (Member) commented Feb 23, 2022

Fixed in #12114! That'll ship in Nomad 1.3.0.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022