
Deployment status success even if there are no healthy allocations in nomad #227

Closed
yellowmegaman opened this issue Aug 21, 2018 · 3 comments


@yellowmegaman

Hi there! I think I'm doing something terribly wrong here, but:

Relevant Nomad job specification file

job "example" {
  datacenters = ["newdev"]
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
        network {
          mbits = 10
          port "db" {}
        }
      }
      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
        check {
          type     = "script"
          name     = "nonexisting-script"
          command  = "/usr/bin/nonexisting-script"
          interval = "60s"
          timeout  = "120s"
          check_restart {
            limit = 3
            grace = "90s"
            ignore_warnings = false
          }
        } #check
      }
    }
  }
}

Output of levant version:

Levant v0.2.2
Date: 2018-08-06T09:05:20Z
Commit: df0fe5759617805430bd847d9d445ff9a65b01c6
Branch: 0.2.2
State: 0.2.2
Summary: df0fe5759617805430bd847d9d445ff9a65b01c6

Output of consul version:

Consul v1.2.2
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Output of nomad version:

Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Additional environment details:
Debian, consul, nomad, docker 18.06.0-ce

Debug log outputs from Levant:

levant deploy -log-level=debug -address=http://xxx.xxx.xxx.xxx:4646 example.nomad
2018-08-21T22:19:26+03:00 |DEBU| template/render: no variable file passed, trying defaults
2018-08-21T22:19:26+03:00 |DEBU| helper/files: no default var-file found
2018-08-21T22:19:26+03:00 |DEBU| template/render: no command line variables passed
2018-08-21T22:19:27+03:00 |INFO| levant/deploy: job is not running, using template file group counts job_id=example
2018-08-21T22:19:27+03:00 |DEBU| levant/plan: triggering Nomad plan job_id=example
2018-08-21T22:19:27+03:00 |INFO| levant/plan: job is a new addition to the cluster job_id=example
2018-08-21T22:19:27+03:00 |INFO| levant/deploy: triggering a deployment job_id=example
2018-08-21T22:19:28+03:00 |INFO| levant/deploy: evaluation 48794f86-d5cc-dbec-df60-59c639418a40 finished successfully job_id=example
2018-08-21T22:19:28+03:00 |INFO| levant/deploy: job is not configured with update stanza, consider adding to use deployments job_id=example
2018-08-21T22:19:28+03:00 |DEBU| levant/job_status_checker: running job status checker for job job_id=example
2018-08-21T22:19:28+03:00 |INFO| levant/job_status_checker: job has status running job_id=example
2018-08-21T22:19:28+03:00 |INFO| levant/job_status_checker: task redis in allocation 9ba1aee8-6fd4-5c23-c033-632760ce1956 now in pending state job_id=example
2018-08-21T22:19:33+03:00 |INFO| levant/job_status_checker: task redis in allocation 9ba1aee8-6fd4-5c23-c033-632760ce1956 now in running state job_id=example
2018-08-21T22:19:33+03:00 |INFO| levant/job_status_checker: all allocations in deployment of job are running job_id=example
2018-08-21T22:19:33+03:00 |INFO| levant/deploy: job deployment successful job_id=example

Nomad allocation info:
https://gist.github.com/yellowmegaman/3ebea496c67ee5d512f8e7a6658afcab

Consul healthcheck:
(screenshot of the failing health check omitted)

So I used nomad init to create the default Redis template, removed almost all optional features and blocks, changed the datacenter name, and added a health check with a non-existent binary.

After running the deployment with Levant I expected it to wait for everything to settle down, health checks included, since I saw this in the docs: "Advanced Job Status Checking: Particularly for system and batch jobs, Levant will ensure the job, evaluations and allocations all reach the desired state providing feedback at every stage."

And one more thing: I was really surprised to see that Nomad deployments apparently are not used when deploying with Levant. Is that correct?

Thanks in advance; I hope I just misread something.

@jrasell
Member

jrasell commented Aug 21, 2018

@yellowmegaman thanks for raising this and for all the detail. I believe I know what is going on here, so let me explain a couple of points and answer a few questions.

  • "it looks like nomad deployments are not in use when deploying with levant": as logged, the job does not have an update stanza, so Nomad will not create a deployment for it. Levant is only a wrapper around Nomad and does not interfere with, or alter, Nomad's internal scheduling and decision making. It merely attempts to provide additional feedback on events to improve CD.

  • the bug/feature, I believe, comes from Nomad reporting the allocs as running and then restarting/killing them shortly after, as shown in the alloc detail. The task does indeed start, but after a time the check fails and check_restart kicks in. Levant watches the allocs and, once they reach the running state, exits rather than keep watching them for an additional period.

An immediate workaround is to add an update stanza, which will let this job make use of Nomad deployments and should improve your experience overall.

For Levant going forward, I think it would be worth adding an additional watch period, configurable but with a sane default, which would ensure tasks restarting in this manner are reported correctly. I would appreciate any feedback you have on this.

@yellowmegaman
Author

yellowmegaman commented Aug 21, 2018

@jrasell thanks for the quick reply!
Actually, the first time I encountered this situation I had an update stanza (I use almost all of Nomad's features).

"Levant watches the allocs and, once they reach the running state, exits rather than keep watching them for an additional period." - but does that mean Levant only watches for the running state? Because in the example above the health checks are red from the start and stay red.

My attempt with update:

$ nomad stop -purge example
==> Monitoring evaluation "471f0856"
    Evaluation triggered by job "example"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "471f0856" finished with status "complete"

Added update:

job "example" {
  datacenters = ["newdev"]
  type = "service"
  group "cache" {
    update {
      max_parallel = 1
      min_healthy_time = "10s"
      healthy_deadline = "3m"
      progress_deadline = "10m"
      auto_revert = false
      canary = 0
    }
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
        network {
          mbits = 10
          port "db" {}
        }
      }
      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
        check {
          type     = "script"
          name     = "nonexisting-script"
          command  = "/usr/bin/nonexisting-script"
          interval = "60s"
          timeout  = "120s"
          check_restart {
            limit = 3
            grace = "90s"
            ignore_warnings = false
          }
        } #check
      }
    }
  }
}

Running levant:

$ levant deploy -log-level=debug -address=http://xxx.xxx.xxx.xxx:4646 github.nomad 
2018-08-21T23:34:00+03:00 |DEBU| template/render: no variable file passed, trying defaults
2018-08-21T23:34:00+03:00 |DEBU| helper/files: no default var-file found
2018-08-21T23:34:00+03:00 |DEBU| template/render: no command line variables passed
2018-08-21T23:34:00+03:00 |INFO| levant/deploy: job is not running, using template file group counts job_id=example
2018-08-21T23:34:00+03:00 |DEBU| levant/plan: triggering Nomad plan job_id=example
2018-08-21T23:34:01+03:00 |INFO| levant/plan: job is a new addition to the cluster job_id=example
2018-08-21T23:34:01+03:00 |INFO| levant/deploy: triggering a deployment job_id=example
2018-08-21T23:34:02+03:00 |INFO| levant/deploy: evaluation 8f59210c-1719-bfc9-8851-786adbd01c5d finished successfully job_id=example
2018-08-21T23:34:02+03:00 |INFO| levant/deploy: job is not configured with update stanza, consider adding to use deployments job_id=example
2018-08-21T23:34:02+03:00 |DEBU| levant/job_status_checker: running job status checker for job job_id=example
2018-08-21T23:34:02+03:00 |INFO| levant/job_status_checker: job has status running job_id=example
2018-08-21T23:34:02+03:00 |INFO| levant/job_status_checker: task redis in allocation 18a28f20-0eb7-188c-e37a-cc7ee70ffcef now in running state job_id=example
2018-08-21T23:34:02+03:00 |INFO| levant/job_status_checker: all allocations in deployment of job are running job_id=example
2018-08-21T23:34:02+03:00 |INFO| levant/deploy: job deployment successful job_id=example
cloud@newdev-2:~$ nomad deployment list
ID        Job ID   Job Version  Status   Description
5501e535  example  0            running  Deployment is running
cloud@newdev-2:~$ nomad deployment status 5501e535
ID          = 5501e535
Job ID      = example
Job Version = 0
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       0        0          2018-08-21T20:44:01Z
cloud@newdev-2:~$

My main goal was to replace my monstrous 62-line bash script with something closer to AI to prove a deployment successful =)

This script is basically a loop polling the Nomad API until the service's allocations reach desired == healthy (roughly the sketch below).
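
For reference, a rough sketch of that loop (assuming curl and jq are available, and that the job actually has a deployment to watch; the address is the same placeholder as in the logs above):

#!/usr/bin/env bash
# Rough sketch: poll the most recent deployment of a job until every
# task group reports as many healthy allocations as desired, or until
# the deployment leaves the "running" state.
NOMAD_ADDR="http://xxx.xxx.xxx.xxx:4646"  # placeholder, as above
JOB_ID="example"

while true; do
  deployment=$(curl -s "${NOMAD_ADDR}/v1/job/${JOB_ID}/deployment")
  if [ -z "$deployment" ] || [ "$deployment" = "null" ]; then
    echo "no deployment found for ${JOB_ID}"
    exit 1
  fi

  status=$(echo "$deployment" | jq -r '.Status')
  # true only when HealthyAllocs >= DesiredTotal in every task group
  healthy=$(echo "$deployment" | jq '[.TaskGroups[] | .HealthyAllocs >= .DesiredTotal] | all')

  if [ "$healthy" = "true" ]; then
    echo "deployment ${status}: all task groups healthy"
    exit 0
  elif [ "$status" != "running" ]; then
    echo "deployment ended with status ${status} before becoming healthy"
    exit 1
  fi
  sleep 5
done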

About the additional watch period: I don't follow you here. Isn't it more correct to just watch the desired/placed/healthy counts instead of a period of time?

Thanks again)

@yellowmegaman
Author

yellowmegaman commented Dec 3, 2018

Just tried it again with the latest version: I added a non-existent script health check to the default example.nomad job, and Levant did everything nicely and correctly.
After the timeout defined in the update stanza, the deployment was marked failed and the terminal running Levant was released with the proper exit code.
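
So the exit code can now gate a CI step directly; a minimal sketch (same placeholder address as before):

levant deploy -address=http://xxx.xxx.xxx.xxx:4646 example.nomad \
  && echo "deployment healthy" \
  || { echo "deployment failed"; exit 1; }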

Thumbs up and thank you )
