emit nomad.client.allocs.oom_killed for raw_exec jobs #19767

Closed
shantanugadgil opened this issue Jan 17, 2024 · 5 comments · Fixed by #19829
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/driver/raw_exec type/bug

Comments

@shantanugadgil
Contributor

Nomad version

Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328

Operating system and Environment details

Amazon Linux 2/2023

Issue

Nomad 1.7.x honors the resource parameter and kills a task (as expected) when the memory threshold is crossed.

The metric nomad.client.allocs.oom_killed does NOT seem to be emitted for the task which was killed.

Reproduction steps

Run the job spec below as a raw_exec task and observe that the task is killed but no metric is emitted. (I use statsd.)
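To confirm that the kill really was an OOM kill and not some other failure, one option is to read the kernel's own counter for the task's cgroup. The following is a minimal sketch, assuming cgroup v2 (the default on Amazon Linux 2023, not on Amazon Linux 2) and assuming you have already located the task's cgroup directory. That directory path is not specified anywhere in this issue and must be supplied by the caller.

```python
from pathlib import Path
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        # Each valid line is a counter name followed by a non-negative integer.
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def oom_kill_count(cgroup_dir: str) -> int:
    """Return the oom_kill counter for a cgroup directory (0 if absent).

    cgroup_dir is an assumption: pass whatever directory under
    /sys/fs/cgroup holds the task on your client node.
    """
    text = Path(cgroup_dir, "memory.events").read_text()
    return parse_memory_events(text).get("oom_kill", 0)
```

A non-zero oom_kill value after the task dies suggests the kernel OOM killer acted on that cgroup, which is the case where the metric would be expected.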

Expected Result

The nomad.client.allocs.oom_killed metric should be emitted for the killed task.

Actual Result

No metric is emitted.

Job file (if appropriate)

The relevant task section:

    task "oom" {

      template {
        data            = file("oom_test.py")
        destination     = "local/oom_test.py"

        left_delimiter  = "{[("
        right_delimiter = ")]}"
      }

      driver = "raw_exec"

      config {
        command = "python3"
        args    = ["-u", "local/oom_test.py"]
      }

      resources {
        cpu    = 512
        memory = 16
      }
    }

oom_test.py:

#!/usr/bin/python

s = []
for i in range(10000):
    for j in range(10000):
        for k in range(10000):
            s.append("0000000000")

possibly related to #19204 ?

@tgross
Member

tgross commented Jan 17, 2024

Hi @shantanugadgil! Thanks for opening this issue. I'll get it marked for roadmapping.

> possibly related to #19204 ?

Yeah, almost certainly. The raw_exec and exec drivers share much of the same internals in drivers/shared/executor.

@shantanugadgil
Contributor Author

🥳 🎉

@shantanugadgil
Contributor Author

@tgross @jrasell I am still not getting the metric nomad.client.allocs.oom_killed for raw_exec jobs.

This is my telemetry setting on the agent:

$ cat telemetry.hcl 
telemetry {
  publish_allocation_metrics = true

  publish_node_metrics       = true

  # reduce cardinality (?)
  disable_hostname           = true

  collection_interval        = "5s"

  datadog_address            = "127.0.0.1:8125"
  prometheus_metrics         = true

  disable_dispatched_job_summary_metrics = true
}
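Since prometheus_metrics is enabled in this block, a second way to cross-check is to ask the agent directly instead of relying on the statsd path. This is a rough sketch, assuming the agent's HTTP API is on the default 127.0.0.1:4646 without ACLs, and assuming the Prometheus-format name is the dotted name with underscores (nomad_client_allocs_oom_killed). The helper names are illustrative only.

```python
import urllib.request
from typing import List


def find_metric_lines(exposition: str, name: str) -> List[str]:
    """Return sample lines of Prometheus text output whose metric name starts with `name`."""
    hits = []
    for line in exposition.splitlines():
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        if line.startswith(name):
            hits.append(line)
    return hits


def fetch_metrics(addr: str = "http://127.0.0.1:4646") -> str:
    """Fetch agent metrics in Prometheus text format (address is an assumption)."""
    url = addr + "/v1/metrics?format=prometheus"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Usage on a client node, after the task has been killed:
#   print(find_metric_lines(fetch_metrics(), "nomad_client_allocs_oom_killed"))
```

An empty list here would point at the metric never being emitted by the client, not at a problem in the statsd or Datadog pipeline.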

I verified, using a command line as described here:

https://docs.datadoghq.com/developers/dogstatsd/datagram_shell/?tab=metrics

... that custom.metric.name is coming through.
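The same pipeline check can also be scripted. Below is a minimal sketch that builds a DogStatsD datagram in the documented `name:value|type|#tags` form and sends it over UDP, assuming the agent listens on 127.0.0.1:8125 as in the telemetry block. The function names are made up for illustration.

```python
import socket
from typing import Iterable, Optional


def format_datagram(name: str, value: int, metric_type: str = "c",
                    tags: Optional[Iterable[str]] = None) -> str:
    """Build a DogStatsD metric datagram, e.g. 'custom.metric.name:1|c'."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are appended as a comma-separated list after '|#'.
        datagram += "|#" + ",".join(tags)
    return datagram


def send_datagram(datagram: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram over UDP; there is no acknowledgement from the receiver."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Usage: send_datagram(format_datagram("custom.metric.name", 1))
```

If this test metric arrives but nomad.client.allocs.oom_killed does not, the transport is working and the gap is on the emitting side.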

@shantanugadgil
Contributor Author

When I run the same Python script with the docker driver, the metric is reported correctly:

    task "oom" {

      template {
        data            = file("oom_test.py")
        destination     = "local/oom_test.py"

        left_delimiter  = "{[("
        right_delimiter = ")]}"
      }

      driver = "docker"

      config {
        image          = "python:3-alpine"
        auth_soft_fail = true

        command = "python3"
        args    = ["-u", "/local/oom_test.py"]
      }

      resources {
        cpu    = 512
        memory = 32
      }
    }

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 31, 2024