1.3.0 container memory constraints not in effect leading to OOMs #13031
Comments
To add additional context: in 1.2.4, we had a job defined with 4000MB of memory. On 1.3.0, that same job was scheduled into an allocation that reported a heap of ~8000MB, completely ignoring the 4000MB limit we had set. Lastly, completely rolling back our clients to 1.2.4 resolved this OOM problem for us.
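For anyone hitting the same thing, a quick way to check whether the JVM is actually picking up the container limit is to print the heap ceiling it computed at startup. This is a minimal sketch; the class name and the 50% figure are illustrative, not taken from the report above.

```java
public class HeapCheck {
    public static void main(String[] args) {
        // maxMemory() reflects the heap ceiling the JVM computed at startup;
        // -XX:MaxRAMPercentage derives that ceiling from whatever memory
        // limit the JVM detects (container cgroup limit or host RAM).
        long maxMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM max heap: " + maxMiB + " MiB");
    }
}
```

Run inside the allocation with something like `java -XX:MaxRAMPercentage=50 HeapCheck`: against a detected 4000MB limit the ceiling should come out around 2000 MiB, while a value sized off the host's total RAM suggests the cgroup limit is not being seen.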
Thanks for reporting @djenriquez, indeed it is surprising to see any difference on a system where cgroups v1 is in use. Other than the new environment variable, nothing in that code path should have changed. That docker now sees a non-empty CgroupParent is unexpected. Edit: here's the bit that changed the way docker gets configured.
Not 100% sure this is the root cause; might need something to reproduce with after gating this field on cgroups v2.
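For context on what "gating this field on cgroups v2" means here: the usual way to tell the two hierarchies apart is that a unified (v2) hierarchy exposes cgroup.controllers at the cgroup mount root, while v1 does not. Nomad's own check is written in Go; the sketch below is only an illustration of that detection logic, written in Java to match the affected workloads.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupVersion {
    public static void main(String[] args) {
        // A pure cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup
        // exposes cgroup.controllers at its root; a v1 hierarchy instead
        // has per-controller directories (memory, cpuset, ...) and no
        // such file.
        boolean v2 = Files.exists(Path.of("/sys/fs/cgroup/cgroup.controllers"));
        System.out.println("cgroup hierarchy: " + (v2 ? "v2 (unified)" : "v1"));
    }
}
```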
It looks like a fix for this was merged, but it isn't included in the 1.3.1 release. Can we get a release cut to fix this issue? We are also having issues with Java apps using the -XX:MaxRAMPercentage flag.
1.3.1 was a security and panic fix, so it didn't include the rest of the work merged into the 1.3.x series. We have a Nomad 1.3.2 planned soonish that'll include this.
Great, thanks @tgross
Nomad version
Nomad v1.2.4 (9f21b72)
Nomad v1.3.0 (52e95d6)
Operating system and Environment details
Amazon Linux 2
Issue
We recently upgraded our servers and Nomad clients to 1.3.0 and witnessed a substantial increase in OOMs reported for containerized Java applications in our system. Investigation shows that the OOMs are all being reported on Nomad 1.3.0 clients.
When comparing the docker inspect output of two different allocations for the same job, I noticed the following difference:

v1.3.0:
"CgroupParent": "cpuset",
v1.2.4:
"CgroupParent": "",
And in the environment variables, this new env var pops up for 1.3.0:
"NOMAD_PARENT_CGROUP=/nomad",
Also, an early clue was that only Java applications which used -XX:MaxRAMPercentage (and related configuration) were affected. Java applications that used a hard-defined -Xmx and -Xms were not affected. This led us to thinking that cgroup limits were not being respected, which ended up being the case.

We believe this issue is related to #12274. When looking at our client properties, we do see a cgroup version and see that we are on v1, which leads us to believe this was supposed to be a no-op, but unfortunately, it was not.
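A way to confirm from inside one of these containers whether the 4000MB limit is actually being applied is to read the limit straight from the cgroup filesystem. This is a minimal diagnostic sketch, not something from the report; the paths are the standard v1 and v2 locations visible inside a container.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupLimit {
    public static void main(String[] args) throws IOException {
        // cgroup v2 exposes the limit as memory.max at the unified root;
        // cgroup v1 exposes it as memory.limit_in_bytes under the memory
        // controller mount.
        Path v2 = Path.of("/sys/fs/cgroup/memory.max");
        Path v1 = Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes");
        Path limitFile = Files.exists(v2) ? v2 : v1;
        String limit = Files.readString(limitFile).trim();
        System.out.println("container memory limit: " + limit);
        // For the 4000MB job above, a correctly constrained allocation
        // should report roughly 4194304000 bytes rather than "max" or a
        // number on the order of the host's total memory.
    }
}
```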