1.3.0 container memory constraints not in effect leading to OOMs #13031
Comments
To add additional context: in 1.2.4, we had a job defined with 4000MB of memory. On 1.3.0, that same job was scheduled into an allocation that reported a heap of ~8000MB, completely ignoring the 4000MB limit we had set. Lastly, completely rolling back our clients to 1.2.4 resolved this OOM problem for us.
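For anyone hitting the same thing, a quick way to check whether the JVM is actually picking up the container limit is to print the heap ceiling it computed at startup. This is a minimal sketch; the class name and the 50% figure are illustrative, not taken from the report above.

```java
public class HeapCheck {
    public static void main(String[] args) {
        // maxMemory() reflects the heap ceiling the JVM computed at startup;
        // -XX:MaxRAMPercentage derives that ceiling from whatever memory
        // limit the JVM detects (container cgroup limit or host RAM).
        long maxMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM max heap: " + maxMiB + " MiB");
    }
}
```

Run inside the allocation with something like `java -XX:MaxRAMPercentage=50 HeapCheck`: against a detected 4000MB limit the ceiling should come out around 2000 MiB, while a value sized off the host's total RAM suggests the cgroup limit is not being seen.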
Thanks for reporting @djenriquez, indeed it is surprising to see any difference on a system where cgroups v1 is in use. Other than the new environment variable, nothing in that code path should have changed. That docker now sees a non-empty CgroupParent is unexpected. Edit: here's the bit that changed the way docker gets configured.
Not 100% sure this is the root cause; might need something to reproduce with after gating this field on cgroups v2.
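For context on what "gating this field on cgroups v2" means here: the usual way to tell the two hierarchies apart is that a unified (v2) hierarchy exposes cgroup.controllers at the cgroup mount root, while v1 does not. Nomad's own check is written in Go; the sketch below is only an illustration of that detection logic, written in Java to match the affected workloads.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupVersion {
    public static void main(String[] args) {
        // A pure cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup
        // exposes cgroup.controllers at its root; a v1 hierarchy instead
        // has per-controller directories (memory, cpuset, ...) and no
        // such file.
        boolean v2 = Files.exists(Path.of("/sys/fs/cgroup/cgroup.controllers"));
        System.out.println("cgroup hierarchy: " + (v2 ? "v2 (unified)" : "v1"));
    }
}
```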
It looks like a fix for this was merged, but it isn't included in the 1.3.1 release. Can we get a release cut to fix this issue? We are also having issues with Java apps using the -XX:MaxRAMPercentage flag.
1.3.1 was a security and panic fix, so it didn't include the rest of the work merged into the 1.3.x series. We have a Nomad 1.3.2 planned soonish that'll include this.
Great, thanks @tgross
Nomad version
Nomad v1.2.4 (9f21b72)
Nomad v1.3.0 (52e95d6)
Operating system and Environment details
Amazon Linux 2
Issue
We recently upgraded our servers and Nomad clients to 1.3.0 and witnessed a substantial increase in OOMs reported for containerized Java applications in our system. Investigation shows that the OOMs are all being reported on Nomad 1.3.0 clients.
When comparing the docker inspect output of two different allocations for the same job, I noticed the following difference:

v1.3.0:
"CgroupParent": "cpuset",
v1.2.4:
"CgroupParent": "",
And in the environment variables, this new env var pops up for 1.3.0:
"NOMAD_PARENT_CGROUP=/nomad",
Also, an early clue was that only Java applications which used -XX:MaxRAMPercentage (and related configuration) were affected. Java applications that used a hard-defined -Xmx and -Xms were not affected. This led us to thinking that cgroup limits were not being respected, which ended up being the case.

We believe this issue is related to #12274. When looking at our client properties, we do see a cgroup version and see that we are on v1, which leads us to believe this was supposed to be a no-op, but unfortunately, it was not.
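A way to confirm from inside one of these containers whether the 4000MB limit is actually being applied is to read the limit straight from the cgroup filesystem. This is a minimal diagnostic sketch, not something from the report; the paths are the standard v1 and v2 locations visible inside a container.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupLimit {
    public static void main(String[] args) throws IOException {
        // cgroup v2 exposes the limit as memory.max at the unified root;
        // cgroup v1 exposes it as memory.limit_in_bytes under the memory
        // controller mount.
        Path v2 = Path.of("/sys/fs/cgroup/memory.max");
        Path v1 = Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes");
        Path limitFile = Files.exists(v2) ? v2 : v1;
        String limit = Files.readString(limitFile).trim();
        System.out.println("container memory limit: " + limit);
        // For the 4000MB job above, a correctly constrained allocation
        // should report roughly 4194304000 bytes rather than "max" or a
        // number on the order of the host's total memory.
    }
}
```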