Handling memory contention and OOM Killer #10414
Labels
stage/accepted
Confirmed, and intend to work on. No timeline commitment though.
theme/restart/reschedule
theme/scheduling
type/enhancement
When a host is running with contended memory, Nomad needs to take extra care to avoid exacerbating the situation. If a workload is OOM-killed due to contended memory, it should be rescheduled aggressively rather than restarted in place. Restarting OOM-killed tasks on the same host may cause further memory contention and further OOM activity.
Memory contention can arise due to the memory oversubscription feature introduced in #10247. It's also possible that host system services that aren't managed by Nomad may spike their memory usage beyond the configured `reserved` memory flag. Memory contention may occur through
Nomad must distinguish tasks that are OOM-killed because they exceeded their own memory limit from bystander tasks that are killed because they were chosen as victims on an oversubscribed host.
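
The distinction could be sketched roughly as follows. This is an illustrative sketch, not Nomad code: `OOMEvent`, `classify`, and `next_action` are hypothetical names, and it assumes the task's peak memory usage at the time of the kill is available and can be compared against its hard limit. Following the normal restart policy for a task that exceeded its own limit is also an assumption; the issue only states that victims of contention should be rescheduled.

```python
from dataclasses import dataclass
from enum import Enum


class OOMCause(Enum):
    EXCEEDED_OWN_LIMIT = "exceeded_own_limit"  # task went over its own memory limit
    BYSTANDER = "bystander"                    # task was under its limit; the host was contended


class Action(Enum):
    RESTART = "restart"        # assumption: fall back to the normal restart policy
    RESCHEDULE = "reschedule"  # move the task off the contended host


@dataclass
class OOMEvent:
    # Hypothetical fields for illustration; not Nomad data structures.
    limit_bytes: int       # hard memory limit configured for the task
    peak_usage_bytes: int  # peak memory usage observed before the kill


def classify(event: OOMEvent) -> OOMCause:
    """Guess why a task was OOM-killed by comparing its usage to its own limit."""
    if event.peak_usage_bytes >= event.limit_bytes:
        return OOMCause.EXCEEDED_OWN_LIMIT
    return OOMCause.BYSTANDER


def next_action(event: OOMEvent) -> Action:
    """Reschedule bystanders so a restart does not add to host memory pressure."""
    if classify(event) is OOMCause.BYSTANDER:
        return Action.RESCHEDULE
    return Action.RESTART
```

For example, a task with a 512 MiB limit that was killed while using 300 MiB would be classified as a bystander and rescheduled, while one killed at its 512 MiB limit would be handled by the restart path.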