Worker Low Heap Memory Task Killer #21254
Conversation
How is this feature being tested? Can we also add some monitoring/metrics?
}
}

private void onGCNotification(Notification notification)
Can this be put into the GcStatsMonitor itself, which already exists?
Introduced a new class because the existing one is for monitoring GC activity.
Pushing back a bit: this one does that too, except it acts on the GC signals.
DataSize afterGcDataSize = info.getAfterGcTotal();
long garbageCollectedBytes = beforeGcDataSize.toBytes() - afterGcDataSize.toBytes();

if (isLowMemory() && garbageCollectedBytes < lowMemoryTaskKillThreshold.toBytes()) {
Should we only kill when we haven't been able to reclaim for a while, like for a couple of GC cycles?
Keeping track of the memory state is not accurate with this approach: between the last full GC and the current one we have no visibility into whether the worker was able to free up memory or not.
I am not following. Why can you not keep the bytes after the previous full GC in a class instance variable, along with a timestamp?
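For illustration, a minimal sketch of what this suggestion could look like (the field and method names here are assumptions, not the PR's actual code):

// Hypothetical sketch: remember the previous full-GC result in instance fields
// so that consecutive ineffective full GCs can be detected.
private long lastFullGcCollectedBytes;  // bytes reclaimed by the previous full GC
private long lastFullGcTimestampNanos;  // ticker reading at the previous full GC

private void recordFullGc(long collectedBytes, long timestampNanos)
{
    lastFullGcCollectedBytes = collectedBytes;
    lastFullGcTimestampNanos = timestampNanos;
}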
DataSize afterGcDataSize = info.getAfterGcTotal();
long garbageCollectedBytes = beforeGcDataSize.toBytes() - afterGcDataSize.toBytes();

if (isLowMemory() && garbageCollectedBytes < lowMemoryTaskKillThreshold.toBytes()) {
I am slightly confused: lowMemoryTaskKillThreshold is multiplied with max memory in the isLowMemory method and is also used as an absolute value on the RHS? Is it a ratio? If not, why are GC bytes compared with it?
lowMemoryTaskKillThreshold is meant to provide the memory threshold: if a full GC is able to reclaim at least that much, the task killer won't kick in. The other threshold, lowMemoryTaskKillerMemoryThreshold, is used to mark when we consider the worker to be running low on memory.
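To illustrate the intended split between the two settings, here is a rough sketch of the semantics described above (the class, constructor, and method bodies are assumptions, not the exact PR code):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class LowMemoryThresholds
{
    // Fraction of max heap above which the worker is considered low on memory.
    private final double lowMemoryTaskKillerMemoryThreshold;
    // Minimum bytes a full GC must reclaim for the task killer to stay quiet.
    private final long lowMemoryTaskKillThresholdBytes;

    public LowMemoryThresholds(double memoryThreshold, long reclaimThresholdBytes)
    {
        this.lowMemoryTaskKillerMemoryThreshold = memoryThreshold;
        this.lowMemoryTaskKillThresholdBytes = reclaimThresholdBytes;
    }

    public boolean isLowMemory()
    {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() > heap.getMax() * lowMemoryTaskKillerMemoryThreshold;
    }

    public boolean shouldKillTasks(long garbageCollectedBytes)
    {
        // Kill only when low on memory AND the full GC failed to reclaim enough.
        return isLowMemory() && garbageCollectedBytes < lowMemoryTaskKillThresholdBytes;
    }
}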
if (queryId.isPresent()) {
    List<SqlTask> activeTasksToKill = activeQueriesToTasksMap.get(queryId.get());
    for (SqlTask sqlTask : activeTasksToKill) {
        sqlTask.failed(new PrestoException(EXCEEDED_HEAP_MEMORY_LIMIT, "Worker heap memory limit exceeded"));
Could we include some more stats in this error message?
DataSize afterGcDataSize = info.getAfterGcTotal();
long garbageCollectedBytes = beforeGcDataSize.toBytes() - afterGcDataSize.toBytes();

if (isLowMemory() && garbageCollectedBytes < lowMemoryTaskKillThreshold.toBytes()) {
Should we have some kind of a "quiet period" after it has killed a query, to observe the effects? How do we prevent this from going rogue?
A full GC won't happen once the query has been killed, unless more memory-hungry queries are running and consuming memory, resulting in the same scenario. In that case we do want it to kick in and kill them to avoid a JVM OOM.
).sorted(comparator.reversed())
        .map(Map.Entry::getKey)
        .collect(toImmutableList());
return Optional.of(queryIdsSortedByMemoryUsage.get(0));
Can you use maxBy or similar instead of sorting?
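For reference, a fragment of what that could look like with Stream.max instead of sorting (the map name here is assumed):

// Pick the query with the largest reservation directly, without sorting everything.
return activeQueriesMemoryUsage.entrySet().stream()
        .max(Comparator.comparingLong(Map.Entry::getValue))
        .map(Map.Entry::getKey);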
new AbstractMap.SimpleEntry<>(entry.getKey(), entry.getValue().stream()
        .map(SqlTask::getTaskInfo)
        .map(TaskInfo::getStats)
        .mapToLong(stats -> stats.getUserMemoryReservationInBytes() + stats.getSystemMemoryReservationInBytes())
Should we add stats.getRevocableMemoryReservationInBytes() too?
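If it were added, the mapToLong call shown above would just gain one more term, e.g. (fragment only):

.mapToLong(stats -> stats.getUserMemoryReservationInBytes()
        + stats.getSystemMemoryReservationInBytes()
        + stats.getRevocableMemoryReservationInBytes())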
Can we add some unit tests?
return lowMemoryTaskKillerEnabled;
}

@Config("experimental.task.low-memory-task-killer-enabled")
Can this be enabled if the other two configs are set appropriately? Trying to avoid a third boolean config for this feature.
ManagementFactory.getPlatformMBeanServer().removeNotificationListener(objectName, gcNotificationListener);
}
catch (JMException ignored) {
    log.error("Error removing notification: " + ignored);
Maybe use log.error(ignored, msg) so the stack trace is logged.
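A sketch of the suggested change, assuming the Airlift Logger overload that takes the Throwable as the first argument so the stack trace is included:

catch (JMException e) {
    // Pass the exception itself so the full stack trace is logged.
    log.error(e, "Error removing GC notification listener");
}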
}
}

private boolean shouldTriggerTaskKiller(GarbageCollectionNotificationInfo info)
Should the check for info.isMajorGc be done inside here? Otherwise this method can only ever be called for full GCs.
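A sketch of pulling the major-GC check into the method itself (the helper name below is an assumption standing in for the existing checks):

private boolean shouldTriggerTaskKiller(GarbageCollectionNotificationInfo info)
{
    if (!info.isMajorGc()) {
        // Minor GCs never trigger the task killer.
        return false;
    }
    return isLowMemoryAndGcIneffective(info);  // existing low-memory/reclaim checks, name assumed
}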
DataSize beforeGcDataSize = info.getBeforeGcTotal();
DataSize afterGcDataSize = info.getAfterGcTotal();

if (taskKillerStrategy == FREE_MEMORY_ON_FREQUENT_FULL_GC) {
Should this be called FREE_MEMORY_... or just TRIGGER_ON_... or even KILL_ON_...? It doesn't free any memory explicitly, other than by failing queries.
lastFullGCCollectedBytes = currentGarbageCollectedBytes;
}
else if (taskKillerStrategy == FREE_MEMORY_ON_FULL_GC) {
    if (isLowMemory() && beforeGcDataSize.toBytes() - afterGcDataSize.toBytes() < reclaimMemoryThreshold) {
Can you use currentGarbageCollectedBytes here...?
Which strategy are we going to roll out with initially?
return Optional.empty();
}

Comparator<Map.Entry<QueryId, Long>> comparator = Comparator.comparingLong(Map.Entry::getValue);
Why not put the .reversed() here?
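What the reviewer seems to be suggesting, as a fragment (an explicit type witness would likely be needed for the method reference to compile once reversed() is chained):

Comparator<Map.Entry<QueryId, Long>> comparator =
        Comparator.<Map.Entry<QueryId, Long>>comparingLong(Map.Entry::getValue).reversed();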
Due to the static vs. non-static method split, this is not a straightforward move of reversed() to the top, so I am going to address it in a follow-up PR.
lastFullGCCollectedBytes = currentGarbageCollectedBytes;
}
else if (taskKillerStrategy == FREE_MEMORY_ON_FULL_GC) {
    if (isLowMemory() && beforeGcDataSize.toBytes() - afterGcDataSize.toBytes() < reclaimMemoryThreshold) {
Should the isLowMemory check be done in common for both approaches?
long currentGarbageCollectedBytes = beforeGcDataSize.toBytes() - afterGcDataSize.toBytes();
long currentFullGCTimestamp = ticker.read();

if (isFrequentFullGC(lastFullGCTimestamp, currentFullGCTimestamp) && !hasFullGCFreedEnoughBytes(currentGarbageCollectedBytes)) {
How is the initial case of lastFullGCTimestamp = 0 handled?
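One possible way to handle it, sketched here rather than taken from the PR, is to treat the first observed full GC as a baseline only:

// If there is no previous full GC on record, just record this one and do not kill.
if (lastFullGCTimestamp == 0) {
    lastFullGCTimestamp = currentFullGCTimestamp;
    lastFullGCCollectedBytes = currentGarbageCollectedBytes;
    return false;
}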
lastFullGCCollectedBytes = currentGarbageCollectedBytes;
}
else if (taskKillerStrategy == FREE_MEMORY_ON_FULL_GC) {
    if (isLowMemory() && beforeGcDataSize.toBytes() - afterGcDataSize.toBytes() < reclaimMemoryThreshold) {
This second approach, FREE_MEMORY_ON_FULL_GC, is very similar to the first one (even with the reclaimMemoryThreshold check); it's just time-invariant in that it does not look at the previous full GC.
Not a big deal, but can we commonify some code?
At Meta, we noticed instances where, if we run a worker on a low-memory machine with a memory-intensive workload, there are times when the worker is running low on memory and a full GC is not able to free up enough of it. This results in worker OOMs as the worker keeps progressing with its work and trying to acquire more memory. With this experimental feature, we want the worker to kill tasks when it is running low on memory (configurable) and a full GC is not able to free up enough memory (configurable). In that scenario the worker identifies the task consuming the most memory and kills it. There are two strategies implemented in this change for when the task killer kicks in: 1. during a full GC, if the worker is not able to free enough memory (configured) and heap usage is above the configured threshold; 2. during frequent full GCs (configured), if the worker is not able to reclaim enough memory (configured).
Description
With this experimental feature, we want the worker to kill tasks when it is running low on memory
(configurable) and a full GC is not able to free up enough memory (configurable). In that scenario the
worker identifies the task consuming the most memory and kills it.
There are two strategies implemented in this change for when the task killer kicks in:
1. During a full GC, if the worker is not able to free enough memory (configured) and heap usage is above the configured threshold.
2. During frequent full GCs (configured), if the worker is not able to reclaim enough memory (configured).
Motivation and Context
At Meta, we noticed instances where, if we run a worker on a low-memory machine with a
memory-intensive workload, there are times when the worker is running low on
memory and a full GC is not able to free up enough of it. This results in worker OOMs as the
worker keeps progressing with its work and trying to acquire more memory.
Impact
Prevents worker OOMs and lets queries run successfully instead of all running queries being killed.
Test Plan
Ran a shadow run in the Presto verifier. With the feature disabled, all queries died with REMOTE_TASK_ERROR or REMOTE_HOST_GONE due to worker OOMs. With the feature enabled, a subset of the queries failed with INSUFFICIENT_RESOURCES because worker heap memory was high, and the remaining subset finished successfully.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.