
Heap Memory Based Worker Flag to stop processing new split when in low memory #20946

Merged

Conversation

@swapsmagic swapsmagic (Contributor) commented Sep 22, 2023

Description

This PR introduces a low memory monitor that keeps an eye on heap usage. When heap usage crosses the configured threshold, the worker stops processing new splits but continues processing the existing running, waiting, and blocked splits. When heap memory usage goes back below the threshold, the worker starts accepting new splits again.
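
As a rough illustration of the mechanism described above (a sketch with made-up names, not the PR's actual classes), the gate boils down to comparing used heap against a fraction of committed heap:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    // Illustrative sketch only: a worker-side gate that declines to hand out new
    // splits while heap usage stays above a configured fraction of committed heap.
    public class LowMemorySplitGate
    {
        private final MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
        private final double threshold; // e.g. 0.9 means 90% of committed heap

        public LowMemorySplitGate(double threshold)
        {
            this.threshold = threshold;
        }

        public boolean shouldSkipNewSplits()
        {
            MemoryUsage heap = memoryMXBean.getHeapMemoryUsage();
            // Running, waiting, and blocked splits are unaffected; only the
            // admission of new splits is paused while this returns true.
            return heap.getUsed() > heap.getCommitted() * threshold;
        }
    }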

Motivation and Context

At Meta, we observed workers crashing due to out of memory, and the crashing workers were running many more splits (running + waiting) when memory usage spiked. Given the Multi Level Split Queue behavior, workers keep running all splits unless they are blocked. If the number of non-blocked splits is high, this results in memory pressure on the worker and a higher chance of OOM. With this change, we can configure the worker to avoid processing more splits when memory usage is high.

Impact

No Impact

Test Plan

Unit Tests and Meta verifier run

JMX metrics show that when worker memory usage crosses the threshold, the skip-split counter spikes, and it goes back down as memory drops below the threshold.

Five Minute Counter for Split Skip
[Screenshot: five-minute counter of skipped splits, 2023-10-03]

Heap Memory Usage
[Screenshot: heap memory usage, 2023-10-03]

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a config property to enable memory-based slowdown of split processing. This can be enabled via `task.memory-based-slowdown-threshold`.
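
For illustration, the property could be set in a worker's config.properties like this (the 0.9 value is an example, not a recommended default):

    # Example only: pause new-split admission once heap usage exceeds 90% of committed heap
    task.memory-based-slowdown-threshold=0.9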

@swapsmagic swapsmagic requested a review from a team as a code owner September 22, 2023 22:08
@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch from d545b6a to d67cf20 Compare September 22, 2023 22:29
@@ -254,6 +271,8 @@ public TaskExecutor(
Duration interruptSplitInterval,
EmbedVersion embedVersion,
MultilevelSplitQueue splitQueue,
boolean memoryBasedSlowDownEnabled,
double memoryBasedSlowDownThreshold,
Contributor

Can we have just one config and somehow infer enabled/disabled from that one?
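
A sketch of how that could look, assuming the existing memoryBasedSlowDownThreshold field from the diff above (treating a non-positive threshold as "disabled"):

    // Sketch: derive enabled/disabled from the threshold itself instead of a
    // separate boolean; a non-positive threshold means the feature is off.
    private boolean isMemoryBasedSlowDownEnabled()
    {
        return memoryBasedSlowDownThreshold > 0;
    }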

@@ -530,6 +551,13 @@ private synchronized void startSplit(PrioritizedSplitRunner split)

private synchronized PrioritizedSplitRunner pollNextSplitWorker()
{
if (memoryBasedSlowDownEnabled) {
MemoryUsage memoryUsage = MEMORY_MX_BEAN.getHeapMemoryUsage();
if (memoryUsage.getUsed() > memoryUsage.getCommitted() * memoryBasedSlowDownThreshold) {
Contributor

Can you JMH microbenchmark the cost of these bean calls -- I am concerned about introducing this in the hot path for each split.
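
A minimal JMH sketch for that measurement, assuming JMH is available in the benchmark module (class and method names are illustrative):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class BenchmarkHeapMemoryUsage
    {
        private final MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();

        @Benchmark
        public long readHeapMemoryUsage()
        {
            // Measures the cost of the MX bean call that would sit on the split poll path.
            return memoryMXBean.getHeapMemoryUsage().getUsed();
        }
    }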

Contributor

I don't think this is where you should plug in: this will simply allow running the existing splits in waitingSplits -- all this does is not release any more "scan/leaf" splits to the MLSQ -- it does nothing about intermediate splits, for example.

Did you consider teaching the MLSQ about this policy instead? It would naturally stall the split runner threads.

Contributor

The point-in-time checks might not be representative of overall health. If I am not wrong, memory allocation will follow a sawtooth pattern.
Relevant article from the Elasticsearch folks, where they rely on the fill rate of the old gen to detect memory pressure:
https://www.elastic.co/blog/found-understanding-memory-pressure-indicator
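
For reference, a sketch of how old-gen occupancy after the last collection could be read from the standard MX beans (the pool-name match is GC-dependent and is an assumption here):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;
    import java.lang.management.MemoryUsage;

    // Sketch: old-gen fill rate after the most recent GC as a memory-pressure signal,
    // along the lines of the Elasticsearch article above.
    public final class OldGenPressure
    {
        private OldGenPressure() {}

        public static double fillRateAfterLastGc()
        {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // Pool names vary by collector, e.g. "G1 Old Gen" or "PS Old Gen"
                if (pool.getType() == MemoryType.HEAP && pool.getName().contains("Old Gen")) {
                    MemoryUsage afterGc = pool.getCollectionUsage();
                    if (afterGc != null && afterGc.getMax() > 0) {
                        return (double) afterGc.getUsed() / afterGc.getMax();
                    }
                }
            }
            return 0;
        }
    }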

Contributor Author

The memory usage check is pulled out into a separate thread, so it won't impact the task/split processing.
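
A sketch of what that separate thread could look like (interval and names are assumptions, not the PR's actual code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: the heap check runs on its own single-threaded scheduler, so the
    // split-processing hot path only reads a flag and never calls the MX bean.
    public class LowMemoryMonitorScheduler
    {
        private final ScheduledExecutorService lowMemoryExecutor = Executors.newSingleThreadScheduledExecutor();

        public void start(Runnable checkLowMemory)
        {
            // checkLowMemory stands in for the routine that reads heap usage,
            // compares it to the threshold, and flips the worker's low-memory flag.
            lowMemoryExecutor.scheduleWithFixedDelay(checkLowMemory, 1, 1, TimeUnit.SECONDS);
        }
    }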

@tdcmeehan (Contributor)

Interesting work! Could we create an issue describing the design? I am curious how slowing down split processing will interact with memory management and how it will avoid deadlock.

@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch from d67cf20 to 067f149 Compare September 27, 2023 00:23
@swapsmagic (Contributor Author)

Interesting work! Could we create an issue describing the design? I am curious how slowing down split processing will interact with memory management and how it will avoid deadlock.

Added more details in the description.

@swapsmagic (Contributor Author)

how it will avoid deadlock.
@tdcmeehan can you share more details on what kind of deadlock you are anticipating in this scenario?

@tdcmeehan (Contributor)

I am trying to understand the idea and don't know if it will cause deadlock.

But deadlock crossed my mind because it seems this is taking a different route to tackle the same problem as the memory management framework: https://prestodb.io/blog/2019/08/19/memory-tracking

However, that framework also solves the larger problem: when one pool is blocked, to avoid the cluster from deadlocking and not making progress overall, it can kill queries (or promote them to the reserved pool).

If my idea about this change is correct, it seems this solves the problem locally but not the global problem. And if that's the case, suppose this framework kicks in: couldn't it cause either a slowdown or a lockup on one worker, which could cause a cluster-wide slowdown as well (since the cluster would be bottlenecked by the one single worker)?

Let me know if I'm missing some context or not understanding correctly.

@swapsmagic swapsmagic (Contributor Author) commented Sep 28, 2023

I am trying to understand the idea and don't know if it will cause deadlock.

As detailed in the description: we are noticing our low-memory workers crashing due to OOM, and at the time of the crash those workers had a higher number of splits (running + waiting). Given the workers are already using a lot of memory, we want to prevent them from taking on more work. To achieve that, we are introducing this low-memory flag, which prevents a worker from running new splits when a previous one finishes or when new ones are submitted as part of task update requests.

But deadlock crossed my mind because it seems this is taking a different route to tackle the same problem as the memory management framework: https://prestodb.io/blog/2019/08/19/memory-tracking

However, that framework also solves the larger problem: when one pool is blocked, to avoid the cluster from deadlocking and not making progress overall, it can kill queries (or promote them to the reserved pool).

True, the memory framework handles memory tracking. But at Meta we noticed that the memory accounted for in our MemoryPool is far lower than the actual heap usage. That means we can't rely on the memory framework to kick in and kill queries, given how large the gap is between how much memory is free according to the heap and how much is free according to the memory tracking framework.

If my idea about this change is correct, it seems this solves the problem locally but not the global problem. And if that's the case, suppose this framework kicks in: couldn't it cause either a slowdown or a lockup on one worker, which could cause a cluster-wide slowdown as well (since the cluster would be bottlenecked by the one single worker)?

Let me know if I'm missing some context or not understanding correctly.

The change is not trying to solve the overall cluster-level memory issue but a local issue where a worker is running hot and ends up running out of memory. To prevent that, we want the worker to do less work for the time being, until memory frees up, and then pick up work again. Sure, this will slow things down, but in exchange for reliability: without slowing down, the worker can crash, resulting in many more query failures, which is what we want to prevent in this scenario. Also, the feature is config-driven, so it can be tuned as needed and can also be disabled. Hope this explains the rationale behind this PR.

@MnO2 MnO2 (Contributor) commented Sep 28, 2023


@tdcmeehan We are looking for a stop-the-bleeding solution to the hot spot problem, mainly due to the split-brain issue from the disaggregated coordinator. We already have many OKR metric violations this month (you know the UER :p). Off the top of my head, if I have to construct a case for deadlock: it is when a query does a hash join, the build side accumulates very large data, and the hash table makes heap usage exceed the threshold. I think that might lead to a deadlock: the hash table wouldn't be released since the task is unfinished, and the hash join cannot proceed since no more leaf splits would get executed. However, when we set the local memory limit significantly lower than the threshold, the query should be killed before the threshold is reached. But this is speculation on my part.

@swapsmagic If what I speculated holds true, I guess it would be great to add a foolproof measure when loading the config. For example: when the threshold is below the local memory limit, warn the users about the potential consequence or suggest a correct config combination. Otherwise the PR looks good to me in general. For sure we need to address Tim's concern.

@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch from 067f149 to cbe2cfd Compare September 28, 2023 16:31
@swapsmagic (Contributor Author)

@swapsmagic If what I speculated holds true, I guess it would be great to add a foolproof measure when loading the config. For example: when the threshold is below the local memory limit, warn the users about the potential consequence or suggest a correct config combination. Otherwise the PR looks good to me in general. For sure we need to address Tim's concern.

Added a min value check so the threshold can't be set too low, to avoid a slowdown or deadlock situation.
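
As an illustration only (the actual bound and annotations in the PR may differ), such a floor could be expressed on the config getter with validation annotations:

    import javax.validation.constraints.DecimalMax;
    import javax.validation.constraints.DecimalMin;

    // Sketch: keep the threshold in a sane range so it cannot be set low enough
    // to effectively stall the worker; the 0.5 floor is an example value.
    @DecimalMin("0.5")
    @DecimalMax("1.0")
    public double getMemoryBasedSlowDownThreshold()
    {
        return memoryBasedSlowDownThreshold;
    }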

@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch 2 times, most recently from 40e9c9c to e03f31d Compare September 28, 2023 18:47
@MnO2 MnO2 (Contributor) left a comment

It looks good to me.

@agrawaldevesh agrawaldevesh (Contributor) left a comment

I want to confirm the semantics -- maybe you should call that out clearly in the PR description.

This PR does not pause running splits when there is memory pressure. It just doesn't take in any new leaf splits when that happens, right? So we should expect to see waitingSplits eventually run dry, right? The PR description says "slow down", but it is more like "pause taking in new work", i.e. it's very binary.

I don't see any tests for this. Also, I think we should describe in the PR test plan what improvements we saw with this: how you created a memory pressure scenario and how this PR helped with it. Maybe you can show a graph of RunningSplits / WaitingSplits or describe the same in words. As a control plane change, it would be great to validate its effectiveness and responsiveness.

// Worker skips processing new splits if JVM heap usage crosses the configured threshold
//Helps reduce memory pressure on the worker and avoid OOMs
if (isLowMemory()) {
log.debug("Skip task scheduling due to low memory");
Contributor

Would it be better to add this logging in the isLowMemory method, so that relevant extra stats can be logged there as well, providing more insight into that decision?

Also, maybe move the counter update there as well, to DRY out this code.

Contributor Author

In both places where we check for low memory, I added different log messages to help understand which flow is being skipped. This will help with debugging issues in the future, which is why it's a debug log and not an info log.

@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch from e03f31d to a9aee78 Compare October 2, 2023 18:51
@jainxrohit (Contributor)

@swapsmagic Can you please add some JMX metrics for the testing you did?

@jainxrohit jainxrohit self-requested a review October 3, 2023 04:53
@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch 2 times, most recently from 01cc3ff to 11628f0 Compare October 3, 2023 22:38
long memoryThreshold = (long) (maxMemory * threshold);

if (usedMemory > memoryThreshold) {
if (!taskExecutor.isLowMemory()) {
Contributor

It may be simpler to have the TaskExecutor do its own state keeping for low memory mode. It can ensure the idempotency you are shooting for, in addition to the logging. The call here can then become as simple as taskExecutor.setLowMemory(usedMemory, maxMemory, memoryThresh).

It should be a low-overhead call, since it is only done once every K milliseconds anyway.
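
A sketch of the suggested shape (signature and wording are illustrative; lowMemory and log are the TaskExecutor fields already used elsewhere in this PR):

    // Sketch: TaskExecutor owns the low-memory state transition and logs only when
    // the state actually changes, so the periodic monitor just reports what it measured.
    public void setLowMemory(long usedMemory, long maxMemory, long memoryThreshold)
    {
        boolean newValue = usedMemory > memoryThreshold;
        if (lowMemory != newValue) {
            log.debug("%s low memory mode: used=%s max=%s threshold=%s",
                    newValue ? "Enabling" : "Disabling", usedMemory, maxMemory, memoryThreshold);
            lowMemory = newValue;
        }
    }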

@ajaygeorge ajaygeorge (Contributor) left a comment

Adding some comments.


public void checkLowMemory()
{
MemoryMXBean mbean = ManagementFactory.getMemoryMXBean();
Contributor

you can initialize this once in the constructor.
private final MemoryMXBean memoryMXBean;
...
this.memoryMXBean = ManagementFactory.getMemoryMXBean();


if (usedMemory > memoryThreshold) {
if (!taskExecutor.isLowMemory()) {
log.debug("Enabling Low Memory: Used: " + usedMemory + " Max: " + maxMemory + " Threshold: " + memoryThreshold);
Contributor

nit: use parameterized logging.
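
For example, assuming Presto's airlift-style Logger with format-string varargs, the concatenation could become:

    // Sketch: parameterized logging instead of string concatenation
    log.debug("Enabling Low Memory: Used: %s Max: %s Threshold: %s", usedMemory, maxMemory, memoryThreshold);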

}
else {
if (taskExecutor.isLowMemory()) {
log.debug("Enabling Low Memory: Used: " + usedMemory + " Max: " + maxMemory + " Threshold: " + memoryThreshold);
Contributor

nit: use parameterized logging. Also, shouldn't the message be "Disabling" instead of "Enabling"?


private final TimeStat blockedQuantaWallTime = new TimeStat(MICROSECONDS);
private final TimeStat unblockedQuantaWallTime = new TimeStat(MICROSECONDS);

private volatile boolean closed;

private AtomicBoolean lowMemory = new AtomicBoolean(false);
Contributor

final

Contributor Author

Made the variable a volatile boolean.

Comment on lines 73 to 82
if (!taskExecutor.isLowMemory()) {
log.debug("Enabling Low Memory: Used: " + usedMemory + " Max: " + maxMemory + " Threshold: " + memoryThreshold);
taskExecutor.setLowMemory(true);
}
}
else {
if (taskExecutor.isLowMemory()) {
log.debug("Enabling Low Memory: Used: " + usedMemory + " Max: " + maxMemory + " Threshold: " + memoryThreshold);
taskExecutor.setLowMemory(false);
}
Contributor

The check-then-set part is interesting. I think even with atomic booleans, the value can change between the time you check and the time you set. I'm not sure whether that will create subtle issues in this code; please give it some thought.

Prefer using CAS operations to get rid of the check-then-set:

        if (usedMemory > memoryThreshold) {
            if (taskExecutor.getLowMemory().compareAndSet(false, true)) {
                log.debug("Enabling Low Memory: Used: " + usedMemory + " Max: " + maxMemory + " Threshold: " + memoryThreshold);
            }
        }
        else {
            if (taskExecutor.getLowMemory().compareAndSet(true, false)) {
                log.debug("Disabling Low Memory: Used: " + usedMemory + " Max: " + maxMemory + " Threshold: " + memoryThreshold);
            }
        }

Contributor Author

Given that a single thread is doing the check and update, this should not be an issue.

@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch 2 times, most recently from a7ada32 to 2f4c54a Compare October 4, 2023 16:36
@swapsmagic swapsmagic changed the title from "Heap Memory Based Worker Split Processing Slow Down" to "Heap Memory Based Worker Flag to stop processing new split when in low memory" Oct 4, 2023
@jainxrohit jainxrohit self-requested a review October 4, 2023 16:48
@jainxrohit jainxrohit (Contributor) left a comment

Looks good to me. Let's also mention in the description the follow-up change to the memory framework that we are planning.

taskExecutor.setLowMemory(true);
}
}
else {
Contributor

nit: else and if can be merged.
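
i.e. something along these lines (a sketch of the same logic from the diff above, logging omitted):

    // Sketch: nested check collapsed into an else-if
    if (usedMemory > memoryThreshold) {
        if (!taskExecutor.isLowMemory()) {
            taskExecutor.setLowMemory(true);
        }
    }
    else if (taskExecutor.isLowMemory()) {
        taskExecutor.setLowMemory(false);
    }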


private final TimeStat blockedQuantaWallTime = new TimeStat(MICROSECONDS);
private final TimeStat unblockedQuantaWallTime = new TimeStat(MICROSECONDS);

private volatile boolean closed;

private volatile boolean lowMemory;
Contributor

Please add a comment saying it should only be updated from one thread, the lowMemoryExecutor.
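
For example, something along these lines (comment wording is a suggestion):

    // Only updated by the single lowMemoryExecutor monitoring thread;
    // volatile visibility is sufficient here, no CAS needed.
    private volatile boolean lowMemory;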

@ajaygeorge ajaygeorge (Contributor) left a comment

LGTM

@ajaygeorge (Contributor)

Please address the checkstyle failures as well.

Introducing heap memory usage based slowdown of the worker. When worker memory utilization increases past a certain threshold,
the worker will stop processing new splits until memory usage goes back below the threshold. This helps avoid the scenario where
worker heap usage is already too high but the worker still has more work to do, resulting in an out of memory error.
@swapsmagic swapsmagic force-pushed the worker_to_process_based_on_heap_usage branch from 2f4c54a to ff3cbb8 Compare October 4, 2023 18:56
@swapsmagic swapsmagic merged commit 0010d55 into prestodb:master Oct 5, 2023
@tdcmeehan (Contributor)

As I mentioned, local slowdown of splits may cause distributed issues. Please, can we follow up this PR with two things:

  1. Given the experimental nature of the changes, please prefix the config with experimental.
  2. Please add a caveat in the ConfigDescription that this config setting may induce cluster slowdown or deadlock in certain conditions, at least while the experimental prefix remains.
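
A sketch of what that follow-up might look like (the exact property name, owning config class, and caveat wording here are assumptions):

    import com.facebook.airlift.configuration.Config;
    import com.facebook.airlift.configuration.ConfigDescription;

    // Sketch only: illustrative experimental-prefixed name and caveat text
    @Config("experimental.task.memory-based-slowdown-threshold")
    @ConfigDescription("Experimental: may induce cluster slowdown or deadlock in certain conditions")
    public TaskManagerConfig setMemoryBasedSlowDownThreshold(double memoryBasedSlowDownThreshold)
    {
        this.memoryBasedSlowDownThreshold = memoryBasedSlowDownThreshold;
        return this;
    }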

@swapsmagic (Contributor Author)


PR #21053 addresses it. It is blocked because the build is not succeeding, for unrelated reasons.

@wanglinsong wanglinsong mentioned this pull request Dec 8, 2023