-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer #3259
[SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer #3259
Conversation
The old code caused memory leaks.
/cc @andrewor14 and @rxin for review. |
Test build #23342 has started for PR 3259 at commit
|
@@ -913,8 +918,10 @@ private[nio] class ConnectionManager( | |||
} | |||
} | |||
|
|||
val timoutTaskHandle = ackTimeoutMonitor.newTimeout(timeoutTask, ackTimeout, TimeUnit.SECONDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timout?
LGTM aside from typo. |
Test build #23342 has finished for PR 3259 at commit
|
Test PASSed. |
LGTM too. |
Hi Josh, thanks for looking into this. I'm not sure this fully fixes the issue though. In my heap dump it looks like the memory leak is caused by the TimerTask holding a reference to the Promise, which would hold a reference to the message when it's set succesfully. A suggested fix would be to override cancel in the TimerTask to set the promise reference to null if the task is cancelled (and the run() method will not actually get to run). Edit: Oh I see you also changed to HashedWheelTimer which is much more prompt in removing canceled tasks, so this must be fine. Could be worth confirming this though.. Thanks |
Ah, good catch regarding the |
Add additional documentation.
@cristianopris I've updated this so that the TimerTask keeps a WeakReference to |
Test build #23390 has started for PR 3259 at commit
|
Test build #23390 has finished for PR 3259 at commit
|
Test PASSed. |
Latest changes LGTM |
val e = new IOException("sendMessageReliably failed because ack " + | ||
s"was not received within $ackTimeout sec") | ||
if (!promise.tryFailure(e)) { | ||
logWarning("Ignore error because promise is completed", e) | ||
Option(promiseReference.get) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not
val p = promiseReference.get
if (p == null) {
...
} else {
...
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually on the fence about this, but your comment tips me towards the == null
camp since it removes a level of nesting / indentation.
Test build #23420 has started for PR 3259 at commit
|
Test build #23420 has finished for PR 3259 at commit
|
Test PASSed. |
Merging in master & branch-1.2. Thanks! |
…sks; use HashedWheelTimer This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs. This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case. Thanks to cristianopris for narrowing down this issue! Author: Josh Rosen <[email protected]> Closes #3259 from JoshRosen/connection-manager-timeout-bugfix and squashes the following commits: afcc8d6 [Josh Rosen] Address rxin's review feedback. 2a2e92d [Josh Rosen] Keep only WeakReference to promise in TimerTask; 0f0913b [Josh Rosen] Spelling fix: timout => timeout 3200c33 [Josh Rosen] Use Netty HashedWheelTimer f847dd4 [Josh Rosen] Don't capture entire message in ACK timeout task. (cherry picked from commit 7850e0c) Signed-off-by: Reynold Xin <[email protected]>
We also need to fix this issue for branch-1.1 right? |
I don't think this merges cleanly into branch-1.1. Can one of you submit a pull request for that branch? Thanks. |
I opend #3321 for branch-1.1. |
This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs.
This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case.
Thanks to @cristianopris for narrowing down this issue!