-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13566][CORE] Avoid deadlock between BlockManager and Executor Thread #11546
Conversation
Whoops, my bad: I misunderstood this patch because it's missing a description. In the future, please submit a more complete description with your PR. |
|
||
if (releasedLocks.nonEmpty) { | ||
val errMsg = | ||
s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...which would tend to argue that this should be named unreleasedLocks, not releasedLocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markhamstra These codes are from #10705 by JoshRosen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoshRosen In my production environment, when the storage memory is full, there is a great probability of deadlock. It is a temporary patch because JoshRosen add a read/write lock for block in #10705 for Spark 2.0.
Two theads are removing the same block which result in deadlock.
BlockManager will first lock MemoryManager and wait to lock BlockInfo in function dropFromMemory, Execturo task lock BlockInfo and wait to lock MemoryManager calling memstore.remove(block) in function removeBlock or function removeOldBlocks.
So just use a ConcurrentHashMap to record the locks by tasks. In case of failure, release all lock after task complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markhamstra this is a little confusing but it's basically saying that all the locks are supposed to be released by now, so if we have to release any locks here (i.e. if releasedLocks.nonEmpty
) then something it wrong.
Any word on if this, or any other fix for SPARK-13566 will make it into 1.6.2? |
@jamesecahill I don't know whether @JoshRosen will provide any other patch for this issue. But I have fixed this bug in my production environment by this PR. |
ok to test |
Test build #57824 has finished for PR 11546 at commit
|
@@ -1051,11 +1058,13 @@ private[spark] class BlockManager( | |||
} | |||
blockIsUpdated = true | |||
} | |||
pendingToRemove.put(blockId, currentTaskAttemptId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This put should happen as soon as we check containsKey
in L1035. Between now and then many things could happen; another thread may find that containsKey
is false at the same time and both threads can still end up trying to drop the block.
@cenyuhai Thanks for fixing this issue. The two offending locks here seem to be Once you address the comments I left I will go ahead and merge this. |
Test build #57975 has finished for PR 11546 at commit
|
Merge remote-tracking branch 'remotes/apache/branch-1.6' into SPARK-13566 # Conflicts: # core/src/main/scala/org/apache/spark/storage/BlockManager.scala
Test build #57982 has finished for PR 11546 at commit
|
@andrewor14 I alter the code as what you said, but the test failed because of timeout. It seems like that it is none of my business... |
ok to test |
retest this please |
The changes themselves LGTM. I'm not sure why it timed out so let's try again. |
There was an incorrect merge conflict a3aa22a in the scheduler so maybe that is related. |
Test build #58014 has finished for PR 11546 at commit
|
Thanks, I'm merging this into branch-1.6. |
…Thread Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread. Author: cenyuhai <[email protected]> Closes #11546 from cenyuhai/SPARK-13566.
@cenyuhai this is now merged. Can you close it? |
…and Executor Thread apache#11546" This reverts commit ac8ac58.
…Thread Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread. Author: cenyuhai <[email protected]> Closes apache#11546 from cenyuhai/SPARK-13566. (cherry picked from commit ab00652)
…Thread Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread. Author: cenyuhai <[email protected]> Closes apache#11546 from cenyuhai/SPARK-13566.
Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread.