-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23637][YARN]Yarn might allocate more resource if a same executor is killed multiple times. #20781
Conversation
…or is killed multiple times.
Test build #88116 has finished for PR 20781 at commit
|
cc @vanzin @tgravescs @cloud-fan @djvulee |
NVM.
Can you please describe how this happened? |
@jerryshao Thanks for taking look. Yes, it does happen. we have jobs which have already finished all the tasks but still holding 40~100 executors. Well I'm not sure if it exists in non dynamic scenario. |
This basically means that drive send multiple same kill requests to AM, right? I'm wondering how this would happen, shall we also guarantee this in the driver side? |
@jerryshao I come to find that it's possible |
Since the change for |
Test build #88127 has finished for PR 20781 at commit
|
@@ -81,7 +81,7 @@ private[yarn] class YarnAllocator( | |||
private val releasedContainers = Collections.newSetFromMap[ContainerId]( | |||
new ConcurrentHashMap[ContainerId, java.lang.Boolean]) | |||
|
|||
private val numExecutorsRunning = new AtomicInteger(0) | |||
private val runningExecutors = new java.util.concurrent.ConcurrentHashMap[String, Unit]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be changed to Collections.newSetFromMap
, since we only need Set
instead of Map
.
Still I'm not so sure about the root cause, but adding defensive code seems no harm. |
@jerryshao |
Test build #88178 has finished for PR 20781 at commit
|
The change looks good, but did you look at why the code is trying to kill the same executor multiple times? That sounds like it could be a possible bug on the scheduler backend, which should be keeping track of these things. |
@vanzin
|
retest this please |
Test build #88837 has finished for PR 20781 at commit
|
Merging to master / 2.3. |
…tor is killed multiple times. ## What changes were proposed in this pull request? `YarnAllocator` uses `numExecutorsRunning` to track the number of running executor. `numExecutorsRunning` is used to check if there're executors missing and need to allocate more. In current code, `numExecutorsRunning` can be negative when driver asks to kill a same idle executor multiple times. ## How was this patch tested? UT added Author: jinxing <[email protected]> Closes #20781 from jinxing64/SPARK-23637. (cherry picked from commit d3bd043) Signed-off-by: Marcelo Vanzin <[email protected]>
…tor is killed multiple times. ## What changes were proposed in this pull request? `YarnAllocator` uses `numExecutorsRunning` to track the number of running executor. `numExecutorsRunning` is used to check if there're executors missing and need to allocate more. In current code, `numExecutorsRunning` can be negative when driver asks to kill a same idle executor multiple times. ## How was this patch tested? UT added Author: jinxing <[email protected]> Closes apache#20781 from jinxing64/SPARK-23637.
…tor is killed multiple times. ## What changes were proposed in this pull request? `YarnAllocator` uses `numExecutorsRunning` to track the number of running executor. `numExecutorsRunning` is used to check if there're executors missing and need to allocate more. In current code, `numExecutorsRunning` can be negative when driver asks to kill a same idle executor multiple times. ## How was this patch tested? UT added Author: jinxing <[email protected]> Closes apache#20781 from jinxing64/SPARK-23637.
@vanzin Thanks for merging. |
…tor is killed multiple times. `YarnAllocator` uses `numExecutorsRunning` to track the number of running executor. `numExecutorsRunning` is used to check if there're executors missing and need to allocate more. In current code, `numExecutorsRunning` can be negative when driver asks to kill a same idle executor multiple times. UT added Author: jinxing <[email protected]> Closes apache#20781 from jinxing64/SPARK-23637. (cherry picked from commit d3bd043) Signed-off-by: Marcelo Vanzin <[email protected]> Change-Id: I5b70fa55343828b303f96e0672525259f33e43ab
What changes were proposed in this pull request?
YarnAllocator
usesnumExecutorsRunning
to track the number of running executor.numExecutorsRunning
is used to check if there're executors missing and need to allocate more.In current code,
numExecutorsRunning
can be negative when driver asks to kill a same idle executor multiple times.How was this patch tested?
UT added