[SPARK-7989][Core][Tests] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite #6546
Test build #33872 has finished for PR 6546 at commit
@zsxwing, are the issues you identified actually causing test failures for you or in Jenkins, or is this theoretical?
Test build #33885 timed out for PR 6546 at commit
retest this please
Test build #33888 has finished for PR 6546 at commit
If this is just for testing, how about we put it into a SparkListener in the tests instead, and avoid changing
The problem with a new SparkListener is that "adding an Executor" may happen before "adding the new SparkListener to SparkContext". In that case the new SparkListener misses any messages posted before it was added to SparkContext.
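To make the ordering problem concrete, here is a minimal sketch of a listener that registers too late. `ToyListenerBus` and `LateListenerDemo` are hypothetical stand-ins written for this illustration, not Spark classes; the only assumption is that the bus delivers each event to the listeners registered at the time it is posted and never replays earlier events.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a listener bus; not Spark's actual implementation.
final class ToyListenerBus {
  private val listeners = ArrayBuffer.empty[String => Unit]

  def addListener(listener: String => Unit): Unit = {
    listeners += listener
  }

  // An event reaches only the listeners registered at post time; nothing is replayed.
  def post(event: String): Unit = listeners.foreach(l => l(event))
}

object LateListenerDemo {
  // Returns the events seen by a listener added before any event,
  // and by a listener added after the first event was already posted.
  def run(): (List[String], List[String]) = {
    val bus = new ToyListenerBus
    val early = ArrayBuffer.empty[String]
    val late = ArrayBuffer.empty[String]

    bus.addListener(e => early += e)
    bus.post("ExecutorAdded(0)") // posted before the late listener exists
    bus.addListener(e => late += e)
    bus.post("ExecutorAdded(1)")

    (early.toList, late.toList)
  }
}
```

In this sketch the late listener never sees `ExecutorAdded(0)`, so a test that counts executors through it could undercount.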
Hmm, good point. You can make sure listeners are added immediately with
retest this please
@@ -55,6 +55,14 @@ class ExternalShuffleServiceSuite extends ShuffleSuite with BeforeAndAfterAll {
sc.env.blockManager.externalShuffleServiceEnabled should equal(true)
sc.env.blockManager.shuffleClient.getClass should equal(classOf[ExternalShuffleClient])

// In a slow machine, one slave may register hundreds of milliseconds ahead of the other one.
// If we don't wait for all salves, it's possible that only one executor runs all jobs. Then
salves
@zsxwing IIUC this only adds some wait logic for the two test suites, but the intent is that we don't change the logic in
master 1.4
…Suite and SparkListenerWithClusterSuite

The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs. This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.

Author: zsxwing <[email protected]>

Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits:

5560e09 [zsxwing] Fix a typo
3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite

(cherry picked from commit f271347)
Signed-off-by: Andrew Or <[email protected]>

Conflicts:
core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala
core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
Test build #34116 timed out for PR 6546 at commit
Right.
…Suite and SparkListenerWithClusterSuite

The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs. This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.

Author: zsxwing <[email protected]>

Closes apache#6546 from zsxwing/SPARK-7989 and squashes the following commits:

5560e09 [zsxwing] Fix a typo
3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs.
This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.
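As a rough illustration of the "wait until the expected executors are up" idea, here is a minimal polling sketch. It is not the method added by this PR: `WaitForExecutors.waitUntil`, its parameters, and the `currentCount` callback are hypothetical names chosen for the example, and the polling strategy is an assumption about one reasonable way to implement such a wait.

```scala
import java.util.concurrent.TimeoutException

object WaitForExecutors {
  // Hypothetical helper, not the method added by this PR.
  // Polls `currentCount` until it reaches `expected`, or throws once `timeoutMs` has elapsed.
  def waitUntil(expected: Int, timeoutMs: Long, pollMs: Long = 10L)(currentCount: () => Int): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (currentCount() < expected) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(
          s"Only ${currentCount()} of $expected executors were up after $timeoutMs ms")
      }
      // Sleep between checks instead of spinning on the counter.
      Thread.sleep(pollMs)
    }
  }
}
```

A test would call something like this once after creating the context and before submitting jobs, passing whatever executor count the suite can observe. Failing with a timeout, rather than waiting forever, keeps a cluster that never comes up from hanging the build.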