
[SPARK-7989][Core][Tests] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite #6546

Closed
zsxwing wants to merge 2 commits into apache:master from zsxwing:SPARK-7989

Conversation

@zsxwing (Member) commented Jun 1, 2015

The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite fail when not enough executors are up before the jobs run.

This PR adds `JobProgressListener.waitUntilExecutorsUp`. Cluster-mode tests can use it to wait until the expected number of executors is up.
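
For illustration, a cluster-mode test would use the helper roughly as follows. This is only a sketch: the method name comes from the PR description, while the `local-cluster` sizing, the timeout, and the access path through `sc.jobProgressListener` are assumptions.

```scala
package org.apache.spark // assumed: JobProgressListener is private[spark], so the caller sits in this package

object WaitUntilExecutorsUpExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local-cluster[2, 1, 512]") // two executor JVMs, 1 core and 512 MB each
      .setAppName("wait-until-executors-up-demo")
    val sc = new SparkContext(conf)
    try {
      // JobProgressListener is registered while SparkContext is being constructed,
      // so it cannot miss executor registrations. Block until both executors are
      // up, or fail after the timeout (argument order is an assumption):
      sc.jobProgressListener.waitUntilExecutorsUp(2, 10000)
      // Only now is it safe to assume the job below runs on both executors.
      assert(sc.parallelize(1 to 100, 4).count() == 100)
    } finally {
      sc.stop()
    }
  }
}
```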

@SparkQA commented Jun 1, 2015

Test build #33872 has finished for PR 6546 at commit 3b69840.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jun 1, 2015

@zsxwing are these things you identify actually causing test failures for you or in Jenkins, or is this theoretical?

@zsxwing (Member, Author) commented Jun 1, 2015

@srowen These two failures happened in my PR #6457. I fixed them there to get Jenkins to pass, but since they are not really part of #6457, it's better to fix them in a separate PR.

@SparkQA commented Jun 1, 2015

Test build #33885 timed out for PR 6546 at commit 5560e09 after a configured wait of 150m.

@zsxwing (Member, Author) commented Jun 1, 2015

retest this please

@SparkQA commented Jun 1, 2015

Test build #33888 has finished for PR 6546 at commit 5560e09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor) commented Jun 1, 2015

if this is just for testing, how about we put it into a SparkListener in the tests instead, and avoid changing JobProgressListener? You could put it into a common listener trait that could be used for both tests. The listener would just count how many times onExecutorAdded / onExecutorRemoved were called and would have a similar waitUntilExecutorsUp method.
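
A minimal sketch of what squito suggests, assuming the trait just counts executor add/remove events and polls (all names below are illustrative, not from the patch):

```scala
import java.util.concurrent.TimeoutException
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Test-only mixin that tracks how many executors are currently up.
trait ExecutorCountingListener extends SparkListener {
  private val liveExecutors = new AtomicInteger(0)

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    liveExecutors.incrementAndGet()
  }

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    liveExecutors.decrementAndGet()
  }

  // Poll until `expected` executors are up, or give up after `timeoutMs`.
  def waitUntilExecutorsUp(expected: Int, timeoutMs: Long): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (liveExecutors.get() < expected) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException(
          s"Only ${liveExecutors.get()} of $expected executors up after $timeoutMs ms")
      }
      Thread.sleep(10) // polling is acceptable for test-only code
    }
  }
}
```

The catch, raised in the next comment, is registration order: a listener like this only sees events fired after it is added to the SparkContext.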

@zsxwing (Member, Author) commented Jun 1, 2015

> if this is just for testing, how about we put it into a SparkListener in the tests instead, and avoid changing JobProgressListener? You could put it into a common listener trait that could be used for both tests. The listener would just count how many times onExecutorAdded / onExecutorRemoved were called and would have a similar waitUntilExecutorsUp method.

The problem with a new SparkListener is that "adding an Executor" may happen before "adding the new SparkListener to SparkContext". So the new SparkListener may miss any messages sent before it is added. JobProgressListener is registered as early as possible, so it won't miss any message.

@squito (Contributor) commented Jun 2, 2015

> The problem with a new SparkListener is that "adding an Executor" may happen before "adding the new SparkListener to SparkContext". So the new SparkListener may miss any messages sent before it is added. JobProgressListener is registered as early as possible, so it won't miss any message.

hmm, good point. You can make sure listeners are added immediately with spark.extraListeners, but that is admittedly a pain since you don't get a handle on your listener, so you need to communicate via some global variable. I guess I don't feel strongly either way then.
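
A sketch of the `spark.extraListeners` workaround: the listener class is instantiated reflectively during SparkContext construction, so the test never gets a handle on the instance and has to communicate through global state. Apart from the `spark.extraListeners` configuration key itself, everything below is an illustrative assumption.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

// Global side channel: SparkContext creates the listener via reflection,
// so the test cannot reference the instance directly.
object ExecutorEvents {
  val added = new ConcurrentLinkedQueue[String]()
}

// Needs a zero-arg constructor so spark.extraListeners can instantiate it.
class EarlyExecutorListener extends SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    ExecutorEvents.added.add(event.executorId)
  }
}

object ExtraListenersExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local-cluster[2, 1, 512]")
      .setAppName("extra-listeners-demo")
      // Registered during SparkContext construction, before any executor
      // can come up, so no onExecutorAdded event is missed.
      .set("spark.extraListeners", classOf[EarlyExecutorListener].getName)
    val sc = new SparkContext(conf)
    try {
      val deadline = System.currentTimeMillis() + 10000
      while (ExecutorEvents.added.size < 2 && System.currentTimeMillis() < deadline) {
        Thread.sleep(10)
      }
      assert(ExecutorEvents.added.size >= 2, "executors did not come up in time")
    } finally {
      sc.stop()
    }
  }
}
```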

@andrewor14 (Contributor) commented:

retest this please

```diff
@@ -55,6 +55,14 @@ class ExternalShuffleServiceSuite extends ShuffleSuite with BeforeAndAfterAll {
     sc.env.blockManager.externalShuffleServiceEnabled should equal(true)
     sc.env.blockManager.shuffleClient.getClass should equal(classOf[ExternalShuffleClient])
 
+    // In a slow machine, one slave may register hundreds of milliseconds ahead of the other one.
+    // If we don't wait for all salves, it's possible that only one executor runs all jobs. Then
```
Review comment from a Contributor on the added line:

salves

@andrewor14 (Contributor) commented:

@zsxwing IIUC this only adds some wait logic for the two test suites, but the intent is that we don't change the logic in BroadcastSuite, right?

@andrewor14 (Contributor) commented:

Merging into master and 1.4.

@asfgit closed this in f271347 on Jun 3, 2015
asfgit pushed a commit that referenced this pull request Jun 3, 2015
[SPARK-7989][Core][Tests] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite

The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs.

This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.

Author: zsxwing <[email protected]>

Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits:

5560e09 [zsxwing] Fix a typo
3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite

(cherry picked from commit f271347)
Signed-off-by: Andrew Or <[email protected]>

Conflicts:
	core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
@SparkQA commented Jun 4, 2015

Test build #34116 timed out for PR 6546 at commit 5560e09 after a configured wait of 150m.

@zsxwing (Member, Author) commented Jun 4, 2015

> @zsxwing IIUC this only adds some wait logic for the two test suites, but the intent is that we don't change the logic in BroadcastSuite, right?

Right. BroadcastSuite could also use this new method, so I updated it to simplify the code. And thanks for fixing the typos.

@zsxwing deleted the SPARK-7989 branch on June 4, 2015 03:09
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015