
[SPARK-3786] [PySpark] speedup tests #2646

Closed
wants to merge 2 commits

Conversation

davies
Contributor

@davies davies commented Oct 3, 2014

This patch tries to speed up the PySpark tests by reusing the SparkContext in tests.py and mllib/tests.py, which reduces the overhead of creating a SparkContext, and by removing some test cases that did not make sense. It also improves the performance of some cases, such as MergerTests and SortTests.

before this patch:

real 21m27.320s
user 4m42.967s
sys 0m17.343s

after this patch:

real 9m47.541s
user 2m12.947s
sys 0m14.543s

It cuts the overall running time by more than half.
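
The main saving comes from reusing a SparkContext across tests instead of creating a fresh one for every test. A minimal sketch of one way to do that (assuming unittest with class-level fixtures; the class names, master URL, and example test are illustrative, not necessarily what the patch uses):

    import unittest

    from pyspark import SparkContext


    class ReusedPySparkTestCase(unittest.TestCase):
        """Share a single SparkContext across all tests in the class."""

        @classmethod
        def setUpClass(cls):
            # Created once per class instead of once per test method, so the
            # JVM/daemon startup cost is paid far fewer times.
            cls.sc = SparkContext("local[4]", cls.__name__)

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()


    class ExampleTests(ReusedPySparkTestCase):
        def test_sum(self):
            self.assertEqual(6, self.sc.parallelize([1, 2, 3]).sum())

Note that setUpClass/tearDownClass require Python 2.7's unittest (or the unittest2 backport on 2.6).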

self.sc.stop()
self.assertRaises(Exception, lambda: SparkContext("an-invalid-master-name"))
self.sc = SparkContext("local")

Contributor Author

move to ContextTests.
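
For illustration, once contexts are shared, a check like this fits better in a context-focused suite than in the middle of an RDD test that would otherwise have to stop and rebuild the shared context. A sketch under that assumption (the method name is illustrative, not the exact code from the patch):

    import unittest

    from pyspark import SparkContext


    class ContextTests(unittest.TestCase):
        def test_failed_sparkcontext_creation(self):
            # An invalid master URL should raise rather than leave a
            # half-initialized context behind for later tests.
            self.assertRaises(Exception, lambda: SparkContext("an-invalid-master-name"))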

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21274/

@SparkQA

SparkQA commented Oct 3, 2014

QA tests have started for PR 2646 at commit 6a2a4b0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 3, 2014

QA tests have finished for PR 2646 at commit 6a2a4b0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

This is a great set of refactorings! Thanks for improving the consistency of the test suite names.

@@ -152,7 +152,7 @@ def test_external_sort(self):
         self.assertGreater(shuffle.DiskBytesSpilled, last)
 
     def test_external_sort_in_rdd(self):
-        conf = SparkConf().set("spark.python.worker.memory", "1m")
+        conf = SparkConf().set("spark.python.worker.memory", "10m")
Contributor

Why did this test originally change the worker memory? Is the goal here to force spilling?

Maybe we could add an undocumented "always spill / always externalize" configuration option to force spilling irrespective of memory limits in order to test this code. I suppose that we still might want tests like this, though, to check that the memory usage monitoring is working correctly, although I suppose we could write a separate test that only tests the memory monitoring.
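
For context, the small spark.python.worker.memory value is what pushes the Python workers over their memory limit so that the sort has to spill to disk; as noted later in the thread, this RDD-level test then only checks that the sorted output is still correct rather than inspecting shuffle.DiskBytesSpilled. A rough sketch of that shape (sizes and names are illustrative, not the exact test body):

    import random

    from pyspark import SparkConf, SparkContext

    # A tiny per-worker memory limit so that sorting even a modest RDD
    # should exceed it and spill to disk inside the workers.
    conf = SparkConf().set("spark.python.worker.memory", "1m")
    sc = SparkContext(conf=conf)

    data = list(range(10240))
    random.shuffle(data)

    rdd = sc.parallelize(data, 2)
    # Only the result is asserted here; this path does not check
    # shuffle.DiskBytesSpilled (the spilling happens inside the workers).
    assert sorted(data) == rdd.sortBy(lambda x: x).collect()
    sc.stop()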

Contributor

I ask because I wonder whether increasing this value will change the behavior of the test.

Contributor Author

This change is not necessary; even given 10M it will still spill, and we had checked that in the test.

Contributor

Ah, do you mean that test_external_sort above already exercises the path where we spill? I was just curious because this test_external_sort_in_rdd test doesn't check shuffle.DiskBytesSpilled.

Contributor Author

Oh, I'm wrong, it's not checked in this case, so I'd revert this change.

Contributor

I haven't merged this PR, so I think you can revert it here.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21317/

@JoshRosen
Contributor

LGTM. Even though the last Jenkins test failed, the last patch is a one-character change, so I don't think we need to wait for a whole other Jenkins run before merging. Thanks for cleaning up these tests!

@asfgit asfgit closed this in 4f01265 Oct 6, 2014
@jameszhouyi

Hi @davies @JoshRosen

Found the errors below after adding 'time' in run-tests:
Running PySpark tests. Output is in python/unit-tests.log.
Testing with Python version:
Python 2.6.6
Run core tests ...
Running test: pyspark/rdd.py
./python/run-tests: line 37: time: command not found
./python/run-tests: line 37: time: command not found

@davies
Contributor Author

davies commented Oct 7, 2014

What shell are you running it in?

@jameszhouyi

Hi @davies,
The error has been fixed via 'yum install time'. Thanks.
