
[SPARK-3786] [PySpark] speedup tests #2646

Closed
wants to merge 2 commits

Conversation

davies
Contributor

@davies davies commented Oct 3, 2014

This patch tries to speed up the PySpark tests by reusing the SparkContext in tests.py and mllib/tests.py, which reduces the overhead of creating a SparkContext, and by removing some test cases that did not make sense. It also improves the performance of some cases, such as MergerTests and SortTests.

before this patch:

real 21m27.320s
user 4m42.967s
sys 0m17.343s

after this patch:

real 9m47.541s
user 2m12.947s
sys 0m14.543s

It cuts the overall running time by more than half.
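
The main saving comes from reusing a SparkContext across tests instead of creating a fresh one for every test. A minimal sketch of one way to do that (assuming unittest with class-level fixtures; the class names, master URL, and example test are illustrative, not necessarily what the patch uses):

    import unittest

    from pyspark import SparkContext


    class ReusedPySparkTestCase(unittest.TestCase):
        """Share a single SparkContext across all tests in the class."""

        @classmethod
        def setUpClass(cls):
            # Created once per class instead of once per test method, so the
            # JVM/daemon startup cost is paid far fewer times.
            cls.sc = SparkContext("local[4]", cls.__name__)

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()


    class ExampleTests(ReusedPySparkTestCase):
        def test_sum(self):
            self.assertEqual(6, self.sc.parallelize([1, 2, 3]).sum())

Note that setUpClass/tearDownClass require Python 2.7's unittest (or the unittest2 backport on 2.6).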

self.sc.stop()
self.assertRaises(Exception, lambda: SparkContext("an-invalid-master-name"))
self.sc = SparkContext("local")

Contributor Author

move to ContextTests.
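
For illustration, once contexts are shared, a check like this fits better in a context-focused suite than in the middle of an RDD test that would otherwise have to stop and rebuild the shared context. A sketch under that assumption (the method name is illustrative, not the exact code from the patch):

    import unittest

    from pyspark import SparkContext


    class ContextTests(unittest.TestCase):
        def test_failed_sparkcontext_creation(self):
            # An invalid master URL should raise rather than leave a
            # half-initialized context behind for later tests.
            self.assertRaises(Exception, lambda: SparkContext("an-invalid-master-name"))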

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21274/

@SparkQA

SparkQA commented Oct 3, 2014

QA tests have started for PR 2646 at commit 6a2a4b0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 3, 2014

QA tests have finished for PR 2646 at commit 6a2a4b0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

This is a great set of refactorings! Thanks for improving the consistency of the test suite names.

@@ -152,7 +152,7 @@ def test_external_sort(self):
         self.assertGreater(shuffle.DiskBytesSpilled, last)
 
     def test_external_sort_in_rdd(self):
-        conf = SparkConf().set("spark.python.worker.memory", "1m")
+        conf = SparkConf().set("spark.python.worker.memory", "10m")
Contributor

Why did this test originally change the worker memory? Is the goal here to force spilling?

Maybe we could add an undocumented "always spill / always externalize" configuration option to force spilling irrespective of memory limits in order to test this code. I suppose that we still might want tests like this, though, to check that the memory usage monitoring is working correctly, although I suppose we could write a separate test that only tests the memory monitoring.
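
For context, the small spark.python.worker.memory value is what pushes the Python workers over their memory limit so that the sort has to spill to disk; as noted later in the thread, this RDD-level test then only checks that the sorted output is still correct rather than inspecting shuffle.DiskBytesSpilled. A rough sketch of that shape (sizes and names are illustrative, not the exact test body):

    import random

    from pyspark import SparkConf, SparkContext

    # A tiny per-worker memory limit so that sorting even a modest RDD
    # should exceed it and spill to disk inside the workers.
    conf = SparkConf().set("spark.python.worker.memory", "1m")
    sc = SparkContext(conf=conf)

    data = list(range(10240))
    random.shuffle(data)

    rdd = sc.parallelize(data, 2)
    # Only the result is asserted here; this path does not check
    # shuffle.DiskBytesSpilled (the spilling happens inside the workers).
    assert sorted(data) == rdd.sortBy(lambda x: x).collect()
    sc.stop()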

Contributor

I ask because I wonder whether increasing this value will change the behavior of the test.

Contributor Author

This change is not necessary; even given 10M it will still spill, and we had checked that in the test.

Contributor

Ah, do you mean that test_external_sort above already exercises the path where we spill? I was just curious because this test_external_sort_in_rdd test doesn't check shuffle.DiskBytesSpilled.

Contributor Author

Oh, I'm wrong, it's not checked in this case, so I'd revert this change.

Contributor

I haven't merged this PR, so I think you can revert it here.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21317/

@JoshRosen
Contributor

LGTM. Even though the last Jenkins test failed, the last patch is a one-character change, so I don't think we need to wait for a whole other Jenkins run before merging. Thanks for cleaning up these tests!

@asfgit asfgit closed this in 4f01265 Oct 6, 2014
@jameszhouyi

Hi @davies @JoshRosen

Found the errors below after adding 'time' in run-tests:
Running PySpark tests. Output is in python/unit-tests.log.
Testing with Python version:
Python 2.6.6
Run core tests ...
Running test: pyspark/rdd.py
./python/run-tests: line 37: time: command not found
./python/run-tests: line 37: time: command not found

@davies
Contributor Author

davies commented Oct 7, 2014

What shell are you running it in?

@jameszhouyi

Hi @davies,
The error has been fixed via 'yum install time'. Thanks.
