[SPARK-3786] [PySpark] speedup tests #2646
Conversation
self.sc.stop()
self.assertRaises(Exception, lambda: SparkContext("an-invalid-master-name"))
self.sc = SparkContext("local")
move to ContextTests.
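For illustration, a minimal sketch of what that might look like once moved into a dedicated ContextTests case (the class and method names here are assumptions, not the exact code in this PR):

import unittest
from pyspark import SparkContext

class ContextTests(unittest.TestCase):
    def test_invalid_master(self):
        # Constructing a SparkContext with a bogus master should raise,
        # without touching any shared context reused by other suites.
        self.assertRaises(Exception,
                          lambda: SparkContext("an-invalid-master-name"))

if __name__ == "__main__":
    unittest.main()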
Test FAILed.
QA tests have started for PR 2646 at commit
QA tests have finished for PR 2646 at commit
This is a great set of refactorings! Thanks for improving the consistency of the test suite names.
@@ -152,7 +152,7 @@ def test_external_sort(self):
         self.assertGreater(shuffle.DiskBytesSpilled, last)

     def test_external_sort_in_rdd(self):
-        conf = SparkConf().set("spark.python.worker.memory", "1m")
+        conf = SparkConf().set("spark.python.worker.memory", "10m")
Why did this test originally change the worker memory? Is the goal here to force spilling?
Maybe we could add an undocumented "always spill / always externalize" configuration option that forces spilling regardless of memory limits in order to test this code. We might still want tests like this one to check that the memory-usage monitoring works correctly, although that could also be covered by a separate test dedicated to memory monitoring.
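For context, here is a rough sketch of how a test can force the Python worker to spill during a sort simply by lowering spark.python.worker.memory; the data size, partition count, and app name are illustrative assumptions, not the exact test body:

import random
from pyspark import SparkConf, SparkContext

# Illustrative values: a tiny worker memory budget so the external sorter
# should have to spill to disk even for a modest amount of data.
conf = SparkConf().set("spark.python.worker.memory", "1m")
sc = SparkContext("local", "external-sort-demo", conf=conf)
try:
    data = list(range(10240))
    random.shuffle(data)
    rdd = sc.parallelize(data, 10)
    # sortBy runs inside the Python worker, which is expected to hit the
    # memory limit above and spill part of the data to disk.
    assert rdd.sortBy(lambda x: x).collect() == sorted(data)
finally:
    sc.stop()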
I ask because I wonder whether increasing this value will change the behavior of the test.
This change is not necessary: even with 10M it will still spill; we had checked that in the test.
Ah, do you mean that test_external_sort above already exercises the path where we spill? I was just curious because this test_external_sort_in_rdd test doesn't check shuffle.DiskBytesSpilled.
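For reference, a hedged sketch of the kind of driver-side check test_external_sort relies on, using pyspark.shuffle.ExternalSorter directly; the constructor argument (a memory limit, assumed to be in MB) and the data size are assumptions based on the surrounding diff:

import random
from pyspark import shuffle
from pyspark.shuffle import ExternalSorter

data = list(range(10240))
random.shuffle(data)

last = shuffle.DiskBytesSpilled
# A very small memory limit should force the sorter to spill to disk,
# which the module-level DiskBytesSpilled counter records.
sorter = ExternalSorter(1)
assert list(sorter.sorted(data)) == sorted(data)
assert shuffle.DiskBytesSpilled > last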
Oh, I'm wrong, it's not checked in this case, so I'd revert this change.
I haven't merged this PR, so I think you can revert it here.
Test FAILed.
LGTM. Even though the last Jenkins test failed, the last patch is a one-character change, so I don't think we need to wait for a whole other Jenkins run before merging. Thanks for cleaning up these tests!
I found the errors below after adding 'time' in run-tests.
What shell are you running it in?
Hi @davies,
This patch tries to speed up the PySpark tests by reusing the SparkContext in tests.py and mllib/tests.py to reduce the overhead of creating a SparkContext, and by removing some test cases that did not make sense. It also improves the performance of some cases, such as MergerTests and SortTests. A sketch of the reuse pattern follows the timings below.
before this patch:
real 21m27.320s
user 4m42.967s
sys 0m17.343s
after this patch:
real 9m47.541s
user 2m12.947s
sys 0m14.543s
It cuts the wall-clock time by more than half.
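As a rough illustration of the reuse pattern described above (a sketch only; the class and test names here are assumptions, not the exact code in tests.py), a shared SparkContext can be created once per test class instead of once per test method:

import unittest
from pyspark import SparkContext

class ReusedPySparkTestCase(unittest.TestCase):
    """Share one SparkContext across every test in the class, so the cost
    of starting the JVM-backed context is paid once instead of per test."""

    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext("local[4]", cls.__name__)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

class ExampleTests(ReusedPySparkTestCase):
    def test_count(self):
        self.assertEqual(self.sc.parallelize(range(10)).count(), 10)

if __name__ == "__main__":
    unittest.main()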