[SPARK-3030] [PySpark] Reuse Python worker #2259
Conversation
@@ -50,7 +50,7 @@ echo "Running PySpark tests. Output is in python/unit-tests.log."

 # Try to test with Python 2.6, since that's the minimum version that we support:
 if [ $(which python2.6) ]; then
-    export PYSPARK_PYTHON="python2.6"
+    export PYSPARK_PYTHON="pypy"
Looks like this change got pulled in by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes:)
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Jenkins, retest this please.
Jenkins, test this please.
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Jenkins, retest this please.
Jenkins, retest this please.
Do you think worker re-use should be enabled by default? The only problem that I anticipate is for applications that share a single SparkContext with both Python and Scala processes; in these cases, the Python tasks may continue to hog resources (memory that's not used for caching RDDs) even after they complete. This seems like a rare use-case, though, so we could document this change and advise those users to disable this setting. I'm inclined to have it on by default, since it will be a huge performance win for the vast majority of PySpark users.
It would be interesting to measure the end-to-end performance impact for more realistic jobs, especially ones that make use of large numbers of tasks and large broadcast variables.
It's already enabled by default. I added benchmark results in the description.
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Jenkins, retest this please.
QA tests have started for PR 2259 at commit
Tests timed out after a configured wait of
Jenkins, retest this please.
QA tests have started for PR 2259 at commit
Tests timed out after a configured wait of
You guys should time out the worker after some time period to avoid it always consuming resources. If we have that, I think it should be on by default -- in general it's best to minimize the number of different run configurations. However we may need to add a setting to keep the old behavior if some users have code that assumes the worker will shut down.
@mateiz It will time out the worker after 1 minute. Workers are reused by default; this can be disabled by setting 'spark.python.worker.reuse = false', in which case the worker is shut down immediately after a task completes.
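For reference, a minimal sketch of opting out of worker reuse from a PySpark application. It assumes only the spark.python.worker.reuse property discussed above; the application name and the rest of the setup are ordinary PySpark boilerplate:

```python
# Minimal sketch: disable Python worker reuse for this application.
# The property name comes from this PR; everything else is standard PySpark setup.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("no-worker-reuse")
        .set("spark.python.worker.reuse", "false"))  # restore fork-per-task behavior
sc = SparkContext(conf=conf)
```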
QA tests have started for PR 2259 at commit
Tests timed out after a configured wait of
Tests timed out after a configured wait of
Hmm, I wonder why we're seeing these timeouts. It looks like both tests failed in
Yeah, I will investigate it locally.
Jenkins, retest this please.
@JoshRosen The problem that caused the hanging has been fixed.
QA tests have started for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Conflicts: python/pyspark/serializers.py
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
This looks good to me; merging it into master now. I wonder if we'll see a net reduction in Jenkins flakiness due to using significantly fewer ephemeral ports in PySpark after this patch...
Yeah, the bad diffs are especially weird.
Reuse Python workers to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts sent to each worker, to avoid sending the same broadcast repeatedly.
This can reduce the time for a dummy task from 22ms to 13ms (-40%), which helps lower the latency of Spark Streaming.
For a job with a broadcast variable (43 MB after compression):
It finishes in 281s without worker reuse and in 65s with worker reuse (4 CPUs). Reusing the worker saves about 9 seconds per task that would otherwise be spent transferring and deserializing the broadcast.
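As a rough illustration only (not the benchmark behind the numbers above), a dummy-task timing like the following could be run with spark.python.worker.reuse toggled to compare the two modes; the master URL, application name, and loop count are arbitrary choices:

```python
# Sketch of a dummy-task micro-benchmark; set spark.python.worker.reuse to
# "false" and rerun to compare against the fork-per-task behavior.
import time
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("worker-reuse-timing")
        .set("spark.python.worker.reuse", "true"))
sc = SparkContext(conf=conf)

start = time.time()
for _ in range(100):
    sc.parallelize([0], 1).count()  # one trivial task per job
print("average per dummy job: %.1f ms" % ((time.time() - start) / 100 * 1000))

sc.stop()
```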
It's enabled by default and can be disabled by setting spark.python.worker.reuse = false.
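To make the mechanism described above concrete, here is a hypothetical, heavily simplified sketch (not Spark's actual implementation): keep idle Python workers alive between tasks instead of forking a new process per task, and remember which broadcast ids each worker has already received so they are not re-sent.

```python
# Hypothetical illustration only: a pool that reuses Python workers and
# tracks which broadcast ids each worker has already seen.
class ReusableWorkerPool:
    def __init__(self):
        self.idle_workers = []        # workers kept alive between tasks
        self.seen_broadcasts = {}     # worker -> set of broadcast ids

    def take_worker(self):
        # Reuse an idle worker if one is available; otherwise launch a new one.
        if self.idle_workers:
            return self.idle_workers.pop()
        return self._spawn_worker()

    def release_worker(self, worker):
        # After a task, keep the worker instead of shutting it down;
        # an idle timeout (about 1 minute in this PR) would evict it later.
        self.idle_workers.append(worker)

    def broadcasts_to_send(self, worker, broadcast_ids):
        # Only ship broadcasts this worker has not received before.
        seen = self.seen_broadcasts.setdefault(worker, set())
        new_ids = [b for b in broadcast_ids if b not in seen]
        seen.update(new_ids)
        return new_ids

    def _spawn_worker(self):
        raise NotImplementedError("placeholder for launching a Python worker")
```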