[SPARK-3030] [PySpark] Reuse Python worker #2259
Conversation
@@ -50,7 +50,7 @@ echo "Running PySpark tests. Output is in python/unit-tests.log."

 # Try to test with Python 2.6, since that's the minimum version that we support:
 if [ $(which python2.6) ]; then
-    export PYSPARK_PYTHON="python2.6"
+    export PYSPARK_PYTHON="pypy"
Looks like this change got pulled in by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes:)
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Jenkins, retest this please.
Jenkins, test this please.
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Jenkins, retest this please.
Jenkins, retest this please.
Do you think worker re-use should be enabled by default? The only problem that I anticipate is for applications that share a single SparkContext with both Python and Scala processes; in these cases, the Python tasks may continue to hog resources (memory that's not used for caching RDDs) even after they complete. This seems like a rare use-case, though, so we could document this change and advise those users to disable this setting. I'm inclined to have it on by default, since it will be a huge performance win for the vast majority of PySpark users.
It would be interesting to measure the end-to-end performance impact for more realistic jobs, especially ones that make use of large numbers of tasks and large broadcast variables.
It's already enabled by default. I added benchmark results in the description.
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Jenkins, retest this please.
QA tests have started for PR 2259 at commit
Tests timed out after a configured wait of
Jenkins, retest this please.
QA tests have started for PR 2259 at commit
Tests timed out after a configured wait of
You guys should time out the worker after some time period to avoid it always consuming resources. If we have that, I think it should be on by default -- in general it's best to minimize the number of different run configurations. However we may need to add a setting to keep the old behavior if some users have code that assumes the worker will shut down.
@mateiz It will time out the worker after 1 minute. Workers are reused by default; this can be disabled by setting 'spark.python.worker.reuse = false', in which case the worker is shut down immediately after a task completes.
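For reference, a minimal sketch of opting out of worker reuse from a PySpark application. It assumes only the spark.python.worker.reuse property discussed above; the application name and the rest of the setup are ordinary PySpark boilerplate:

```python
# Minimal sketch: disable Python worker reuse for this application.
# The property name comes from this PR; everything else is standard PySpark setup.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("no-worker-reuse")
        .set("spark.python.worker.reuse", "false"))  # restore fork-per-task behavior
sc = SparkContext(conf=conf)
```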
QA tests have started for PR 2259 at commit
Tests timed out after a configured wait of
Tests timed out after a configured wait of
Hmm, I wonder why we're seeing these timeouts. It looks like both tests failed in
Yeah, I will investigate it locally.
Jenkins, retest this please.
@JoshRosen The problem that caused the hanging has been fixed.
QA tests have started for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
QA tests have finished for PR 2259 at commit
Conflicts: python/pyspark/serializers.py
QA tests have started for PR 2259 at commit
QA tests have finished for PR 2259 at commit
This looks good to me; merging it into master now. I wonder if we'll see a net reduction in Jenkins flakiness due to using significantly fewer ephemeral ports in PySpark after this patch...
Yeah, the bad diffs are especially weird.
Reuse Python workers to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts sent to each worker, to avoid sending the same broadcast repeatedly.
This can reduce the time for a dummy task from 22ms to 13ms (-40%), which helps lower the latency of Spark Streaming.
For a job with a broadcast variable (43 MB after compression):
It finishes in 281s without worker reuse and in 65s with worker reuse (4 CPUs). Reusing the worker saves about 9 seconds per task that would otherwise be spent transferring and deserializing the broadcast.
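As a rough illustration only (not the benchmark behind the numbers above), a dummy-task timing like the following could be run with spark.python.worker.reuse toggled to compare the two modes; the master URL, application name, and loop count are arbitrary choices:

```python
# Sketch of a dummy-task micro-benchmark; set spark.python.worker.reuse to
# "false" and rerun to compare against the fork-per-task behavior.
import time
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("worker-reuse-timing")
        .set("spark.python.worker.reuse", "true"))
sc = SparkContext(conf=conf)

start = time.time()
for _ in range(100):
    sc.parallelize([0], 1).count()  # one trivial task per job
print("average per dummy job: %.1f ms" % ((time.time() - start) / 100 * 1000))

sc.stop()
```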
It's enabled by default and can be disabled by setting spark.python.worker.reuse = false.
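To make the mechanism described above concrete, here is a hypothetical, heavily simplified sketch (not Spark's actual implementation): keep idle Python workers alive between tasks instead of forking a new process per task, and remember which broadcast ids each worker has already received so they are not re-sent.

```python
# Hypothetical illustration only: a pool that reuses Python workers and
# tracks which broadcast ids each worker has already seen.
class ReusableWorkerPool:
    def __init__(self):
        self.idle_workers = []        # workers kept alive between tasks
        self.seen_broadcasts = {}     # worker -> set of broadcast ids

    def take_worker(self):
        # Reuse an idle worker if one is available; otherwise launch a new one.
        if self.idle_workers:
            return self.idle_workers.pop()
        return self._spawn_worker()

    def release_worker(self, worker):
        # After a task, keep the worker instead of shutting it down;
        # an idle timeout (about 1 minute in this PR) would evict it later.
        self.idle_workers.append(worker)

    def broadcasts_to_send(self, worker, broadcast_ids):
        # Only ship broadcasts this worker has not received before.
        seen = self.seen_broadcasts.setdefault(worker, set())
        new_ids = [b for b in broadcast_ids if b not in seen]
        seen.update(new_ids)
        return new_ids

    def _spawn_worker(self):
        raise NotImplementedError("placeholder for launching a Python worker")
```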