[SPARK-35303][PYTHON] Enable pinned thread mode by default #32429

HyukjinKwon · 2021-05-04T04:21:44Z

What changes were proposed in this pull request?

PySpark added pinned thread mode in #24898 to keep each Python thread in sync with a JVM thread. Previously, a single JVM thread could be reused across Python threads, which breaks the inheritance of per-thread state such as thread-locals, especially when multiple jobs run in parallel. To fix this completely, we should enable this mode by default.
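To illustrate the kind of problem thread reuse causes, here is a minimal pure-Python sketch. It does not touch Spark or Py4J: the `state` object and the `job_group` attribute are hypothetical stand-ins for JVM-side per-thread state, and a one-worker pool stands in for a reused JVM thread.

```python
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for per-thread state (e.g. Spark local properties).
state = threading.local()


def set_job_group(name):
    """Store a value in the current thread's local state."""
    state.job_group = name
    return threading.get_ident()


def read_job_group():
    """Return whatever the current thread's local state holds, if anything."""
    return getattr(state, "job_group", None), threading.get_ident()


# A pool with a single worker forces both tasks onto the same reused thread.
with ThreadPool(processes=1) as pool:
    writer_thread = pool.apply(set_job_group, ("job-A",))
    leaked_value, reader_thread = pool.apply(read_job_group)

# A brand-new thread, by contrast, starts with clean thread-local state.
fresh = {}
t = threading.Thread(target=lambda: fresh.update(value=read_job_group()[0]))
t.start()
t.join()
```

The second task never set `job_group`, yet it observes the value left behind by the first task because both ran on the same thread. Mapping each Python thread to its own JVM thread is meant to rule out this kind of cross-talk on the JVM side.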

Why are the changes needed?

To correctly support parallel job submission and management.

Does this PR introduce any user-facing change?

Yes, each Python thread is now mapped one-to-one to a JVM thread.

How was this patch tested?

Existing tests should cover it.

HyukjinKwon · 2021-05-04T04:22:43Z

There are a couple of TODOs, such as updating the migration guide, so I marked this as a draft. I will take a closer look and see whether there are potential side effects to warn users about.

dongjoon-hyun

Looks reasonable for Apache Spark 3.2.0. I'll look forward to seeing the migration guide.

HyukjinKwon · 2021-06-17T04:06:43Z

Thanks @srowen and @dongjoon-hyun. I will update the migration guide soon.

HyukjinKwon · 2021-06-17T07:19:44Z

This PR is ready for a review. cc @WeichenXu123 too FYI

HyukjinKwon · 2021-06-17T13:43:21Z

retest this please

SparkQA · 2021-06-17T14:59:13Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44451/

SparkQA · 2021-06-17T15:08:25Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44451/

SparkQA · 2021-06-17T16:56:55Z

Test build #139924 has finished for PR 32429 at commit 1fa54fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-06-18T02:59:21Z

Okay, users who use a ThreadPool won't be affected much by this change, since the threads are reused. This change could only affect users who frequently launch Spark jobs in plain threads, which is discouraged in practice anyway.

Let me merge this and try it out. The change here is correct in principle.

HyukjinKwon · 2021-06-18T03:02:22Z

Merged to master.

…hen starting the thread, and use inheritable thread in the current codebase

### What changes were proposed in this pull request?

This PR is a followup of #32429 and #32644. I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.

This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at #32429)
- Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644).
- #32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.

### Why are the changes needed?

To mimic the JVM behaviour about thread lifecycle

### Does this PR introduce _any_ user-facing change?

Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled. In this case, the local properties will be copied when starting the thread instead of defining the `InheritableThread` object. This is a small difference that wouldn't likely affect end users.

### How was this patch tested?

Existing tests should cover this.

Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
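The difference between copying local properties at construction and copying them when `start` is called can be sketched in plain Python. This is a simulation under stated assumptions, not PySpark's implementation: `CopyOnStartThread`, `get_props`, and `set_prop` are hypothetical names, and a `threading.local` dictionary stands in for the JVM's local properties.

```python
import threading

# Hypothetical stand-in for JVM-side local properties, stored per thread.
_props = threading.local()


def get_props():
    """Return a copy of the current thread's properties."""
    return dict(getattr(_props, "values", {}))


def set_prop(key, value):
    """Set one property for the current thread only."""
    values = get_props()
    values[key] = value
    _props.values = values


class CopyOnStartThread(threading.Thread):
    """Copies the parent's properties when start() is called, not in __init__."""

    def __init__(self, target, args=()):
        super().__init__()
        self._user_target = target
        self._user_args = args
        self._inherited = None

    def start(self):
        # start() runs in the parent thread, so this snapshots the parent's state.
        self._inherited = get_props()
        super().start()

    def run(self):
        # run() executes in the child thread: install the snapshot, then call the target.
        _props.values = dict(self._inherited)
        self._user_target(*self._user_args)


seen = {}
set_prop("group", "before-init")
t = CopyOnStartThread(target=lambda: seen.update(get_props()))
set_prop("group", "before-start")  # changed after construction, before start()
t.start()
t.join()
```

Here the child sees `before-start`, the value in effect when `start()` was called. A copy-at-construction variant would instead have captured `before-init`.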

pratyush-prateek · 2024-02-09T06:09:49Z

@HyukjinKwon So pinned thread mode is enabled by default from Spark 3.2.0 onwards? Can we say that the only thing users need to do when they spawn threads is to use the InheritableThread API?

> Okay, if the users are using a ThreadPool, they won't get affected a lot by this change since the threads are reused. With this change, it could only affect the users who launch Spark jobs in a plain thread a lot, which is discouraged in practice anyway.
>
> Let me merge this and try it out. The change here is correct in principle.

Also, why wouldn't users of a ThreadPool be affected? I am guessing you are talking about multiprocessing.pool.ThreadPool.

HyukjinKwon · 2024-02-09T06:53:16Z

Yes.

For ThreadPool, you should wrap your function with inheritable_thread_target.
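A rough pure-Python sketch of what wrapping a pool function achieves is shown below. It is an illustration only and does not use PySpark: `inheriting_target` and `current_group` are hypothetical names, and a `threading.local` dictionary stands in for Spark local properties. The wrapper captures the caller's properties when the function is wrapped and installs them in whichever pool thread runs it.

```python
import functools
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for Spark local properties, stored per thread.
_props = threading.local()


def inheriting_target(f):
    """Capture the caller's properties now; install them in the thread that runs f."""
    captured = dict(getattr(_props, "values", {}))

    @functools.wraps(f)
    def wrapped(*args, **kwargs):
        _props.values = dict(captured)
        return f(*args, **kwargs)

    return wrapped


def current_group(_item):
    """Report the 'group' property visible to the thread running this task."""
    return getattr(_props, "values", {}).get("group")


# Set a property in the submitting (main) thread only.
_props.values = {"group": "nightly-report"}

with ThreadPool(processes=2) as pool:
    # Unwrapped: pool threads know nothing about the submitting thread's state.
    unwrapped = pool.map(current_group, range(4))
    # Wrapped: each task first installs the captured properties.
    wrapped = pool.map(inheriting_target(current_group), range(4))
```

With PySpark itself, the equivalent call would presumably be `pool.map(inheritable_thread_target(f), items)`, with `inheritable_thread_target` imported from `pyspark`. That usage is inferred from the comment above rather than verified against a specific Spark release.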

HyukjinKwon marked this pull request as draft May 4, 2021 04:21

github-actions bot added CORE DOCS PYTHON labels May 4, 2021

dongjoon-hyun reviewed May 4, 2021
