Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35303][PYTHON] Enable pinned thread mode by default #32429

Closed
wants to merge 1 commit into from

Conversation

HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

PySpark added pinned thread mode at #24898 to sync Python thread to JVM thread. Previously, one JVM thread could be reused which ends up with messed inheritance hierarchy such as thread local especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default.

Why are the changes needed?

To correctly support parallel job submission and management.

Does this PR introduce any user-facing change?

Yes, now Python thread is mapped to JVM thread one to one.

How was this patch tested?

Existing tests should cover it.

@HyukjinKwon HyukjinKwon marked this pull request as draft May 4, 2021 04:21
@HyukjinKwon
Copy link
Member Author

There are couple of todos such as updating migration guide so I marked it as a draft. I will take a look more and see if there are potential side effects to warn users.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable for Apache Spark 3.2.0. I'll look forward to seeing the migration guide.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@HyukjinKwon
Copy link
Member Author

Thanks @srowen and @dongjoon-hyun. I will update the migration guide soon.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-35303 branch 2 times, most recently from 2c2c519 to 59a13dd Compare June 17, 2021 05:27
@SparkQA

This comment has been minimized.

@HyukjinKwon HyukjinKwon marked this pull request as ready for review June 17, 2021 05:48
@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-35303 branch 2 times, most recently from a9d3926 to ab7f2f6 Compare June 17, 2021 07:19
@HyukjinKwon
Copy link
Member Author

This PR is ready for a review. cc @WeichenXu123 too FYI

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44451/

@SparkQA
Copy link

SparkQA commented Jun 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44451/

@SparkQA
Copy link

SparkQA commented Jun 17, 2021

Test build #139924 has finished for PR 32429 at commit 1fa54fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Okay, if the users are using a ThreadPool, they won't get affected a lot by this change since the threads are reused. With this change, it could only affect the users who lunches Spark jobs in a plain thread a lot, which is discouraged in practice anyway.

Let me merge this and try it out. The change here is correct in principle.

@HyukjinKwon
Copy link
Member Author

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jun 20, 2021
…hen starting the thread, and use inheritable thread in the current codebase

### What changes were proposed in this pull request?

This PR is a followup of #32429 and #32644.
I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.

This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at #32429)
- Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644).
- #32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.

### Why are the changes needed?

To mimic the JVM behaviour about thread lifecycle

### Does this PR introduce _any_ user-facing change?

Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled.
In this case, the local properties will be copied when starting the thread instead of defining the `InheritableThread` object.
This is a small difference that wouldn't likely affect end users.

### How was this patch tested?

Existing tests should cover this.

Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-35303 branch January 4, 2022 00:53
@pratyush-prateek
Copy link

@HyukjinKwon So pinned thread mode is enabled by default in spark 3.2.0 onwards? The only thing users need to do if they need to spawn threads, is use InheritableThread API, can we say that?

Okay, if the users are using a ThreadPool, they won't get affected a lot by this change since the threads are reused. With this change, it could only affect the users who lunches Spark jobs in a plain thread a lot, which is discouraged in practice anyway.

Let me merge this and try it out. The change here is correct in principle.

Also, how using a ThreadPool wont' affect? I am guessing you are talking about multiprocessing.pool.Threadpool module.

@HyukjinKwon
Copy link
Member Author

Yes.

For ThreadPool, you should wrap your function with inheritable_thread_target

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants