[SPARK-27812][K8S] Bump K8S client version to 4.6.1 #26093
Conversation
Thank you for making a PR, @igorcalabria.
Currently, our testing Jenkins server is down. I'll review this PR after Jenkins is back.
@igorcalabria thank you for writing this patch. I have run this internally on my cluster and can confirm that non-daemon threads no longer block the JVM from exiting. Against branch 2.4 and master: I see the following non-daemon threads exist when running [...]
After the python process exited, the JVM process remained and I see the following using [...]
Against this patch: I no longer see the following non-daemon threads, and the JVM is able to exit after the completion of the python process. @mccheah @erikerlandson @felixcheung @holdenk we should backport this fix to branch 2.4 as well and cut a new release with this patch.
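For anyone reproducing this check, a minimal sketch of one way to inspect the lingering threads (assuming the driver JVM is local; the pgrep pattern is illustrative):

```bash
# Dump the lingering driver JVM's threads and look for the non-daemon
# OkHttp WebSocket threads described above.
jstack "$(pgrep -f SparkSubmit | head -n 1)" | grep "OkHttp WebSocket"
```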
Changes regarding consistency of the kubernetes-client version, and the okhttp bump as well.
```diff
@@ -29,7 +29,7 @@
   <name>Spark Project Kubernetes</name>
   <properties>
     <sbt.project.name>kubernetes</sbt.project.name>
-    <kubernetes.client.version>4.4.2</kubernetes.client.version>
+    <kubernetes.client.version>4.6.0</kubernetes.client.version>
```
You need to change the kubernetes.client.version in resource-managers/kubernetes/integration-tests/pom.xml as well (a sketch follows this comment), and bump:

```diff
 <dependency>
   <groupId>com.squareup.okhttp3</groupId>
   <artifactId>okhttp</artifactId>
-  <version>3.8.1</version>
+  <version>3.12.0</version>
 </dependency>
```

as I believe there is a dependency here: https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.6.0
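For concreteness, the matching property change in the integration-tests pom might look like this (a sketch; surrounding elements elided):

```xml
<!-- resource-managers/kubernetes/integration-tests/pom.xml (sketch) -->
<properties>
  <kubernetes.client.version>4.6.0</kubernetes.client.version>
</properties>
```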
Ok. Is there a reason why okhttp is added as an explicit dependency (kubernetes-client already brings it)? If we really need a specific version, shouldn't we at least exclude it from kubernetes-client? I'm saying this because it's very easy to end up with an unexpected version on the classpath. kubernetes-client has used okhttp 3.12 since version 4.1.2, but we're only updating it now.
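For reference, a minimal sketch of what such an exclusion could look like (illustrative only, not the change made in this PR; coordinates taken from the snippets above):

```xml
<dependency>
  <groupId>io.fabric8</groupId>
  <artifactId>kubernetes-client</artifactId>
  <version>4.6.0</version>
  <exclusions>
    <!-- Drop the transitive okhttp so the explicitly pinned version wins -->
    <exclusion>
      <groupId>com.squareup.okhttp3</groupId>
      <artifactId>okhttp</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```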
It may have been to make sure that another dependency on okhttp didn't win out and override to a lower version. It can just be updated here, or you can try removing it and examining the transitive dependencies afterwards. If it's already at 3.12.0, then it can likely be omitted.
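One hedged way to examine those transitive dependencies with Maven's dependency plugin (the profile and module path are assumptions about Spark's build):

```bash
# Show which okhttp version is resolved for the K8s module after the change.
./build/mvn -Pkubernetes -pl resource-managers/kubernetes/core -am \
  dependency:tree -Dincludes=com.squareup.okhttp3
```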
That is a good point. @mccheah, do we have an opinion on this? I am okay with removing the explicit dependency since kubernetes-client already brings it in. If there was a reason to state it explicitly (I can't seem to remember one), let's exclude it?
This one looks okay to me.
Agreed, we should remove the explicit dep to avoid a mismatch failure in the future.
ok to test
@igorcalabria, the following claims are not matched by your PR description: [...]
Please describe how to test them for the reviewer. (Technically, this PR needs a test case for that. Or, at least, the manual testing process should be described in the PR description, since it will become the commit log.)
Test build #111979 has finished for PR 26093 at commit
@dongjoon-hyun A manual test is described above by @ifilonenko. If you want, I could add an integration test for this specific case. I skipped it because other PRs updating the kubernetes-client version didn't add any tests, and the README in [...]
this matches the version used by kubernetes-client
Test build #112032 has finished for PR 26093 at commit
If it's easy to write a test for this, OK. If it's complex or would take a long time to run, I can see just verifying the small behavior change with a manual test.
Kubernetes integration test starting
Kubernetes integration test status success
It's pretty difficult to cover all the code paths for changes in a downstream library. @dongjoon-hyun @srowen is there a standard protocol requiring an additional test for downstream dependencies? In this case, we're picking up a fix to an internal implementation detail in the downstream library, and I'd imagine we'd trust the downstream library to have already done its due diligence to test their fix. Regression tests exist to catch new problems introduced by the dependency bumps as we go along. I'd agree that a manual test demonstrating the desired end behavior is sufficient here.
Kubernetes integration test starting
Kubernetes integration test status success
Hi, @igorcalabria and @mccheah.
Test build #112035 has finished for PR 26093 at commit
+1 on my end; I have tested manually. An integration test seems like a bit of overkill, although a simple jstack check in the test would be sufficient.
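A rough sketch of what such an in-test check could look like (a hypothetical helper, not part of this PR):

```scala
import scala.collection.JavaConverters._

// Hypothetical assertion: after the driver finishes, no non-daemon
// "OkHttp WebSocket" threads should remain alive in the JVM.
def assertNoLingeringOkHttpThreads(): Unit = {
  val offenders = Thread.getAllStackTraces.keySet.asScala.filter { t =>
    !t.isDaemon && t.getName.startsWith("OkHttp WebSocket")
  }
  assert(offenders.isEmpty,
    s"Lingering non-daemon threads: ${offenders.map(_.getName).mkString(", ")}")
}
```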
Hi, all.
Upgrading versions is okay, but the problem with this PR is that it claims too much without concrete backing. This PR doesn't describe how to verify the claims in the PR description at all. Since the PR description will become a permanent commit log, we should not merge this PR AS-IS.
I'd recommend removing all the claims from the PR description and narrowing the focus down to the bug fixes inside the client library itself. Then this PR can be considered a proper proposal.
Will there be a 2.4.5?
@erikerlandson I think there should be. This bug will cause pyspark resources to hang unless some external process does resource reaping. @dongjoon-hyun I agree, the PR description should change.
There will be a 2.4.5 eventually, I'm sure. 2.4.x is a sort of "LTS" branch.
Of course, @erikerlandson. As @srowen said, it's the [...] Since [...] Since [...]
@dongjoon-hyun I have updated the description to include a manual test. I believe the fix claims for both JIRA issues are now supported by the manual test case provided in the description. Please let me know if you need anything else. Cheers,
Thank you for updating, @igorcalabria! I'll follow the verification steps and try to land this.
Unfortunately, during review I found that this has already become outdated. Could you update this PR with the latest version, @igorcalabria?
Also, cc @jiangxb1987 since he is a release manager for [...]
I'm +1 on backporting; even though it's a version change, I think we need this in 2.4.x.
LGTM pending removal of the explicit dep & Jenkins.
Thanks for doing the manual verification @ifilonenko :)
Actually, if we're going to update to 4.6.1 (and it looks like we might as well), @ifilonenko would you have the cycles to manually verify with that version as well?
kubernetes-client already brings it and we don't need a specific version
@dongjoon-hyun bumped the version to 4.6.1. Following other reviewers' suggestions, I also removed the explicit okhttp dep.
Looks OK pending tests
Kubernetes integration test starting
Kubernetes integration test status success
Test build #112221 has finished for PR 26093 at commit
Is this tested with [...]
+1, LGTM. Merged to master.
Thank you so much everyone.
### What changes were proposed in this pull request?

Backport of #26093 to `branch-2.4`.

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is better explained in #25785.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This patch was tested manually using a simple pyspark job:

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there are non-daemon threads running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```

This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to. When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.0 is restored and both processes terminate successfully.

Closes #26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Updated kubernetes client.
Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927
We need the fix fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is better explained in #25785.
Does this PR introduce any user-facing change?
Nope, it should be transparent to users
How was this patch tested?
This patch was tested manually using a simple pyspark job:
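```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```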
The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there are non-daemon threads running:
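```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```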
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to. When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.0 is restored and both processes terminate successfully.