-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[POC][K8S][INFRA] Enable resource limited Spark on K8S integration tests in Github Action #35830
Conversation
cc @dongjoon-hyun @HyukjinKwon @holdenk @yaooqinn @attilapiros WDYT? Do you think this is a work that can go on? |
As expected. |
@@ -191,6 +191,8 @@ class KubernetesSuite extends SparkFunSuite | |||
.set("spark.kubernetes.driver.pod.name", driverPodName) | |||
.set("spark.kubernetes.driver.label.spark-app-locator", appLocator) | |||
.set("spark.kubernetes.executor.label.spark-app-locator", appLocator) | |||
.set("spark.kubernetes.executor.request.cores", "0.5") | |||
.set("spark.kubernetes.driver.request.cores", "0.5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is only for tests, I'd like not to recommend this way. The driver has many threads and the flakiness is increasing when the driver CPU are capped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making a PR. Although I understand your intention, I'm a little negative for this approach for now.
https://github.com/Yikun/spark/runs/5525624508?check_suite_focus=true With driver 0.5 CPU and executors 0.2 CPU only SparkRemoteFileTest failed. Kubernetes is using 0.85 CPU 0.25G memeory, so only 1.15 CPU left, we can give 0.5 for driver, 0.2 for others executor (3 is max limited). |
@Yikun . Fine tuning is just to aim to launch the tests, not for guaranteeing the stability. The minimum requirement of CI is stability. As you see, we are already hitting many flakiness issues in GitHub Action due to the memory issues and the committers are manually mitigating them by re-triggering . Unless we can get a bigger VM machine, this addition will add another complexity and instability to Apache Spark GitHub Action.
|
@dongjoon-hyun Thanks for your comments, this is only an initial exploring and share to enable K8S github action IT, I also think it is very important to ensure stability if we want to get this merged. |
If ASF allows the minikube job, we may revisit this and your Volcano test resource reducing PR at Apache Spark 3.4 timeline with a longer monitoring period. |
69c4993
to
65815bf
Compare
I think recent k8s ci breaks have proved that this CI is useful, it helps found 2 [1] [2] master bugs within 3 days, and also help some on regression test. @dongjoon-hyun I also contact with ASF, minikube is allowed in Apache projects, see: INFRA-23000. I also update the PR description to show demo |
cc @dongjoon-hyun @HyukjinKwon @holdenk @martin-g Would you mind take a look? I think now it's ready for review. |
@dongjoon-hyun Sure, next 3 days is day off in China, but I will try to some time address this. btw, other concerns are
|
Sure. This isn't urgent. Take your time. We didn't decide this main PR about new CI. We will discuss new spinoff PRs first. |
6dc006c
to
ef07343
Compare
Split two PRs as independent ones.
also enable minio test with |
…ver|executor)RequestCores` ### What changes were proposed in this pull request? This patch adds support for `spark.kubernetes.test.(driver|executor)RequestCores`, this help devs set specific cores info for (driver|executor)RequestCores. ### Why are the changes needed? In some cases (such as resource limited case), we want to set specifc `RequestCores`. See also: #35830 (review) ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? IT passed Closes #36087 from Yikun/SPARK-38802. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…s IT ### What changes were proposed in this pull request? This PR aims to set minio request cpu to `250m` (0.25). - This value also recommand in [link](https://docs.gitlab.com/charts/charts/minio/#installation-command-line-options). - There are [no cpu request limitation](https://github.com/minio/minio/blob/a3e317773a2b90a433136e1ff2a8394bc5017c75/helm/minio/values.yaml#L251) on current minio. ### Why are the changes needed? In some cases (such as resource limited case), we reduce request cpu of minio. See also: #35830 (review) ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? IT passsed Closes #36096 from Yikun/minioRequestCores. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
Here is the complete spark on k8s test on Github Action and all test passed (also include volcano IT): It depends two patches:
|
Sorry for being late revisiting. Since we start to discuss Apache Spark 3.4 planning, shall we restart this for Apache Spark 3.4, @Yikun ? |
Also, cc @HyukjinKwon , too |
@dongjoon-hyun Thanks for reminder! It's definate YES for me. I will revisit it soon to make it work with latest CI infra. Given that we already have the docker image as part of the spark release, there is no doubt that it will not only help spark on k8s and also the docker image stable. |
To avoid the redundant of historical information, I submitted a new PR: #37244 |
…ver|executor)RequestCores` ### What changes were proposed in this pull request? This patch adds support for `spark.kubernetes.test.(driver|executor)RequestCores`, this help devs set specific cores info for (driver|executor)RequestCores. ### Why are the changes needed? In some cases (such as resource limited case), we want to set specifc `RequestCores`. See also: apache#35830 (review) ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? IT passed Closes apache#36087 from Yikun/SPARK-38802. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…ver|executor)RequestCores` ### What changes were proposed in this pull request? This patch adds support for `spark.kubernetes.test.(driver|executor)RequestCores`, this help devs set specific cores info for (driver|executor)RequestCores. ### Why are the changes needed? In some cases (such as resource limited case), we want to set specifc `RequestCores`. See also: #35830 (review) ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? IT passed Closes #36087 from Yikun/SPARK-38802. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 8396382) Signed-off-by: Dongjoon Hyun <[email protected]>
…s IT ### What changes were proposed in this pull request? This PR aims to set minio request cpu to `250m` (0.25). - This value also recommand in [link](https://docs.gitlab.com/charts/charts/minio/#installation-command-line-options). - There are [no cpu request limitation](https://github.com/minio/minio/blob/a3e317773a2b90a433136e1ff2a8394bc5017c75/helm/minio/values.yaml#L251) on current minio. ### Why are the changes needed? In some cases (such as resource limited case), we reduce request cpu of minio. See also: #35830 (review) ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? IT passsed Closes #36096 from Yikun/minioRequestCores. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 5ea2b38) Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Enable K8S integration tests in Github Action.
Note that this PR didn't add Volcano test due to limited resource of github action.
For current implementations, we enable 2 style k8s ci:
on push: validation on each commits pushed (PR tirgger) and merged (commits post validation): test push demo
workflow_dispatch for manually PR/Branch/tags validation:
Why are the changes needed?
This will also improve the efficiency of K8S development and guarantee the quality of spark on K8S in some level.
Why setting driver 0.5 cpu, executor 0.2 cpu?
Github-hosted runner hardware limited: 2U7G
IT Job available CPU = 2U - 0.85U(K8S deploy) = 1.15U
There are 1.15 cpu left after k8s installation, to meet the requirement of K8S tests (one driver + max to 3 executors).
Time cost info:
Stablity:
I also do a stablity validation based on scheduled job:
https://github.com/spark-volcano/spark/actions/workflows/k8s_integration_test.yml
All test passed always passed at each half hours since enable.
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI passed