[SPARK-39006][K8S] Show a directional error message for executor PVC dynamic allocation failure #36374

Closed
wants to merge 2 commits

Conversation

@dcoliversun (Contributor)

What changes were proposed in this pull request?

This PR aims to show a directional error message for executor PVC dynamic allocation failure.

Why are the changes needed?

#29846 supports dynamic PVC creation/deletion for K8s executors.
#29557 supports the execId placeholder in the executor PVC conf.
If spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName is set neither to OnDemand nor to a value containing SPARK_EXECUTOR_ID, every executor requests the same PVC, and Spark keeps retrying executor pod creation.
After this PR, Spark shows a directional error message for this situation:

ERROR ExecutorPodsSnapshotsStoreImpl: Going to stop due to IllegalArgumentException
java.lang.IllegalArgumentException: PVC ClaimName should contain OnDemand or SPARK_EXECUTOR_ID when multiple executors are required
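
For context, a per-executor PVC setup that avoids this error can be sketched as below. This is a minimal example using the documented Spark-on-K8s volume options; the volume name spark-local-dir-1, the storage class, size, and mount path are illustrative values, not part of this PR:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.executor.instances", "2")
  // The OnDemand placeholder makes Spark create one PVC per executor instead of reusing a single claim.
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName", "OnDemand")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass", "standard")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit", "10Gi")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path", "/data")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly", "false")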

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add unit test.

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun (Member) left a comment

Thank you for changing your direction. This is much better.

@@ -120,6 +122,20 @@ private[spark] class MountVolumesFeatureStep(conf: KubernetesConf)
  override def getAdditionalKubernetesResources(): Seq[HasMetadata] = {
    additionalResources.toSeq
  }

  private def checkPVCClaimNameWhenMultiExecutors(claimName: String): Unit = {
@dongjoon-hyun (Member)

checkPVCClaimNameWhenMultiExecutors -> checkPVCClaimName, because this PR checks all cases (not only WhenMultiExecutors) and doesn't raise an exception for the single-instance case.

@dcoliversun (Contributor, Author)

Changed it to checkPVCClaimName :)

  private def checkPVCClaimNameWhenMultiExecutors(claimName: String): Unit = {
    val invalidClaimName =
      if (!claimName.contains(PVC_ON_DEMAND) && !claimName.contains(ENV_EXECUTOR_ID)) true
      else false
@dongjoon-hyun (Member), Apr 28, 2022

This is misleading because this is totally valid when there is only one executor.

@dcoliversun (Contributor, Author)

Done, it is now checked only when there are multiple executors.

      else false

    val executorInstances = conf.get(EXECUTOR_INSTANCES)
    if (executorInstances.isEmpty) return
@dongjoon-hyun (Member)

Please try to avoid return in Scala methods.

@dcoliversun (Contributor, Author)

Yeah, removed the return.

    if (invalidClaimName && executorInstances.get > 1) {
      throw new IllegalArgumentException("PVC ClaimName should contain " +
        PVC_ON_DEMAND + " or " + ENV_EXECUTOR_ID +
        " when multiple executors are required")
@dongjoon-hyun (Member)

The error message looks informative but please include the current claim name in the error message too.

@dcoliversun (Contributor, Author)

I have added the current claim name to the message.
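
Putting the review feedback together (renamed helper, no early return, claim name included in the message), the check inside MountVolumesFeatureStep could end up roughly like the sketch below. PVC_ON_DEMAND, ENV_EXECUTOR_ID, and EXECUTOR_INSTANCES are the existing constants referenced in the diff; the exact merged wording may differ:

  private def checkPVCClaimName(claimName: String): Unit = {
    // A fixed claim name is only a problem when several executors would have to share it.
    conf.get(EXECUTOR_INSTANCES).foreach { instances =>
      if (instances > 1 &&
          !claimName.contains(PVC_ON_DEMAND) &&
          !claimName.contains(ENV_EXECUTOR_ID)) {
        throw new IllegalArgumentException(s"PVC ClaimName: $claimName should contain " +
          s"$PVC_ON_DEMAND or $ENV_EXECUTOR_ID when multiple executors are required")
      }
    }
  }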

@@ -148,6 +149,26 @@ class MountVolumesFeatureStepSuite extends SparkFunSuite {
    assert(executorPVC.getClaimName.endsWith("-exec-1-pvc-0"))
  }

  test("SPARK-39006 Show a directional error message for PVC Dynamic Allocation Failure") {
@dongjoon-hyun (Member)

  1. We need a colon, e.g., SPARK-39006 -> SPARK-39006:.
  2. After this PR, we no longer have a PVC Dynamic Allocation Failure. Please revise the test case.

@dcoliversun (Contributor, Author)

I made two changes here:

  • Added : after SPARK-39006
  • Renamed the test case to Check PVC ClaimName

      mountReadOnly = true,
      KubernetesPVCVolumeConf("testClaimName")
    )
    val conf = new SparkConf().set(EXECUTOR_INSTANCES, 2)
@dongjoon-hyun (Member)

Please add a positive test case where the executor instance count is 1.

@dcoliversun (Contributor, Author)

Fine, I added a positive test here.
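
For reference, the resulting negative/positive pair could look roughly like the sketch below, modeled on the existing cases in MountVolumesFeatureStepSuite and relying on that suite's imports; the KubernetesTestConf helper arguments and the merged test body may differ slightly:

  test("SPARK-39006: Check PVC ClaimName") {
    val volumeConf = KubernetesVolumeSpec(
      "testVolume",
      "/tmp",
      "",
      mountReadOnly = true,
      KubernetesPVCVolumeConf("testClaimName"))

    // Negative case: a fixed claim name with multiple executors should fail fast.
    val multiExecConf = KubernetesTestConf.createExecutorConf(
      sparkConf = new SparkConf().set(EXECUTOR_INSTANCES, 2),
      volumes = Seq(volumeConf))
    intercept[IllegalArgumentException] {
      new MountVolumesFeatureStep(multiExecConf).configurePod(SparkPod.initialPod())
    }

    // Positive case: a single executor may keep a fixed claim name.
    val singleExecConf = KubernetesTestConf.createExecutorConf(
      sparkConf = new SparkConf().set(EXECUTOR_INSTANCES, 1),
      volumes = Seq(volumeConf))
    new MountVolumesFeatureStep(singleExecConf).configurePod(SparkPod.initialPod())
  }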

@dongjoon-hyun (Member) left a comment

I left a few comments. Thanks, @dcoliversun.

@dcoliversun (Contributor, Author)

@dongjoon-hyun Thanks for your comments; I have updated the code. Please review again.

@dcoliversun requested a review from dongjoon-hyun on May 4, 2022 02:12
@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @dcoliversun.
Merged to master for Apache Spark 3.4.

@dcoliversun (Contributor, Author)

Thank you @dongjoon-hyun

@0xdarkman

Can we expect the fix to be in the next Spark version release, 3.4.1?

@dongjoon-hyun (Member)

> Can we expect the fix to be in the next Spark version release, 3.4.1?

Could you be more specific, @0xdarkman? Which JIRA do you want for Apache Spark 3.4.1? You can check the Apache Spark 3.4.1 RC1 vote for the details.

@0xdarkman commented Jun 21, 2023

> Can we expect the fix to be in the next Spark version release, 3.4.1?
>
> Could you be more specific, @0xdarkman? Which JIRA do you want for Apache Spark 3.4.1? You can check the Apache Spark 3.4.1 RC1 vote for the details.

I have been migrating from Spark 3.3.2 and deployed Spark 3.4.0, and I hit an error with PVC ReadWriteMany access; that is how I came across this issue.
I understand I cannot use Spark 3.4.0 because the change was reverted, so I can either stay on Spark 3.3.2 or wait for Spark 3.4.1 in order to use a PVC with ReadWriteMany access when multiple executors are used.

Whether it is a fix, a reverted change, or something else does not matter to me, as long as I can use the new Spark version with PVC access behaving the way it did prior to 3.4.0.

@dongjoon-hyun (Member) commented Jun 22, 2023

Could you try Apache Spark 3.4.1 in advance? The RC1 binary is already there and the vote will finish in 24 hours.

If you can participate in the community vote, that would be great, @0xdarkman.

@0xdarkman

I am running version 3.4.1 and it looks good.

@dongjoon-hyun (Member)

Thank you for confirming. Apache Spark 3.4.1 is released officially.
