
[SDK] improve PVC creation name error #2496

Merged

Conversation

mahdikhashan
Contributor

What this PR does / why we need it:
This PR handles potential name errors for PVCs gracefully.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2491

Checklist:

  • Docs included if any changes are user facing

Contributor

@helenxie-bit helenxie-bit left a comment

Thanks for the contribution! Basically LGTM, just a small comment.

# RFC 1123 regex for valid PVC names: lowercase alphanumeric, '-', or '.'.
return bool(
    re.match(
        r"^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9\-]*[a-z0-9])?)*$", name
Contributor

Would using the same regex format as shown in the error message improve readability and maintainability?

Contributor Author

Would you please elaborate on what you mean here?

Contributor

I mean using this regex format in the ValueError message: '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*', since it took me a while to compare whether they stand for the same thing. Or is there a specific reason you changed the format a little bit?
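The comparison described above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the pattern quoted in the error message is unescaped (`\\.` becomes `\.`) and given the same `^`/`$` anchors as the pattern in the diff. The sample names are hypothetical and chosen only for illustration.

```python
import re

# Pattern as written in the diff under review.
PATTERN_IN_CODE = (
    r"^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9\-]*[a-z0-9])?)*$"
)
# Pattern as quoted in the ValueError message, unescaped and anchored
# (the anchors are an assumption added here for a like-for-like comparison).
PATTERN_IN_MESSAGE = (
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)

# Hypothetical sample names, valid and invalid.
SAMPLES = [
    "my-pvc", "a", "a.b-c.d",
    "My-PVC", "-leading", "trailing-", "a..b", "under_score", "",
]


def matches(pattern: str, name: str) -> bool:
    """Return True if `name` matches `pattern` from the start of the string."""
    return re.match(pattern, name) is not None


# Map each sample to (result with code pattern, result with message pattern).
results = {
    name: (matches(PATTERN_IN_CODE, name), matches(PATTERN_IN_MESSAGE, name))
    for name in SAMPLES
}

for name, (in_code, in_message) in results.items():
    print(f"{name!r}: code={in_code} message={in_message}")
```

The two character classes `[a-z0-9\-]` and `[-a-z0-9]` describe the same set, so the two patterns should agree on every input. As a side note, `$` in Python also matches just before a trailing newline, so `re.fullmatch` may be a stricter choice if that matters.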

Contributor Author

OK, thanks for bringing this to my attention. Shall I add a unit test for it? Please let me know.

Contributor Author

Also, there are two failing CI tests. I'm guessing they are flaky tests; could this change have caused them?

Contributor

Since the unit tests for the tune API are still under review, I think you can add your unit test after that PR is merged.

Contributor

@helenxie-bit helenxie-bit Jan 21, 2025

I think the CI test failures are caused by resource problems. I've rerun the tests once, but one of them still failed due to a network connectivity issue. Maybe we can try running them again later.

Contributor Author

thanks for your time and help in this matter.

@andreyvelich andreyvelich added this to the v0.18 milestone Jan 20, 2025
@helenxie-bit
Contributor

/rerun-all

3 similar comments
@helenxie-bit
Contributor

/rerun-all

@mahdikhashan
Contributor Author

/rerun-all

@mahdikhashan
Contributor Author

/rerun-all

@helenxie-bit
Contributor

Thanks for the contribution!

/lgtm
/rerun-all

@mahdikhashan
Contributor Author

> Thanks for the contribution!
>
> /lgtm /rerun-all

Is it fine if I add a few unit tests after the main unit test PR is merged, keeping this PR open until then?

@helenxie-bit
Contributor

/assign @tenzen-y @andreyvelich

@helenxie-bit
Contributor

> > Thanks for the contribution!
> > /lgtm /rerun-all
>
> is it fine if i add a few unit tests after the main unit test pr got merged - keeping this pr open till then?

The main unit test PR is already approved, but it seems its CI run is still in progress. I think we can merge this PR first, and it would be better to open a new PR to add unit tests for this.

@@ -557,6 +557,20 @@ class name in this argument.
# Create PVC for the Storage Initializer.
# TODO (helenxie-bit): PVC Creation should be part of Katib Controller.
try:
    if not utils.is_valid_pvc_name(name):
Member

I am wondering what the goal is of introducing additional validation on top of the default Kubernetes validation.
Are we trying to make this message more user-friendly?
cc @kubeflow/wg-training-leads

Contributor

Yes, I think that's the point.

Contributor Author

@mahdikhashan mahdikhashan Jan 24, 2025

Yes, basically my goal was to make the message more user-friendly. I'm open to suggestions.

Member

Can we handle this error the same way as the 409 case (CRD already exists), so that we don't introduce additional validation?

except Exception as e:
    if hasattr(e, "status") and e.status == 409:
        raise Exception(
            f"A Katib Experiment with the name "
            f"{namespace}/{experiment_name} already exists."
        )

Contributor Author

yes, you are right - i'll change it.

@google-oss-prow google-oss-prow bot removed the lgtm label Jan 27, 2025
@mahdikhashan mahdikhashan force-pushed the improve-pvc-error-message branch from c74b4d6 to 3273e0f Compare January 27, 2025 18:38
@google-oss-prow google-oss-prow bot added size/XXL and removed size/S labels Jan 27, 2025
Signed-off-by: mahdikhashan <[email protected]>
@mahdikhashan
Contributor Author

> Hi @mahdikhashan, do you have time to work on this PR before the RC.0 that we want to cut in 2 days? Otherwise, we can include this feature in RC.1

Done. I kindly ask for your review.

sdk/python/v1beta1/kubeflow/katib/api/katib_client_test.py Outdated
@@ -569,6 +569,11 @@ class name in this argument.
        ),
    )
except Exception as e:
    if hasattr(e, "status") and e.status == 422:
Contributor

I'm thinking we should move this check after this line since the error belongs to that scenario:


Additionally, to make the error easier to understand, we could tweak the error message a bit. How about this:

if hasattr(e, "status") and e.status == 422:
    raise ValueError(
        f"An Experiment with the name {name} is not valid: the name must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
    )
else:
    raise RuntimeError(f"failed to create PVC. Error: {e}")

Contributor Author

@mahdikhashan mahdikhashan Jan 27, 2025

I agree with the ValueError message (with a small change), but not with combining it with the else branch, since then we would need to wait for these steps first:

pvc_list = self.core_api.list_namespaced_persistent_volume_claim(
    namespace=namespace
)
# Check if the PVC with the specified name exists.
for pvc in pvc_list.items:
    if pvc.metadata.name == name:
        print(
            f"PVC '{name}' already exists in namespace " f"{namespace}."
        )
        break

So my idea is that the function fails fast when the name is invalid and, if it is valid, continues with the check for whether the name already exists. WDYT?

Contributor

That makes sense.

Signed-off-by: mahdikhashan <[email protected]>
@helenxie-bit
Contributor

Thanks for the contribution!

/lgtm
/assign @kubeflow/wg-automl-leads @Electronic-Waste

@google-oss-prow google-oss-prow bot added the lgtm label Jan 27, 2025
        f"alphanumeric characters ('a-z', '0-9'), hyphens ('-'), or periods ('.'). "
        f"It must also start and end with an alphanumeric character."
    )

pvc_list = self.core_api.list_namespaced_persistent_volume_claim(
Member

Should we also simplify this logic, similar to this one:

raise Exception(
    f"A Katib Experiment with the name "
    f"{namespace}/{experiment_name} already exists."
)

E.g. if the status code is 409, we just print that the PVC already exists.

Member

Contributor

If we add more details to the error message, it’ll make it easier for users to understand, which is the goal of this PR. But you’re right—Kubernetes API will also return detailed error reasons. So it depends on whether we want to keep the error messages consistent across the board.

Contributor Author

My opinion leans a bit more toward Helen's, but I'm open to any changes.

Member

I just meant that all of these:

                pvc_list = self.core_api.list_namespaced_persistent_volume_claim(
                    namespace=namespace
                )
                # Check if the PVC with the specified name exists.
                for pvc in pvc_list.items:
                    if pvc.metadata.name == name:
                        print(
                            f"PVC '{name}' already exists in namespace " f"{namespace}."
                        )
                        break
                else:
                    raise RuntimeError(f"failed to create PVC. Error: {e}")

can be replaced with

                elif hasattr(e, "status") and e.status == 409:
                    print(f"PVC '{name}' already exists in namespace " f"{namespace}.")
                else:
                    raise RuntimeError(f"failed to create PVC. Error: {e}")

Does it make sense?
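A rough way to exercise the three branches proposed above (422, 409, anything else) without a cluster is to drive them with a mock client. This is a sketch under stated assumptions: `ensure_pvc`, `outcome`, and `FakeApiError` are hypothetical names, the request body is a placeholder, and the 422 message wording is illustrative rather than the text that was merged.

```python
from unittest.mock import Mock


class FakeApiError(Exception):
    """Hypothetical stand-in for a Kubernetes API error with an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def ensure_pvc(core_api, name: str, namespace: str) -> None:
    """Hypothetical helper: try to create a PVC and branch on the error status."""
    try:
        core_api.create_namespaced_persistent_volume_claim(
            namespace=namespace,
            body={"metadata": {"name": name}},  # placeholder body
        )
    except Exception as e:
        status = getattr(e, "status", None)
        if status == 422:
            # Invalid name: fail fast with a readable explanation.
            raise ValueError(
                f"The PVC name '{name}' is not valid. It must consist of "
                f"lowercase alphanumeric characters, '-' or '.', and must "
                f"start and end with an alphanumeric character."
            )
        elif status == 409:
            # Already exists: report it and continue.
            print(f"PVC '{name}' already exists in namespace {namespace}.")
        else:
            raise RuntimeError(f"failed to create PVC. Error: {e}")


def outcome(status):
    """Run ensure_pvc against a mock failing with `status` (None = success)."""
    api = Mock()
    if status is not None:
        api.create_namespaced_persistent_volume_claim.side_effect = FakeApiError(
            status
        )
    try:
        ensure_pvc(api, "my-pvc", "kubeflow")
    except (ValueError, RuntimeError) as err:
        return type(err).__name__
    return "ok"


for code in (None, 409, 422, 500):
    print(code, "->", outcome(code))
```

Compared with listing all PVCs in the namespace and scanning for the name, branching on the status code avoids a second API call in the error path.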

Contributor

Oh I see, SGTM 😄

Contributor Author

yes, agreed.

Contributor Author

done.

Signed-off-by: mahdikhashan <[email protected]>
Member

@andreyvelich andreyvelich left a comment

Thanks for this contribution @mahdikhashan!
/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Jan 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich


@google-oss-prow google-oss-prow bot merged commit 09523cd into kubeflow:master Jan 28, 2025
63 checks passed
Development

Successfully merging this pull request may close these issues.

[SDK] improve PVC error message
4 participants