[SDK] Support PyTorchJob as a Trial Worker #2512

Merged
merged 11 commits into from
Feb 13, 2025

Conversation

andreyvelich
Member

@andreyvelich andreyvelich commented Feb 12, 2025

This PR adds support for PyTorchJob as a Trial Worker, for users who want to optimize hyperparameters (HPs) with distributed training.

I've reduced duplication by reusing a few assets from the Kubeflow Training Operator SDK (v1.9.0), and added it as a default dependency of the Kubeflow Katib SDK.

It would be nice to have this feature before we cut RC.0 for Katib 0.18.0.

TODO:

  • Update unit tests

/assign @kubeflow/wg-training-leads @helenxie-bit @Electronic-Waste @mahdikhashan @truc0 @shashank-iitbhu @astefanutti

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich changed the title [WIP] [SDK] Support PyTorchJob as Trial Worker [WIP] [SDK] Support PyTorchJob as a Trial Worker Feb 12, 2025
@mahdikhashan
Contributor

thank you for trusting me to review this. would an e2e test be helpful since we're using PyTorchJob under the hood? everything seems good. thank you.

@mahdikhashan
Contributor

thank you for trusting me to review this. would an e2e test be helpful since we're using PyTorchJob under the hood? everything seems good. thank you.

I'll try to check the functionality after having unit tests.

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich changed the title [WIP] [SDK] Support PyTorchJob as a Trial Worker [SDK] Support PyTorchJob as a Trial Worker Feb 12, 2025
@andreyvelich
Member Author

/assign @kubeflow/wg-training-leads @mahdikhashan @Electronic-Waste @helenxie-bit I've added unit and e2e tests.
Please take a look!

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the sdk-support-pytorchjob branch from b1274fc to 238fda3 Compare February 12, 2025 16:39
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

@saileshd1402 saileshd1402 left a comment

Basically LGTM, I'll try running it myself once. Thanks!

# Restart policy must be set for the Job.
pod_template_spec.spec.restart_policy = "Never" # type: ignore

# Use PyTorchJob as a Trial spec.

please update comment
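
For readers unfamiliar with the shape of a PyTorchJob, the snippet under review can be illustrated with a plain-dict sketch. This is not the SDK's implementation: `build_pytorchjob_trial_spec` is a hypothetical helper, the image name is a placeholder, and counting the master as one of the workers is an assumption made only for this example. It shows the two points from the reviewed lines: the pod template sets the restart policy to `Never`, and the Trial spec is a `PyTorchJob` resource.

```python
import copy


def build_pytorchjob_trial_spec(image: str, num_workers: int, cpu: str) -> dict:
    """Hypothetical helper: build a PyTorchJob manifest as a plain dict."""
    pod_template = {
        "spec": {
            # Mirrors the reviewed snippet: the restart policy is set explicitly.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "pytorch",
                    "image": image,
                    "resources": {
                        "requests": {"cpu": cpu},
                        "limits": {"cpu": cpu},
                    },
                }
            ],
        }
    }

    # Assumption for this sketch: the master counts as one of the workers.
    replica_specs = {
        "Master": {"replicas": 1, "template": copy.deepcopy(pod_template)}
    }
    if num_workers > 1:
        replica_specs["Worker"] = {
            "replicas": num_workers - 1,
            "template": copy.deepcopy(pod_template),
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "spec": {"pytorchReplicaSpecs": replica_specs},
    }


# Placeholder image name, used only for illustration.
trial_spec = build_pytorchjob_trial_spec("example.invalid/train:latest", 3, "100m")
print(trial_spec["kind"], list(trial_spec["spec"]["pytorchReplicaSpecs"]))
```

Each replica gets its own deep copy of the pod template so that later edits to one replica's spec do not leak into the other.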

@@ -44,7 +43,7 @@ def objective(parameters):
 parameters=parameters,
 objective_metric_name="result",
 max_trial_count=4,
-resources_per_trial={"cpu": "2"},
+resources_per_trial={"cpu": "100m"},
Contributor

thanks for using milliCPU here.
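
To make the size of that change concrete, here is a minimal sketch of how Kubernetes CPU quantities compare. `cpu_quantity_to_cores` is a hypothetical helper written for this illustration, not part of the Katib SDK, and it only handles plain numbers and the `m` (millicore) suffix.

```python
from decimal import Decimal


def cpu_quantity_to_cores(quantity: str) -> Decimal:
    """Convert a Kubernetes CPU quantity such as "2" or "100m" to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 1000m equals one full core.
        return Decimal(quantity[:-1]) / Decimal(1000)
    return Decimal(quantity)


old_request = cpu_quantity_to_cores("2")     # value before the change
new_request = cpu_quantity_to_cores("100m")  # value after the change
print(f"old={old_request} cores, new={new_request} cores")
```

Under this reading, each Trial now requests a tenth of a core instead of two full cores, which is easier to schedule on small test clusters.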

@mahdikhashan
Contributor

unit tests passed successfully on my local, however after 13 mins the e2e test state in ui is Couldn't find any successful Trial..

my k8s version is 1.30.2 (which is not the same as the ci), my os is mac sonoma.

i'd say its fine as long as it is passing on ci.

thanks for your time and trusting me on reviewing it.

@andreyvelich
Member Author

unit tests passed successfully on my local, however after 13 mins the e2e test state in ui is Couldn't find any successful Trial..

my k8s version is 1.30.2 (which is not the same as the ci), my os is mac sonoma.

i'd say its fine as long as it is passing on ci.

thanks for your time and trusting me on reviewing it.

Did you install Training Operator 1.9.0 before running the example?
As you can see, the E2E test passes in this PR.

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich
Member Author

/assign @mahdikhashan @saileshd1402
Any additional comments, or can we merge it so that I can release the first RC.0?

@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @mahdikhashan @saileshd1402
Any additional comments or we can merge it, so I will release a first RC.0 ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mahdikhashan
Contributor

mahdikhashan commented Feb 13, 2025

unit tests passed successfully on my local, however after 13 mins the e2e test state in ui is Couldn't find any successful Trial..
my k8s version is 1.30.2 (which is not the same as the ci), my os is mac sonoma.
i'd say its fine as long as it is passing on ci.
thanks for your time and trusting me on reviewing it.

Did you install Training Operator 1.9.0 before running an example ? As you can see the E2E works fine in this PR.

yes, i did it. I need to make sure my own env is correct - passing in pr is enough. thanks.

@mahdikhashan
Contributor

/assign @mahdikhashan @saileshd1402 Any additional comments or we can merge it, so I will release a first RC.0 ?

thank you - for me its fine to merge it.

Member

@tenzen-y tenzen-y left a comment

Thx!

/lgtm
/approve

@mahdikhashan: changing LGTM is restricted to collaborators

In response to this:

/lgtm

@google-oss-prow google-oss-prow bot merged commit 7b46520 into kubeflow:master Feb 13, 2025
66 checks passed
Member

@Electronic-Waste Electronic-Waste left a comment

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Electronic-Waste, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich andreyvelich deleted the sdk-support-pytorchjob branch February 13, 2025 11:41
5 participants