[SDK] Support PyTorchJob as a Trial Worker #2512
Conversation
Thank you for trusting me to review this. Would an e2e test be helpful, since we're using PyTorchJob under the hood? Everything else looks good. Thank you.
I'll try to check the functionality once unit tests are added.
/assign @kubeflow/wg-training-leads @mahdikhashan @Electronic-Waste @helenxie-bit
I've added unit and e2e tests.
Force-pushed from b1274fc to 238fda3 (compare).
Basically LGTM, I'll try running it myself once. Thanks!
# Restart policy must be set for the Job.
pod_template_spec.spec.restart_policy = "Never"  # type: ignore

# Use PyTorchJob as a Trial spec.
Please update the comment.
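For context, here is a hedged sketch of how a Trial's pod template (with restart_policy set to "Never") could be wrapped into a PyTorchJob using the Training Operator v1.9.0 Python models that this PR re-uses. The model classes below exist in the kubeflow-training SDK, but the exact wiring the PR adds may differ, and the container image and replica counts are placeholders.

import kubernetes.client as k8s
from kubeflow.training import models

# Trial pod template; the PyTorchJob container must be named "pytorch".
pod_template_spec = k8s.V1PodTemplateSpec(
    spec=k8s.V1PodSpec(
        restart_policy="Never",  # Restart policy must be set for the Job.
        containers=[
            k8s.V1Container(name="pytorch", image="docker.io/library/python:3.11"),
        ],
    ),
)

# Wrap the pod template into a PyTorchJob that serves as the Trial worker.
pytorchjob = models.KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=k8s.V1ObjectMeta(name="trial-worker"),
    spec=models.KubeflowOrgV1PyTorchJobSpec(
        run_policy=models.KubeflowOrgV1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={
            "Master": models.KubeflowOrgV1ReplicaSpec(replicas=1, template=pod_template_spec),
            "Worker": models.KubeflowOrgV1ReplicaSpec(replicas=2, template=pod_template_spec),
        },
    ),
)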
@@ -44,7 +43,7 @@ def objective(parameters):
     parameters=parameters,
     objective_metric_name="result",
     max_trial_count=4,
-    resources_per_trial={"cpu": "2"},
+    resources_per_trial={"cpu": "100m"},
Thanks for using milliCPU here.
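For readers of the thread, a minimal sketch of the kind of tune() call this diff belongs to, following the Katib SDK getting-started example; the surrounding details (namespace, Experiment name, search ranges) are assumptions, only the arguments visible in the diff are taken from this PR.

import kubeflow.katib as katib


def objective(parameters):
    # Toy objective: print the metric in "name=value" form so Katib can collect it.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    print(f"result={result}")


katib_client = katib.KatibClient(namespace="default")
katib_client.tune(
    name="tune-experiment",
    objective=objective,
    parameters={
        "a": katib.search.int(min=10, max=20),
        "b": katib.search.double(min=0.1, max=0.2),
    },
    objective_metric_name="result",
    max_trial_count=4,
    # 100 millicores per Trial keeps the e2e cluster from being over-subscribed.
    resources_per_trial={"cpu": "100m"},
)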
Unit tests passed successfully on my local machine; however, after 13 minutes the e2e test state in the UI is as pictured. My k8s version is 1.30.2 (which is not the same as the CI), and my OS is macOS Sonoma. I'd say it's fine as long as it passes on CI. Thanks for your time and for trusting me to review it.
Did you install Training Operator 1.9.0 before running the example?
/assign @mahdikhashan @saileshd1402
@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Yes, I did. I need to make sure my own env is correct; the PR passing in CI is enough. Thanks.
Thank you. For me it's fine to merge it.
Thx!
/lgtm
/approve
@mahdikhashan: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Electronic-Waste, tenzen-y. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This adds support for PyTorchJob as a Trial worker, so that users can optimize hyperparameters with distributed training.
I've reduced duplication by re-using a few assets from the Kubeflow Training Operator SDK (v1.9.0) and added it as a default dependency of the Kubeflow Katib SDK.
It would be nice to have this feature before we cut RC.0 for Katib 0.18.0.
TODO:
/assign @kubeflow/wg-training-leads @helenxie-bit @Electronic-Waste @mahdikhashan @truc0 @shashank-iitbhu @astefanutti
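A hedged sketch of what the new feature might look like from the SDK side: tuning hyperparameters while each Trial runs as a distributed PyTorchJob. The TrainerResources class, its import path, and its argument names are assumptions about the new SDK surface rather than confirmed API; the training function body is a placeholder.

import kubeflow.katib as katib
from kubeflow.katib import types  # assumed location of TrainerResources


def train_mnist(parameters):
    # Real distributed training code would go here; each Trial reports the
    # objective metric in "name=value" form, e.g. print(f"loss={loss}").
    print(f"loss={float(parameters['lr'])}")


katib_client = katib.KatibClient(namespace="default")
katib_client.tune(
    name="pytorchjob-tune",
    objective=train_mnist,
    parameters={"lr": katib.search.double(min=1e-4, max=1e-1)},
    objective_metric_name="loss",
    objective_type="minimize",
    max_trial_count=4,
    # Assumed: run every Trial as a PyTorchJob with multiple workers instead
    # of a plain batch Job (a dict here would keep the old single-pod behavior).
    resources_per_trial=types.TrainerResources(
        num_workers=2,
        num_procs_per_worker=2,
        resources_per_worker={"cpu": "100m"},
    ),
)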