[SDK] Use Elastic Policy and torchrun as a Default Entrypoint for PyTorchJob #1991
Comments
The
I'm not sure why we need to use ElasticPolicy as a default once we switch to the
@tenzen-y I think environment variables like PET_RDZV_ENDPOINT, PET_RDZV_BACKEND, etc. are set for the containers only when we pass the elastic policy spec (https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/pkg/controller.v1/pytorch/envvar.go#L109). Those environment variables are necessary to start multi-node training, as mentioned in .
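For context on why these variables matter: torchrun reads its value-taking launcher options from environment variables with the `PET_` prefix when the matching CLI flag is not given, so `PET_RDZV_BACKEND=c10d` acts like `--rdzv_backend=c10d`. A minimal sketch of that correspondence follows; the helper name and the endpoint value are hypothetical and only for illustration, and the mapping is a simplification that covers value options only.

```python
def pet_env_to_torchrun_flags(env):
    """Translate PET_* variables into the torchrun flags they stand in for.

    Hypothetical helper for illustration; torchrun does this lookup itself.
    """
    flags = []
    for name in sorted(env):
        if name.startswith("PET_"):
            option = name[len("PET_"):].lower()
            flags.append(f"--{option}={env[name]}")
    return flags


example_env = {
    "PET_RDZV_BACKEND": "c10d",
    "PET_RDZV_ENDPOINT": "example-master-0:29400",  # hypothetical endpoint
    "PATH": "/usr/bin",  # unrelated variables are ignored
}
print(pet_env_to_torchrun_flags(example_env))
```

If the controller only injects these variables when an elastic policy is present, a bare `torchrun` command without that policy would have no rendezvous settings to pick up.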
It makes sense. Thank you for investigating this :)
Hi @andreyvelich @tenzen-y, has this issue been discussed clearly? If so, can I take it and do some implementation?
Hi @ckcd, we haven't had a chance to discuss this issue in detail yet.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove lifecycle/stale
@andreyvelich: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/good-first-issue
@andreyvelich: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @andreyvelich
Currently, we use `python3` as the entrypoint when creating a Training Job from a function: https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/sdk/python/kubeflow/training/utils/utils.py#L230

Since `torchrun` is the recommended entrypoint for running distributed PyTorch, we should discuss whether we need to change the entrypoint for a PyTorchJob created from a function.

Also, we need to set the ElasticPolicy `c10d` backend.

We need to make sure that we can use `torchrun` with PyTorch code that is not using distributed capabilities.

cc @johnugeorge @tenzen-y @deepanker13 @kuizhiqing
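
A rough sketch of what the proposal could look like, not the SDK's actual code: `build_command` and `is_distributed` are hypothetical helpers, and the `elasticPolicy` / `rdzvBackend` field names are my assumption about the PyTorchJob spec shape. The `WORLD_SIZE` check relies on torchrun exporting that variable to every worker, so code that never calls into `torch.distributed` can simply ignore it.

```python
import os


def build_command(script, use_torchrun=True):
    """Return the container command for a training script (hypothetical helper)."""
    if use_torchrun:
        # Rendezvous settings are expected to arrive via PET_* env variables,
        # so no rendezvous flags are passed on the command line here.
        return ["torchrun", script]
    return ["python3", script]


def is_distributed(env=None):
    """True when the launcher started more than one worker."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1")) > 1


# Assumed shape of the spec fragment that selects the c10d rendezvous backend.
elastic_policy_fragment = {"elasticPolicy": {"rdzvBackend": "c10d"}}

print(build_command("train.py"))
print(is_distributed({"WORLD_SIZE": "1"}))
```

A training function could guard its `torch.distributed.init_process_group` call with a check like `is_distributed()`, which would let the same `torchrun` entrypoint serve both single-process and multi-node code.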