Add pipeline launcher components for other distributed training jobs #3445

Closed
Jeffwan opened this issue Apr 5, 2020 · 21 comments

Labels
area/sdk/dsl, lifecycle/stale, status/triaged

Comments

@Jeffwan
Member

Jeffwan commented Apr 5, 2020

To leverage different training operators in Kubeflow Pipelines, it would be better to provide high-level launcher components as an abstraction for invoking training jobs.

katib-launcher and launcher are the launcher components for Katib and tf-operator. We definitely need similar components for PyTorch, MXNet, MPI, XGBoost, etc.

https://github.com/kubeflow/pipelines/tree/master/components/kubeflow

@Ark-kun
Contributor

Ark-kun commented Apr 6, 2020

What do you think about having generic launcher components that receive a resolved, serialized TaskSpec (or a container image + command line) and launch the given component?

What do you think about syntax like this?

MyLauncher = load_component(...)
with dsl.use_launcher(MyLauncher(num_workers=10)):
    launched_task = XGBoostTrainer(training_data=..., num_trees=500)

or

MyLauncher = load_component(...)
launcher_for_train = MyLauncher(
    num_workers=10,
    task=XGBoostTrainer(training_data=..., num_trees=500),
)
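To make the "container image + command line" variant concrete, here is a minimal sketch of what a generic launcher might do with those inputs: wrap them in a distributed-job manifest. The function `wrap_in_pytorchjob`, its parameters, and the image name are hypothetical and not part of the KFP SDK. The manifest layout is assumed to follow the `kubeflow.org/v1` PyTorchJob format used in the pytorch-operator examples.

```python
from typing import Any, Dict, List


def wrap_in_pytorchjob(
    name: str,
    image: str,
    command: List[str],
    num_workers: int,
) -> Dict[str, Any]:
    """Wrap a resolved component (image + command line) in a PyTorchJob manifest.

    Hypothetical helper for illustration only; not an existing KFP API.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")

    def replica(replicas: int) -> Dict[str, Any]:
        # Every replica runs the same container the component would have run.
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {"name": "pytorch", "image": image, "command": list(command)}
                    ]
                }
            },
        }

    specs = {"Master": replica(1)}
    if num_workers > 0:
        specs["Worker"] = replica(num_workers)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": specs},
    }


# Placeholder image and command; a real launcher would take these from the task spec.
manifest = wrap_in_pytorchjob(
    name="train",
    image="example.com/trainer:latest",
    command=["python", "train.py", "--num-trees", "500"],
    num_workers=10,
)
```

Under either proposed syntax, the launcher component would be the piece that produces and submits a manifest like this instead of running the container directly.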

@Ark-kun Ark-kun self-assigned this Apr 6, 2020
@Bobgy Bobgy added the status/triaged label Apr 15, 2020
@stale

stale bot commented Jul 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jul 14, 2020
@stale

stale bot commented Jul 22, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@Jeffwan
Member Author

Jeffwan commented Nov 17, 2020

/reopen

@k8s-ci-robot
Contributor

@Jeffwan: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 17, 2020
@stale stale bot removed the lifecycle/stale label Nov 17, 2020
@midhun1998
Member

midhun1998 commented Feb 18, 2021

Hi @Jeffwan and @Ark-kun. I would like to contribute to this issue. Please let me know how I can help. :)

@stale

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jun 2, 2021
@wangli1426

wangli1426 commented Jun 20, 2021

Any update to this feature?

I believe it would be great if Kubeflow Pipelines could provide a generic launcher that creates a CRD, such as an MPIJob or PyTorchJob, and manages its lifespan.

This requirement can be partially satisfied by using a Katib Experiment. However, as far as I know, this approach has some clear drawbacks:

  • The launcher pod can fail while the CRD is still running.
  • An experiment can have multiple trials, each of which can be a CRD, such as an MPIJob.

Thus, it is desirable to have a GenericLauncher in Kubeflow Pipelines, and an operator to manage the lifespan of the launcher pod and the created CRDs.
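One piece of such a launcher would be tracking the created resource until it reaches a terminal state. The sketch below assumes the job reports `status.conditions` entries with a `type` of `Succeeded` or `Failed` and a `status` of `"True"`, as the Kubeflow training operators do. The names `job_phase` and `wait_for_job` are hypothetical, and `get_status` stands in for whatever call fetches the resource's status from the cluster.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_TYPES = ("Succeeded", "Failed")


def job_phase(status: Dict[str, Any]) -> str:
    """Return 'Succeeded', 'Failed', or 'Running' from a job's status block."""
    for condition in status.get("conditions", []):
        # Only conditions that are currently true count.
        if condition.get("status") == "True" and condition.get("type") in TERMINAL_TYPES:
            return condition["type"]
    return "Running"


def wait_for_job(
    get_status: Callable[[], Dict[str, Any]],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the job succeeds or fails, or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        phase = job_phase(get_status())
        if phase != "Running":
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(poll_seconds)
```

Polling from the launcher pod does not by itself solve the first drawback in the list: if the launcher dies, nothing is left to watch the job. That is the part an operator would have to cover.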

@stale stale bot removed the lifecycle/stale label Jun 20, 2021
@jalola

jalola commented Jun 25, 2021

Hi,
I am also looking for this feature, especially for PyTorch. The PR for it (#5170) seems to have been paused for some time.

I could run distributed training using a PyTorchJob created by ResourceOp. The disadvantage of this approach is that the pipeline UI does not show the logs of the worker containers; it only shows the logs of the job controller.

@ca-scribner please help continue the PR, thanks a lot.

@wangli1426

@jalola Thanks for the info. Would you mind sharing an example of how to define a PyTorchJob with the help of ResourceOp? Thanks in advance.

@jalola

jalola commented Jun 25, 2021

@wangli1426
A simple ResourceOp example: https://github.com/kubeflow/pipelines/blob/master/samples/core/resource_ops/resource_ops.py

For the PyTorchJob: https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/v1/pytorch_job_mnist_nccl.yaml
You can convert it to JSON.

Remember to set success_condition, for example:
success_condition='status.replicaStatuses.Worker.succeeded==3,status.replicaStatuses.Chief.succeeded==1'
https://github.com/kubeflow/pipelines/blob/master/samples/contrib/e2e-mnist/mnist-pipeline.ipynb
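Putting those pointers together, a minimal sketch might look like the following. It assumes the KFP v1 SDK, where `dsl.ResourceOp` accepts `k8s_resource`, `action`, and `success_condition`. The helper names `replica_success_condition` and `launch_job` are hypothetical, and the manifest is whatever dict you obtained by converting the YAML.

```python
from typing import Any, Dict


def replica_success_condition(**replica_counts: int) -> str:
    """Build a success_condition that requires N succeeded pods per replica type."""
    return ",".join(
        f"status.replicaStatuses.{replica}.succeeded=={count}"
        for replica, count in replica_counts.items()
    )


def launch_job(manifest: Dict[str, Any], **replica_counts: int):
    """Create the job resource; call this from inside a @dsl.pipeline function."""
    # Imported here so the helper above stays usable without kfp installed.
    from kfp import dsl

    return dsl.ResourceOp(
        name="launch-training-job",
        k8s_resource=manifest,
        action="create",
        success_condition=replica_success_condition(**replica_counts),
    )


print(replica_success_condition(Worker=3, Chief=1))
# → status.replicaStatuses.Worker.succeeded==3,status.replicaStatuses.Chief.succeeded==1
```

The replica names passed as keyword arguments must match the replica types your job kind actually reports under `status.replicaStatuses`.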

@midhun1998
Member

Hi @jalola. Just wondering how we can stream all worker logs (when the number of workers is > 1) into the pipeline log console? Or were you looking for just the chief's logs? Do you have any ideas?

@jalola

jalola commented Jun 25, 2021

I only know that they have a client SDK for getting logs.
Example:
https://github.com/kubeflow/pytorch-operator/blob/4aeb6503162465766476519339d3285f75ffe03e/sdk/python/examples/kubeflow-pytorchjob-sdk.ipynb

API: https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/docs/PyTorchJobClient.md#get_logs

But I don't know how to show the logs to a component of pipeline.

@ca-scribner
Contributor

ca-scribner commented Jun 25, 2021 via email

@Ark-kun
Contributor

Ark-kun commented Jul 11, 2021

But I don't know how to show the logs to a component of pipeline.

You could just print them.

@jalola

jalola commented Jul 12, 2021

But I don't know how to show the logs to a component of pipeline.

You could just print them.

I am using the k8s_client API (Watch and read_namespaced_pod_log) to stream the logs from the training pod. This works.
PyTorchJobClient get_logs(follow=True) does not stream the logs line by line; it returns the whole log only once training finishes.
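A minimal sketch of the Watch-based approach, combined with the "just print them" suggestion. It assumes the `kubernetes` Python client is installed and that the component runs inside the cluster with permission to read pod logs. The function names and the per-pod prefix are hypothetical choices, not part of any Kubeflow SDK.

```python
import sys
from typing import Iterable, Optional, TextIO


def relay_lines(lines: Iterable[str], prefix: str, out: TextIO = sys.stdout) -> int:
    """Print each log line with a prefix as soon as it arrives; return the count."""
    count = 0
    for line in lines:
        out.write(f"[{prefix}] {line.rstrip()}\n")
        out.flush()  # flush per line so the pipeline UI shows progress live
        count += 1
    return count


def stream_pod_logs(namespace: str, pod_name: str, container: Optional[str] = None) -> int:
    """Follow one pod's logs and relay them to this component's stdout."""
    # Imported here so relay_lines stays usable without the kubernetes package.
    from kubernetes import client, config, watch

    config.load_incluster_config()
    core = client.CoreV1Api()
    kwargs = {"name": pod_name, "namespace": namespace}
    if container is not None:
        kwargs["container"] = container
    # Watch.stream follows the log and yields it one line at a time.
    lines = watch.Watch().stream(core.read_namespaced_pod_log, **kwargs)
    return relay_lines(lines, prefix=pod_name)
```

With several workers, one option is to call `stream_pod_logs` once per pod in separate threads; the prefix keeps the interleaved lines attributable to a worker.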

@Ark-kun Another problem I found when using launch_crd: if a user terminates the pipeline run in Kubeflow Pipelines, only the training controller pod (the launch_crd pod) is deleted, and the distributed training pods keep running. What do you think? If you have any advice, I could implement it in #5170.
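One possible way to avoid orphaned training pods, sketched here under assumptions and not necessarily what #5170 does: set an owner reference on the custom resource that points at the launcher pod, so that Kubernetes garbage collection removes the job once that pod is deleted. The helper name is hypothetical. The launcher would need to know its own pod name and UID (for example, injected through the downward API), and the owner must live in the same namespace as the job.

```python
import copy
from typing import Any, Dict


def with_owner_reference(
    manifest: Dict[str, Any],
    owner_name: str,
    owner_uid: str,
    owner_kind: str = "Pod",
    owner_api_version: str = "v1",
) -> Dict[str, Any]:
    """Return a copy of the manifest that is owned by the given object."""
    owned = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    metadata = owned.setdefault("metadata", {})
    metadata.setdefault("ownerReferences", []).append(
        {
            "apiVersion": owner_api_version,
            "kind": owner_kind,
            "name": owner_name,
            "uid": owner_uid,
        }
    )
    return owned
```

The `owner_kind` and `owner_api_version` parameters are there because the owner does not have to be the launcher pod. Pointing at a longer-lived object that is removed together with the run may suit some setups better.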

@stale

stale bot commented Mar 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 3, 2022
@dkmiller

dkmiller commented Aug 4, 2022

Hi everyone, I'm quite interested in this as well. Is there any progress towards built-in support for distributed training jobs in pipelines?

@stale stale bot removed the lifecycle/stale label Aug 4, 2022
@bhack

bhack commented Dec 24, 2023

Is this still on the roadmap?

@Ark-kun Ark-kun removed their assignment Dec 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale label Jun 25, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
