
TFJob should work well with pipelines #677

Closed
jlewi opened this issue Jan 14, 2019 · 25 comments

@jlewi
Contributor

jlewi commented Jan 14, 2019

I'm creating this uber issue to track issues related to making pipelines work well with TFJob. Hopefully this can serve as a model for how to make pipelines work well with other custom resources.

Here's a list of known issues
#407 TFJob doesn't forward error logs from the jobs
#408 TFJob doesn't stop trainer jobs after timeout
#218 provide argument to assign GCP service account

In addition, I think there are a few other issues:

  • The launcher is getting credentials in a GKE-specific way

    • This seems unnecessary; if this code is running in a pod whose service account has suitable RBAC permissions, it should be able to authenticate to the master in a way that is independent of the cloud provider
  • The launcher doesn't seem to be resilient to pod restarts

    • The launcher is waiting for the job to finish
    • Jobs could be long-running, so it's possible the pipelines pod will get preempted while waiting for the TFJob to finish
    • It looks like the launcher is using generateName to assign a name to the job.
    • So if the pod gets preempted, it will restart and launch another instance of the job under a new unique name (see the sketch after this list).
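
For illustration, here is a minimal sketch of the deterministic-name check described above. It assumes the official Kubernetes Python client and the TFJob custom resource under group kubeflow.org, version v1; the function name and manifest are placeholders, not the launcher's actual code.

    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    def create_tfjob_if_absent(namespace, tfjob_manifest):
        """Create the TFJob only if one with the same (deterministic) name does not exist yet."""
        config.load_incluster_config()  # assumes the launcher runs in a pod with suitable RBAC
        api = client.CustomObjectsApi()
        name = tfjob_manifest["metadata"]["name"]  # a deterministic name, not generateName
        try:
            # If the pod was restarted, the job may already be running; reuse it.
            return api.get_namespaced_custom_object(
                group="kubeflow.org", version="v1", namespace=namespace,
                plural="tfjobs", name=name)
        except ApiException as e:
            if e.status != 404:
                raise
        return api.create_namespaced_custom_object(
            group="kubeflow.org", version="v1", namespace=namespace,
            plural="tfjobs", body=tfjob_manifest)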
@jlewi
Contributor Author

jlewi commented Jan 14, 2019

Marking this as P1 for 0.5 because it would be great to have a canonical example of how to integrate with custom resources.

@jlewi jlewi added the help wanted (The community is welcome to contribute.) label Jan 22, 2019
@jlewi
Contributor Author

jlewi commented Jan 23, 2019

Here are some thoughts on how this should work.

We should make the following changes to https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/launcher/src/launch_tf_job.py#L112

  • Delete the references to gcloud that are used to get credentials.

  • For credentials, I think we should support two methods:

    1. If the KUBE_CONFIG environment variable is set, just call the K8s client library method to load it
    2. Otherwise, assume we are running in-cluster and use the pod's service account
  • The job name should be deterministic

    • This way, if the pod restarts, we can check whether a job with that name already exists
  • We should move the logic inside "main" into a function e.g.

    def train_my_model(input, output):
      ...
    
  • We should use func_to_container_op to turn this into a pipeline op (see the sketch after this list)

    • We can create a suitable Dockerfile and image to use as the base image
    • I don't really think we need 1 Docker image per K8s resource; that seems verbose
    • Most of the required dependencies (e.g. the K8s client library) will be the same regardless of the K8s resource
    • We can create generic utility functions like wait_for_condition that can be reused across resources
    • We might also want to include tools like ksonnet and helm so that people can easily reuse their
      existing templates.
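
As a rough sketch of the two credential paths and the wrapping step (assuming kfp's func_to_container_op; the training function body and the base image below are placeholders):

    import os

    import kfp.components as comp
    from kubernetes import config

    def load_kube_config():
        """Prefer an explicit kubeconfig; otherwise fall back to the pod's service account."""
        kube_config = os.environ.get("KUBE_CONFIG")
        if kube_config:
            config.load_kube_config(config_file=kube_config)
        else:
            config.load_incluster_config()

    def train_my_model(input_path: str, output_path: str):
        # Placeholder body: load the config, create the TFJob under a deterministic
        # name, then wait for a terminal condition (Succeeded / Failed).
        ...

    # The base image (a placeholder here) bundles the shared dependencies, e.g. the
    # K8s client library and generic helpers such as wait_for_condition.
    train_op = comp.func_to_container_op(
        train_my_model, base_image="gcr.io/my-project/kfp-launcher-base:latest")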

A good way to prototype this might be to add a pipeline to the mnist example
https://github.com/kubeflow/examples/tree/master/mnist
rather than focusing on updating the existing TFJob component.

See also: #29

@amygdala
Contributor

amygdala commented Feb 4, 2019

On this point:

The launcher is getting credentials in a GKE-specific way

I don't think this is necessary any more (IIRC, it was used prior to using the tf-job client directly). When I removed it from my own similar code, things worked.

(However, on a related note, adding .apply(gcp.use_gcp_secret('user-gcp-sa')) to the launcher step seems to break things, as that apparently does not allow creating cluster resources. Not sure if that's a bug exactly but it seems problematic. #705 )

@randxie

randxie commented Mar 6, 2019

Adding another issue: when using TPUs with TFJob, I found that the TFJob does not reuse the allocated Cloud TPU if the pod fails and gets restarted. Instead, it tries to create a new TPU instance.

@jlewi
Contributor Author

jlewi commented Mar 11, 2019

@randxie Can you file a separate issue regarding TFJob and TPU in kubeflow/tf-operator? I don't think that has anything to do with TFJob and pipelines.

@jlewi
Contributor Author

jlewi commented Mar 11, 2019

/cc @hougangliu since he just added a component for StudyJob.

@jessiezcc
Contributor

@jlewi It looks like most of these issues are already closed; what work relevant to pipelines remains open?

@jlewi
Contributor Author

jlewi commented Jul 15, 2019

@jessiezcc It looks like the pipeline launcher isn't resilient to the pod failing; see my initial comment on this issue. I believe that's still a problem.

@cyliu0204

Hi all,
After reading the comments above, I still cannot figure out how to run distributed training with TensorFlow or PyTorch from pipelines. How is this issue progressing?

@mak-454

mak-454 commented Oct 2, 2019

Hi, if the launcher is to be used for TFJob or any other custom resource, what is the plan for handling volumes? If volumes are used with ContainerOp, they will be attached to the launcher and not to the training pods launched by the TFJob.

@mak-454

mak-454 commented Oct 2, 2019

I gather from the discussions in
#801 (comment)
#1345

that I should use ResourceOp for TFJob with volumes until PR #1494 is merged.

@jlewi
Contributor Author

jlewi commented Oct 5, 2019

@mak-454 The current recommendation would be to write some Python code to create the TFJob and then execute that code as a step in your graph.

Your Python function can attach any volumes needed to the TFJob.
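
For illustration, a minimal sketch of such a function, building a TFJob spec with a volume attached to the training pods rather than to the launcher (the PVC name, image, mount path, and replica count are placeholders, assuming the TFJob v1 schema):

    def tfjob_with_volume(name, namespace, image, pvc_name):
        """Return a TFJob manifest whose worker pods mount a shared PVC at /data."""
        container = {
            "name": "tensorflow",  # the container name the TFJob operator expects
            "image": image,
            "volumeMounts": [{"name": "training-data", "mountPath": "/data"}],
        }
        pod_template = {
            "spec": {
                "containers": [container],
                "volumes": [{
                    "name": "training-data",
                    "persistentVolumeClaim": {"claimName": pvc_name},
                }],
            }
        }
        return {
            "apiVersion": "kubeflow.org/v1",
            "kind": "TFJob",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {
                "tfReplicaSpecs": {
                    "Worker": {"replicas": 2, "template": pod_template},
                },
            },
        }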

@mak-454

mak-454 commented Oct 6, 2019

@jlewi will do as recommended. Thanks.

@xiaogaozi

It looks like mounting a volume for a TFJob requires the volume to be ReadWriteMany (all pods share the same volume), so something like AWS EBS cannot be used in this case (it's ReadWriteOnce). It would be helpful if the TFJob operator could support mounting ReadWriteOnce volumes.
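
For reference, a minimal sketch of a ReadWriteMany claim that all replicas could share, using the Kubernetes Python client; the storage class name here is an assumption and depends on what the cluster provides (e.g. NFS or EFS):

    from kubernetes import client

    shared_pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="tfjob-shared-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],  # EBS only supports ReadWriteOnce
            storage_class_name="nfs",        # assumption: an RWX-capable storage class exists
            resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
        ),
    )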

@stale

stale bot commented Jun 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale (The issue / pull request is stale, any activities remove this label.) label Jun 25, 2020
@stale

stale bot commented Jul 2, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this as completed Jul 2, 2020
@pretidav

pretidav commented Nov 9, 2020

/reopen

@k8s-ci-robot
Contributor

@pretidav: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amygdala
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Nov 10, 2020
@k8s-ci-robot
Contributor

@amygdala: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stale stale bot removed the lifecycle/stale (The issue / pull request is stale, any activities remove this label.) label Nov 10, 2020
@pretidav

Dear all, in the end is it possible to use the tf_job_launcher within a pipeline and retrieve the files produced by such a job (i.e. multiple outputs, not just a JSON with scores like the Katib launcher returns)?
E.g. when launching model training within a TFJob, I would like to be able to retrieve scores, models, and additional outputs from the training to feed into subsequent components of the pipeline.

@jlewi
Contributor Author

jlewi commented Nov 14, 2020

@pretidav see my previous comment
#677 (comment)

It doesn't look like tf_job_launcher.py is being maintained.
https://github.com/kubeflow/pipelines/commits/master/components/kubeflow/launcher

@kubeflow/wg-training-leads What do you want to do about the TFJob launcher? @hougangliu, are you still maintaining the TFJob launcher per the OWNERS file?
https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/launcher/OWNERS

@pretidav

Sorry @hougangliu, I noticed that in your e2e example (https://github.com/kubeflow/pipelines/blob/master/samples/contrib/e2e-mnist/mnist-pipeline.ipynb) you take advantage of the Katib launcher but not the TFJob launcher.
Would you suggest doing the same, or using both launchers?
(Regarding that example, I have another issue, #4770, if you want to have a look.)

@Jeffwan
Member

Jeffwan commented Nov 17, 2020

I think the plan is to add more native launch support for different training operators.

This is tracked below, but we don't have the capacity to work on it in the near term:

https://github.com/kubeflow/common/blob/master/ROADMAP.md
#3445

@jlewi
Contributor Author

jlewi commented Nov 20, 2020

I'm closing this issue. @pretidav If you still have questions please open new more specific issues.

@jlewi jlewi closed this as completed Nov 20, 2020
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024
* convert loop and lightweight example to component yaml

* Update python values to float