
TFJob should work well with pipelines #677

Closed
jlewi opened this issue Jan 14, 2019 · 25 comments

@jlewi
Contributor

jlewi commented Jan 14, 2019

I'm creating this uber issue to track issues related to making pipelines work well with TFJob. Hopefully this can serve as a model for how to make pipelines work well with other custom resources.

Here's a list of known issues
#407 TFJob doesn't forward error logs from the jobs
#408 TFJob doesn't stop trainer jobs after timeout
#218 provide argument to assign GCP service account

In addition, I think there are a few other issues:

  • The launcher is getting credentials in a GKE-specific way

    • This seems unnecessary; if this code is running in a pod whose service account has suitable RBAC permissions, it should be able to authenticate to the master in a way that is independent of the cloud provider
  • The launcher doesn't seem to be resilient to pod restarts

    • The launcher is waiting for the job to finish
    • Jobs could be long-running, so it's possible the pipelines pod will get preempted while waiting for the TFJob to finish
    • It looks like the launcher is using generateName to assign a name to the job.
    • So if the pod gets preempted, it will restart and launch another instance of the job under a new unique name (see the sketch after this list).
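
For illustration, here is a minimal sketch of the deterministic-name check described above. It assumes the official Kubernetes Python client and the TFJob custom resource under group kubeflow.org, version v1; the function name and manifest are placeholders, not the launcher's actual code.

    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    def create_tfjob_if_absent(namespace, tfjob_manifest):
        """Create the TFJob only if one with the same (deterministic) name does not exist yet."""
        config.load_incluster_config()  # assumes the launcher runs in a pod with suitable RBAC
        api = client.CustomObjectsApi()
        name = tfjob_manifest["metadata"]["name"]  # a deterministic name, not generateName
        try:
            # If the pod was restarted, the job may already be running; reuse it.
            return api.get_namespaced_custom_object(
                group="kubeflow.org", version="v1", namespace=namespace,
                plural="tfjobs", name=name)
        except ApiException as e:
            if e.status != 404:
                raise
        return api.create_namespaced_custom_object(
            group="kubeflow.org", version="v1", namespace=namespace,
            plural="tfjobs", body=tfjob_manifest)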
@jlewi
Contributor Author

jlewi commented Jan 14, 2019

Marking this as P1 for 0.5 because it would be great to have a canonical example of how to integrate with custom resources.

@jlewi jlewi added the help wanted (The community is welcome to contribute.) label Jan 22, 2019
@jlewi
Contributor Author

jlewi commented Jan 23, 2019

Here are some thoughts on how this should work.

We should make the following changes to https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/launcher/src/launch_tf_job.py#L112

  • Delete the references to gcloud that are used to get credentials.

  • For credentials, I think we should support two methods:

    1. If the KUBE_CONFIG environment variable is set, just call the K8s client library method to load it
    2. Otherwise, assume we are running in-cluster and use the pod's service account
  • The job name should be deterministic

    • This way, if the pod restarts, we can check whether a job with that name already exists
  • We should move the logic inside "main" into a function e.g.

    def train_my_model(input, output):
      ...
    
  • We should use func_to_container_op to turn this into a pipeline op (see the sketch after this list)

    • We can create a suitable Dockerfile and image to use as the base image
    • I don't really think we need 1 Docker image per K8s resource; that seems verbose
    • Most of the required dependencies (e.g. the K8s client library) will be the same regardless of the K8s resource
    • We can create generic utility functions like wait_for_condition that can be reused across resources
    • We might also want to include tools like ksonnet and helm so that people can easily reuse their
      existing templates.
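
As a rough sketch of the two credential paths and the wrapping step (assuming kfp's func_to_container_op; the training function body and the base image below are placeholders):

    import os

    import kfp.components as comp
    from kubernetes import config

    def load_kube_config():
        """Prefer an explicit kubeconfig; otherwise fall back to the pod's service account."""
        kube_config = os.environ.get("KUBE_CONFIG")
        if kube_config:
            config.load_kube_config(config_file=kube_config)
        else:
            config.load_incluster_config()

    def train_my_model(input_path: str, output_path: str):
        # Placeholder body: load the config, create the TFJob under a deterministic
        # name, then wait for a terminal condition (Succeeded / Failed).
        ...

    # The base image (a placeholder here) bundles the shared dependencies, e.g. the
    # K8s client library and generic helpers such as wait_for_condition.
    train_op = comp.func_to_container_op(
        train_my_model, base_image="gcr.io/my-project/kfp-launcher-base:latest")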

A good way to prototype this might be to add a pipeline to the mnist example
https://github.com/kubeflow/examples/tree/master/mnist
rather than focusing on updating the existing TFJob component.

See also: #29

@amygdala
Contributor

amygdala commented Feb 4, 2019

On this point:

The launcher is getting credentials in a GKE-specific way

I don't think this is necessary any more (IIRC, it was used prior to using the tf-job client directly). When I removed it from my own similar code, things worked.

(However, on a related note, adding .apply(gcp.use_gcp_secret('user-gcp-sa')) to the launcher step seems to break things, as that apparently does not allow creating cluster resources. Not sure if that's a bug exactly but it seems problematic. #705 )

@randxie

randxie commented Mar 6, 2019

Adding another issue: when using TPUs with TFJob, I found that the TFJob does not reuse the allocated Cloud TPU if the pod fails and gets restarted. Instead, it tries to create a new TPU instance.

@jlewi
Contributor Author

jlewi commented Mar 11, 2019

@randxie Can you file a separate issue regarding TFJob and TPU in kubeflow/tf-operator? I don't think that has anything to do with TFJob and pipelines.

@jlewi
Contributor Author

jlewi commented Mar 11, 2019

/cc @hougangliu since he just added a component for StudyJob.

@jessiezcc
Contributor

@jlewi It looks like most of these issues are already closed; what work relevant to pipelines remains open?

@jlewi
Contributor Author

jlewi commented Jul 15, 2019

@jessiezcc It looks like the pipeline launcher isn't resilient to the pod failing; see my initial comment on this issue. I believe that's still a problem.

@cyliu0204

Hi all,
After reading the comments above, I still cannot figure out how to run distributed training with TensorFlow or PyTorch from pipelines. How is this issue progressing?

@mak-454

mak-454 commented Oct 2, 2019

Hi, if the launcher is to be used for TFJob or any other custom resource, what is the plan for handling volumes? If volumes are used with ContainerOp, they will be attached to the launcher and not to the training pods launched by the TFJob.

@mak-454

mak-454 commented Oct 2, 2019

I gather from the discussions in
#801 (comment)
#1345

that I should use ResourceOp for TFJob with volumes until PR #1494 is merged.

@jlewi
Contributor Author

jlewi commented Oct 5, 2019

@mak-454 The current recommendation would be to write some Python code to create the TFJob and then execute that code as a step in your graph.

Your Python function can attach any volumes needed to the TFJob.
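
For illustration, a minimal sketch of such a function, building a TFJob spec with a volume attached to the training pods rather than to the launcher (the PVC name, image, mount path, and replica count are placeholders, assuming the TFJob v1 schema):

    def tfjob_with_volume(name, namespace, image, pvc_name):
        """Return a TFJob manifest whose worker pods mount a shared PVC at /data."""
        container = {
            "name": "tensorflow",  # the container name the TFJob operator expects
            "image": image,
            "volumeMounts": [{"name": "training-data", "mountPath": "/data"}],
        }
        pod_template = {
            "spec": {
                "containers": [container],
                "volumes": [{
                    "name": "training-data",
                    "persistentVolumeClaim": {"claimName": pvc_name},
                }],
            }
        }
        return {
            "apiVersion": "kubeflow.org/v1",
            "kind": "TFJob",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {
                "tfReplicaSpecs": {
                    "Worker": {"replicas": 2, "template": pod_template},
                },
            },
        }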

@mak-454

mak-454 commented Oct 6, 2019

@jlewi will do as recommended. Thanks.

@xiaogaozi

It looks like mounting a volume for a TFJob requires the volume to be ReadWriteMany (all pods share the same volume), so something like AWS EBS cannot be used in this case (it's ReadWriteOnce). It would be helpful if the TFJob operator could support mounting ReadWriteOnce volumes.
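
For reference, a minimal sketch of a ReadWriteMany claim that all replicas could share, using the Kubernetes Python client; the storage class name here is an assumption and depends on what the cluster provides (e.g. NFS or EFS):

    from kubernetes import client

    shared_pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="tfjob-shared-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],  # EBS only supports ReadWriteOnce
            storage_class_name="nfs",        # assumption: an RWX-capable storage class exists
            resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
        ),
    )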

@stale

stale bot commented Jun 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale (The issue / pull request is stale, any activities remove this label.) label Jun 25, 2020
@stale

stale bot commented Jul 2, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this as completed Jul 2, 2020
@pretidav

pretidav commented Nov 9, 2020

/reopen

@k8s-ci-robot
Contributor

@pretidav: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amygdala
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Nov 10, 2020
@k8s-ci-robot
Contributor

@amygdala: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stale stale bot removed the lifecycle/stale (The issue / pull request is stale, any activities remove this label.) label Nov 10, 2020
@pretidav

Dear all, in the end is it possible to use the tf_job_launcher within a pipeline and retrieve the files produced by such a job (i.e. multiple outputs, not just a JSON with scores like the Katib launcher returns)?
E.g. when launching model training within a TFJob, I would like to be able to retrieve scores, models, and additional outputs from the training to feed into subsequent components of the pipeline.

@jlewi
Contributor Author

jlewi commented Nov 14, 2020

@pretidav see my previous comment
#677 (comment)

It doesn't look like tf_job_launcher.py is being maintained.
https://github.com/kubeflow/pipelines/commits/master/components/kubeflow/launcher

@kubeflow/wg-training-leads What do you want to do about the TFJob launcher? @hougangliu, are you still maintaining the TFJob launcher per the OWNERS file?
https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/launcher/OWNERS

@pretidav

Sorry @hougangliu, I noticed that in your e2e example (https://github.com/kubeflow/pipelines/blob/master/samples/contrib/e2e-mnist/mnist-pipeline.ipynb) you take advantage of the Katib launcher but not the TFJob launcher.
Would you suggest doing the same, or using both launchers?
(Regarding that example, I have another issue, #4770, if you want to have a look.)

@Jeffwan
Member

Jeffwan commented Nov 17, 2020

I think the plan is to add more native launch support for different training operators.

This is tracked below, but we don't have the capacity to work on it in the near term:

https://github.com/kubeflow/common/blob/master/ROADMAP.md
#3445

@jlewi
Contributor Author

jlewi commented Nov 20, 2020

I'm closing this issue. @pretidav If you still have questions please open new more specific issues.

@jlewi jlewi closed this as completed Nov 20, 2020
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024
* convert loop and lightweight example to component yaml

* Update python values to float