TFJob should work well with pipelines #677
Marking this as P1 for 0.5 because it would be great to have a canonical example of how to integrate with custom resources.
Here are some thoughts on how this should work. We should make the following changes to https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/launcher/src/launch_tf_job.py#L112
A good way to prototype this might be to add a pipeline to the mnist example. See also: #29
On this point:
I don't think this is necessary anymore (IIRC, it was used prior to using the tf-job client directly). When I removed it from my own similar code, things worked. (However, on a related note, adding …)
Adding another issue: when using TPU with TFJob, I found that the TFJob does not reuse the already-allocated Cloud TPU if the pod fails and gets restarted; instead, it tries to create a new TPU instance.
@randxie Can you file a separate issue regarding TFJob and TPU in kubeflow/tf-operator? I don't think that has anything to do with TFJob and pipelines.
/cc @hougangliu since he just added a component for StudyJob.
@jlewi It looks like most of the issues are already closed; what pipeline-related work remains open?
@jessiezcc It looks like the pipeline launcher isn't resilient to the pod failing; see my initial comment on this issue. I believe that's still a problem.
Hi all:
Hi, if the launcher is to be used for TFJob or any other custom resource, what is the plan for handling volumes? If volumes are used with a ContainerOp, they will be attached to the launcher pod and not to the training pods launched by the TFJob.
I gather from the discussion below that I should use ResourceOp for TFJob with volumes until PR #1494 is merged.
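For reference, here is a minimal sketch of that ResourceOp approach using the KFP v1 DSL. The manifest, trainer image, PVC name, and success/failure conditions are illustrative placeholders, not a tested configuration:

```python
import kfp.dsl as dsl

# Illustrative TFJob manifest: the trainer image and the "train-data-pvc" claim
# are placeholders. The volume is declared in the TFJob spec itself, so it gets
# mounted on the training pods rather than on a launcher container.
TFJOB_MANIFEST = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"generateName": "train-"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {
                    "containers": [{
                        "name": "tensorflow",
                        "image": "gcr.io/my-project/trainer:latest",
                        "volumeMounts": [{"name": "train-data", "mountPath": "/data"}],
                    }],
                    "volumes": [{
                        "name": "train-data",
                        "persistentVolumeClaim": {"claimName": "train-data-pvc"},
                    }],
                }},
            }
        }
    },
}


@dsl.pipeline(name="tfjob-resourceop-sketch")
def tfjob_pipeline():
    # ResourceOp submits the raw manifest and waits for the given conditions.
    dsl.ResourceOp(
        name="train",
        k8s_resource=TFJOB_MANIFEST,
        action="create",
        success_condition="status.replicaStatuses.Worker.succeeded > 0",
        failure_condition="status.replicaStatuses.Worker.failed > 0",
    )
```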
@mak-454 The current recommendation would be to write some Python code to create the TFJob and then execute that code as a step in your graph. Your Python function can attach any volumes needed to the TFJob.
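A minimal sketch of that recommendation, assuming the step's container image has the `kubernetes` Python client installed; the namespace, trainer image, and PVC name below are placeholders:

```python
from kubernetes import client, config


def launch_tfjob(name: str, namespace: str = "kubeflow") -> None:
    """Create a TFJob whose training pods mount a persistent volume."""
    config.load_incluster_config()  # the step runs inside the cluster
    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"tfReplicaSpecs": {"Worker": {
            "replicas": 2,
            "restartPolicy": "OnFailure",
            "template": {"spec": {
                "containers": [{
                    "name": "tensorflow",
                    "image": "gcr.io/my-project/trainer:latest",  # placeholder image
                    "volumeMounts": [{"name": "train-data", "mountPath": "/data"}],
                }],
                "volumes": [{
                    "name": "train-data",
                    "persistentVolumeClaim": {"claimName": "train-data-pvc"},  # placeholder PVC
                }],
            }},
        }}},
    }
    # TFJob is a custom resource, so it is created through the CustomObjects API.
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace=namespace,
        plural="tfjobs", body=tfjob,
    )
```

Such a function can then be turned into a pipeline step, for example with `kfp.components.func_to_container_op`, or baked into its own container image.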
@jlewi Will do as recommended. Thanks.
Looks like mounting a volume for a TFJob requires that the volume type is …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@pretidav: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@amygdala: Reopened this issue.
Dear all, in the end, is it possible to use the tf_job_launcher within a pipeline and retrieve files produced by such a job (i.e., multiple outputs, not just a JSON with scores like the Katib launcher returns)?
@pretidav See my previous comment. It doesn't look like tf_job_launcher.py is being maintained. @kubeflow/wg-training-leads What do you want to do about the TFJob launcher? @hougangliu Are you still maintaining the TFJob launcher per the OWNERS file?
Sorry @hougangliu, I noticed that in your e2e example (https://github.com/kubeflow/pipelines/blob/master/samples/contrib/e2e-mnist/mnist-pipeline.ipynb) you take advantage of the Katib launcher but not the TFJob launcher.
I think the plan is to add more native launch support for the different training operators, but we don't have the capacity to work on it at the moment. It is tracked here: https://github.com/kubeflow/common/blob/master/ROADMAP.md
I'm closing this issue. @pretidav If you still have questions, please open new, more specific issues.
I'm creating this uber issue to track issues related to making pipelines work well with TFJob. Hopefully this can serve as a model for how to make pipelines work well with other custom resources.
Here's a list of known issues:
#407 TFJob doesn't forward error logs from the jobs
#408 TFJob doesn't stop trainer jobs after timeout
#218 provide argument to assign GCP service account
In addition, I think there are a few other issues:
The launcher is getting credentials in a GKE-specific way
The launcher doesn't seem to be resilient to pod restarts (a sketch of an idempotent launch is below)
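On the pod-restart point, here is a rough sketch of what an idempotent launch could look like, assuming the Kubernetes Python client; the TFJob group/version/plural match the CRD, but the polling interval and status handling are illustrative:

```python
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

GROUP, VERSION, PLURAL = "kubeflow.org", "v1", "tfjobs"


def launch_or_resume(tfjob_body: dict, namespace: str = "kubeflow") -> dict:
    """Create the TFJob if it doesn't exist yet, then wait for it to finish."""
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    name = tfjob_body["metadata"]["name"]
    try:
        api.create_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, tfjob_body)
    except ApiException as e:
        if e.status != 409:  # 409 Conflict means the TFJob already exists
            raise
        # A restarted launcher pod lands here and simply resumes watching.
    while True:
        job = api.get_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, name)
        for cond in job.get("status", {}).get("conditions", []):
            if cond["type"] in ("Succeeded", "Failed") and cond["status"] == "True":
                return job
        time.sleep(30)
```

Treating the 409 Conflict as "already created" lets a restarted launcher pod re-attach to the existing TFJob instead of failing outright or submitting a duplicate job.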