Support Argo Resource Template in DSL #429
Comments
Yes, we can. The question is how we want to expose them and how we can gather output. A custom resource is not container specific, so there is no output file mapping. The only thing Argo supports is querying the job status (with a jq filter or JSON path, see https://github.com/argoproj/argo/blob/master/examples/k8s-jobs.yaml) and extracting the output from it. Not all K8s CRDs report output that way (including tf-job), so it is hard to pass the output of such a step to downstream components.

I would like to support two levels of K8sOp. The low level takes an arbitrary resource: op = K8sOp(resource_spec, jsonPath='...'). We should also provide resource-specific high-level ops: op = TfJobOp(container_image='...', ...). This was deferred since TFJob does not output anything in its job status. In theory we could customize how tf-job runs the container and then inject the output into the job status, but that is non-trivial.
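For concreteness, here is a minimal sketch of the low-level "arbitrary resource" idea, written against kfp.dsl.ResourceOp from the v1 SDK rather than the hypothetical K8sOp name above; the Job manifest, image, success condition, and JSON paths are purely illustrative:

```python
from kfp import dsl

# An arbitrary K8s resource spec (here a plain batch/v1 Job), inlined as a dict.
job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"generateName": "train-"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "trainer",
                    "image": "gcr.io/my-project/trainer:latest",  # illustrative image
                }],
                "restartPolicy": "Never",
            }
        }
    },
}

@dsl.pipeline(name="resource-op-example")
def pipeline():
    # Create the resource, wait for the success condition, and expose pieces
    # of its status as step outputs via JSON paths (the Argo resource-template model).
    train = dsl.ResourceOp(
        name="train",
        k8s_resource=job_manifest,
        action="create",
        success_condition="status.succeeded > 0",
        attribute_outputs={"succeeded": "{.status.succeeded}"},
    )
```

A resource-specific high-level op like the proposed TfJobOp would then just be a thin wrapper that builds the manifest from a few arguments and returns such a resource op.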
@jlewi Shall we instead fix tf-job to report its results in the K8s job status? I feel that K8sOp itself is still very useful even with these limitations.
@hongye-sun I'm not sure I understand the question. TFJob already reports its status using conditions.
@jlewi my question is based on Qiming's comment on tfjob's status. The goal, I think, is to pass the output model data from tfjob to subsequent steps in the pipeline. I am not familiar with tfjob. Is there an easy way today for a user to use the Argo resource template feature and pass the model data to the next step?
TFJob doesn't support passing the model on to the next step. It manages the lifecycle of the training alone.
@qimingj Why do you need to customize anything about TFJob and the K8s Job? These are existing resources that don't have any explicit notion of inputs/outputs. Users use environment variables and command-line arguments to pass information to their code. It's up to the code to define what the inputs/outputs are; TFJob and the K8s Job don't try to impose structure on the code by forcing it to conform to some standard for inputs/outputs.

Can pipelines easily orchestrate K8s resources? E.g. can I specify a step in my pipeline that contains the spec of some K8s resource? I think the desired semantics would be something like this:
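(A minimal sketch of those semantics, assuming kfp.dsl.ResourceOp; the TFJob group/version, images, and paths are only illustrative:)

```python
from kfp import dsl

tfjob_manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"generateName": "mnist-"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",
                            "image": "gcr.io/my-project/mnist-train:latest",
                            "args": ["--output", "gs://my-bucket/model"],
                        }],
                    }
                },
            }
        }
    },
}

@dsl.pipeline(name="orchestrate-k8s-resources")
def pipeline():
    # A pipeline step that *is* a K8s resource spec.
    train = dsl.ResourceOp(name="train", k8s_resource=tfjob_manifest, action="create")

    # A normal container step that runs after the resource step and reads the
    # location the trainer was told to write to.
    deploy = dsl.ContainerOp(
        name="deploy",
        image="gcr.io/my-project/deployer:latest",
        arguments=["--model-dir", "gs://my-bucket/model"],
    ).after(train)
```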
From a DSL perspective, I'd expect a convenient Python library for generating the K8s resource, e.g.:
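(Sketching that idea with the existing kubernetes Python client models; the image and argument values are placeholders:)

```python
from kubernetes import client

# Build the Job spec with the auto-generated client models instead of raw dicts.
container = client.V1Container(
    name="trainer",
    image="gcr.io/my-project/trainer:latest",
    args=["--input", "gs://my-bucket/data", "--output", "gs://my-bucket/model"],
)
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(generate_name="train-"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    ),
)
```

These models are auto-generated from the upstream Kubernetes OpenAPI spec, which is essentially the "convenient Python library" part of the ask.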
Ideally the python client libraries would be autogenerated from the resource specs. I don't know what the current state is of the requisite K8s python client library tooling.
The DSL today already requires the K8s Python client library, and accepting K8s resource specs is a great idea. You are right that it's up to the code to define the inputs/outputs. I think the gap (not just for tf-job, but for K8s in general) is that we need a way for a component to communicate back to the pipeline system: for example, a runtime value (such as training accuracy), some UI metadata (so the pipeline UI can visualize it), or some artifacts (materialized data), etc. Some of these values may be known before the pipeline runs (e.g. a model dir), but some are not (e.g. accuracy). ... It is possible that "trainer.model_path" is known beforehand, but certainly not "trainer.accuracy".

Argo provides a way for this type of communication: if the results are included in the K8s job status, then Argo can parse it and extract the values. I was thinking that if TFJob were more "Argo-friendly" by inserting results into its job status, it would put such K8s specs on par with containers. Or, like you said, we can declare support for arbitrary K8s jobs, but these jobs cannot output any runtime values, cannot create any visualizations, etc., until we figure out something else.
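(To illustrate the point: if a controller did write results into the resource's status, say a hypothetical status.results.accuracy field, the value could be surfaced as a step output with a JSON path, reusing the tfjob_manifest dict from the sketch above:)

```python
from kfp import dsl

train = dsl.ResourceOp(
    name="train",
    k8s_resource=tfjob_manifest,  # the TFJob spec from the earlier sketch
    action="create",
    attribute_outputs={
        # Hypothetical status field: this only works if the controller actually writes it.
        "accuracy": "{.status.results.accuracy}",
    },
)
# train.outputs["accuracy"] could then be passed to a downstream step.
```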
Why do we need a way to communicate back to pipelines? Let's suppose we have a TensorFlow program that takes two arguments, --input and --output, and now we want to train it using TFJob or a K8s Job. Why can't we do something like the following:
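(A guess at the shape of such a launcher, assuming the kubernetes Python client and a TFJob CRD at kubeflow.org/v1; the group/version, namespace, and image are illustrative:)

```python
from kubernetes import client, config

def train(input_path: str, output_path: str) -> str:
    """Launch a TFJob that runs the trainer with --input/--output, return the output path."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster

    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"generateName": "train-"},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": 1,
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": "gcr.io/my-project/trainer:latest",
                                "args": ["--input", input_path, "--output", output_path],
                            }],
                        }
                    },
                }
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace="kubeflow",
        plural="tfjobs", body=tfjob,
    )
    # A real launcher would also watch the TFJob until it finishes.
    return output_path
```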
Now just use func_to_container_op to turn this into an op for the DSL.
I was looking into Argo Events and this seems like a good fit; I'm currently experimenting with monitoring TFJob resources using it. As for why we need a way to communicate back to pipelines, ...
@hongye-sun, should we close? Are you expecting more work on this issue?
Hi everyone, it looks like this issue is covered by #926. So, pinging @hongye-sun: do you think this is covered and we should close it?
Yes, it has been covered. Thanks.
Original issue description (copied from: #415)
Related question: how can the DSL be easily extended to orchestrate custom Kubernetes resources? It would be nice if I could just "inline" custom resources (similar to what Argo supports, argoproj/argo-workflows#606).
I think the desired semantics would be something like:
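(A minimal sketch of that "inline a custom resource" idea, assuming kfp.dsl.ResourceOp and PyYAML; the manifest is a placeholder:)

```python
import yaml
from kfp import dsl

# The custom resource is inlined directly in the pipeline definition as YAML.
INLINE_MANIFEST = """
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: inline-train-
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/my-project/trainer:latest
"""

@dsl.pipeline(name="inline-custom-resource")
def pipeline():
    dsl.ResourceOp(
        name="train",
        k8s_resource=yaml.safe_load(INLINE_MANIFEST),
        action="create",
    )
```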