[feature] Reconsider the design of Trial Template #906
/priority p0 /cc @johnugeorge @hougangliu @richardsliu We can discuss here. |
Based on the proposal in #857 (comment), I refactored the struct as below:
|
Should we support structured templates, e.g. TFJob and PyTorchJob? |
@hougangliu Can you add examples of what a template would look like? |
Here I list the
In summary, |
For Job-type trials, do we actually need anything more than "image" and "command"?
So I think we can cut all of this down to just:
For distributed jobs it can be more complicated, but still most of the spec can be auto-generated:
What do you think? |
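For illustration, a cut-down, job-only trial spec along those lines might look like the sketch below; the field names and placeholder syntax are hypothetical, not an existing Katib API:

```yaml
# Hypothetical minimal trial template: only the image, the command and the
# hyperparameter placeholders are user-supplied; the rest of the Job spec
# would be auto-generated by Katib.
trialTemplate:
  image: example/mnist-train:latest        # hypothetical training image
  command:
    - python3
    - /opt/train.py
    - --lr=${learningRate}                 # placeholder substituted per trial
    - --batch-size=${batchSize}
```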
Prefer not to have our own spec with image and command. If users want to set resource requirements, our abstraction will limit this scenario. |
@richardsliu I was earlier thinking along the same lines as you suggested. But the biggest problem I see is that Katib would need framework-specific logic to generate the corresponding job type, e.g. generating a TFJob/PytorchJob template. |
@johnugeorge Then can we do something like:
Would this work? I think this would address @jlewi 's concerns in #857. |
@richardsliu Can you explain a bit more about 1)? |
I was thinking along the lines of this answer: https://stackoverflow.com/a/9577670 Suppose we could somehow include the path to an existing Job/TFJob/PytorchJob definition (in valid YAML) as part of the trial template. We can then add another configmap for the hyperparameters to substitute. This means we can preserve all the fields in the original YAML templates, without cluttering the experiment definition. We can also generate the full experiment YAML programmatically. |
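A rough sketch of that idea (all names and fields here are illustrative, not an implemented API): the Experiment references an existing, fully valid job definition, and a separate ConfigMap supplies the hyperparameters to substitute:

```yaml
# Hypothetical: the hyperparameters to substitute live in their own ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: trial-hyperparameters
data:
  learningRate: "0.01"
  batchSize: "64"
---
# Hypothetical fragment of the Experiment: point at an existing, valid
# Job/TFJob/PytorchJob definition instead of inlining it.
trialTemplate:
  templateRef:
    path: templates/tfjob-trial.yaml       # unmodified job YAML, all fields preserved
  parametersFrom:
    configMapRef:
      name: trial-hyperparameters
```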
Tekton might be a good reference point for how to store templates and parameterize them. |
@jlewi Thanks for the suggestion. Do you mean this way?

```yaml
spec:
  resources:
    - name: source-repo
      resourceSpec:
        type: git
        params:
          - name: revision
            value: v0.32.0
          - name: url
            value: https://github.com/GoogleContainerTools/skaffold
    - name: web-image
      resourceSpec:
        type: image
        params:
          - name: url
            value: gcr.io/christiewilson-catfactory/leeroy-web
    - name: app-image
      resourceSpec:
        type: image
        params:
          - name: url
            value: gcr.io/christiewilson-catfactory/leeroy-app
```
|
@gaocegege Yes; to be clear, I'm suggesting we take that as inspiration, not copy it exactly. |
Gotcha. Thanks |
To continue what @richardsliu proposed, my suggestion is this: from Job, TFJob or PytorchJob we just need the Spec field, where the user can specify containers, resources, etc., as @gaocegege mentioned above. The API can be like this:
JobKind will be either Job, TFJob or PytorchJob, and we can run the corresponding job depending on the JobKind value. |
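If I understand the proposal, an Experiment using that API might look roughly like this (the exact field names are a guess):

```yaml
# Hypothetical sketch of the JobKind-based trial template.
trialTemplate:
  jobKind: TFJob                           # one of: Job, TFJob, PytorchJob
  jobSpec:                                 # the operator's own Spec, reused as-is
    tfReplicaSpecs:
      Worker:
        replicas: 2
        template:
          spec:
            containers:
              - name: tensorflow
                image: example/tf-mnist:latest   # hypothetical image
                command: ["python", "/opt/model.py"]
```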
LGTM, but I'd prefer a pointer here.
|
One problem I have is that we would need a Katib API upgrade to support any new Spec. |
The API stays compatible since we only add some new kinds. I think the shortcoming is acceptable. |
@gaocegege @johnugeorge Yes, we can update TrialTemplate with the new fields and remove or keep
For me, personally, it is better to change |
I am not sure if we should make it a spec, since the command in the spec is still a template like:
|
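For reference, the current goTemplate-style trial template looks roughly like this (reproduced approximately from the v1alpha3 examples, so details may differ); the hyperparameters are expanded into the command via Go template directives:

```yaml
trialTemplate:
  goTemplate:
    rawTemplate: |-
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: {{.Trial}}
        namespace: {{.NameSpace}}
      spec:
        template:
          spec:
            containers:
            - name: {{.Trial}}
              image: docker.io/kubeflowkatib/mxnet-mnist
              command:
              - "python3"
              - "/opt/mxnet-mnist/mnist.py"
              {{- with .HyperParameters}}
              {{- range .}}
              - "{{.Name}}={{.Value}}"
              {{- end}}
              {{- end}}
            restartPolicy: Never
```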
@gaocegege After we implement JobSpec, do you think we still need to add a Go template in the command?
|
We'd better not restrict hyperparameters to being passed to the job only via args; in real use cases, the user is likely to get them from environment variables. |
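For example, instead of appending args, the generated container could receive the hyperparameters as environment variables, roughly like this (fragment shown in isolation, image name hypothetical):

```yaml
# Fragment of a generated pod spec (illustrative only): hyperparameters
# injected as environment variables rather than command-line args.
containers:
  - name: training
    image: example/training:latest
    env:
      - name: LEARNING_RATE
        value: "0.01"
      - name: BATCH_SIZE
        value: "64"
```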
We can add additional field |
TODO: Share the design of Tekton /assign @gaocegege |
Shouldn't the question be whether a Tekton task is the right solution? If the API isn't right, then I think it makes sense to design a Katib-specific resource, but I don't see why we would take on the cost of building another CR if one already exists.

Backing up a step: what is the problem this issue is trying to solve? You are trying to define a template resource that would allow Katib to stamp out new resources with specific inputs and obtain the outputs. One of the key design questions appears to be how to handle special resources (e.g. TFJob, PyTorch, etc.).

All the proposals in this thread appear to go with a design where TrialTemplate is a union of supported types. This doesn't seem very scalable. Furthermore, how does this solve the problem of reporting outputs back to Katib?

An alternative approach would be to support a single resource (like a Tekton task) that just runs a container that must conform to a specific API; e.g. all outputs must be written to "/outputs/results.yaml". You can now support any resource (e.g. PyTorch, TFJob, Notebook, etc.) just by writing a Task/container that acts as a bridge, i.e. submits the resource, waits for it to complete, and then extracts the metrics and reports them in the format required by Katib.

This is extensible. Anyone can write a Task and publish it without requiring any changes to Katib. This allows Katib to focus on orchestrating the hyperparameter tuning and not worry about supporting individual resources. Reusing an existing resource (e.g. Tekton Task), as opposed to introducing a new one, means that Katib can leverage and reuse Tekton tasks created for other purposes and not force people to write new Tekton tasks. As an example, for our E2E tests we are creating a Tekton task to run notebooks. So if Katib supports a Tekton task, then a user could reuse this Tekton task to easily add hyperparameter tuning for any notebook. Similarly, if we define a library of Tekton tasks to run TFJob, PyTorch, etc., then those could be reused with Tekton.
Katib is already installing multiple CRDs and resources. What difference does it make whether some of these resources are created as part of Katib or come from other projects? |
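A sketch of such a bridge, written as a Tekton Task (the image, commands and output path are hypothetical; only the Task/param/step structure is real Tekton API):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: tfjob-trial-bridge              # hypothetical bridge task
spec:
  params:
    - name: learning-rate
      type: string
  steps:
    - name: run-and-collect
      image: example/tfjob-bridge:latest   # hypothetical image that knows how to submit a TFJob
      script: |
        #!/bin/sh
        # Submit the TFJob with the given hyperparameter, wait for completion,
        # then write the metrics where Katib expects them (per the proposed contract).
        submit-tfjob --learning-rate=$(params.learning-rate) --wait
        cp /tmp/metrics.yaml /outputs/results.yaml
```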
@jlewi Thanks for your comment. It is helpful. I will reply after a deep dive. |
My thoughts after reading through @jlewi's comment:
According to my understanding, we are designing a templated subresource that has these requirements:
And our goals include:
From the discussion above, it seems like our options are:
Let’s look at these options one by one.
Option 1: Use a goTemplate to define an arbitrary subresource. This is what Katib is currently doing. Let’s see how it addresses the requirements above:
And in terms of the goals:
Option 2: Enumerate the subresource types we want to support, so that their definitions can be strongly typed in the Experiment CRD.
For the subresource requirements:
And in terms of the goals:
Option 3: Support a single generic resource that conforms to a generic API, and allow the user to freely define how to inject input, report status and emit outputs.
Here, this subresource (which can be a Tekton Task) is a wrapper around the actual "Job" CRD that will be created and monitored. This subresource can be strongly typed to define the input hyperparameters, output metrics and status. Users can define how to handle input, output and monitoring in the subresource's custom image.
For the subresource requirements:
In terms of the goals:
Some more thoughts in support of option 3:
Actually, I have always believed hypertuning should act on an established resource (e.g. a registered Tekton Task), instead of defining it from scratch when starting an Experiment. Users will always try to implement and test the child job (e.g. a PytorchJob) and make sure they are satisfied with its behavior before they tell Katib "OK, I'm ready to hypertune this". Moreover, this established resource can be reused by other hypertuning Experiments; we can even factor in a tighter definition of its I/O model, using
With all of the above, I think we should support both option 2, for the ease of hypertuning known Kubeflow jobs, and option 3, for extensibility. The trial template will look like below:
|
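As an illustration of that option 2 + option 3 union (the field names here are guesses based on the discussion, not the exact structure that was posted):

```yaml
trialTemplate:
  # Exactly one of the following would be set (both shown here only for illustration).
  tfjob:                                   # option 2: strongly-typed Kubeflow job
    tfReplicaSpecs:
      Worker:
        replicas: 1
        template:
          spec:
            containers:
              - name: tensorflow
                image: example/tf-mnist:latest   # hypothetical image
  tektonTask:                              # option 3: generic, user-registered wrapper
    taskRef:
      name: my-hypertune-task              # hypothetical registered Tekton Task
```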
@czheng94 Thanks for your comment. I have one question about your proposal: Where does the user define the trial job? Is it in TrialTemplate.Tfjob? |
@gaocegege yes. And |
After a discussion about the Trial template, we came up with a few suggestions.
Using Tekton might be a problem in Katib. For Tekton tasks we would still need to define a specification for the different jobs (TFJob, MPIJob, PytorchJob), since they have different API structures. Also, using Tekton tasks as the Trial Template would require huge Katib controller changes. Another point: Tekton is not part of the Kubeflow project, which makes it harder to follow Tekton API changes, etc.
We can implement each job independently, as was mentioned above. KFServing also follows this approach: https://github.com/kubeflow/kfserving/blob/master/pkg/apis/serving/v1alpha2/inferenceservice_types.go#L93. This makes the user's life much easier. The user just needs to know how to define a TF/Pytorch job and to tell Katib which parameters need to be tuned.
Define
Define
The structure can be something like this:
Also, do we want to support reading the template from a ConfigMap? @johnugeorge @gaocegege Please add anything I missed. |
LGTM. I think we can come to the agreement. |
Thanks. Some questions (not blocking; please consider me just a bystander).
|
That's because the operator specifications have different structures. For example, compare the TF Operator (https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L47) and the MPI Operator (https://github.com/kubeflow/mpi-operator/blob/master/pkg/apis/kubeflow/v1/types.go#L40). I think we want to support all operator functionality in the Trial Template.
In that case we should again convert |
So there are two ways to support this:
I think the main advantage of having multiple Trial resources is that it's more scalable in terms of code ownership and decision making. For example, suppose I think Katib should have first-class support for Tekton Tasks or Argo workflows, but the Katib team thinks it's not right or doesn't have the cycles to implement it or even review the code contributions. In the multiple-CR world I can go and implement TektonTrialTemplate and ArgoTrialTemplate, and as long as they conform to the Katib Trial Template resource model (e.g. have spec.params and respect certain conditions), then ideally they should just work with Katib.
It looks like the issue might be that today you are inlining trialTemplate into the experiment example. How about adding support for references? E.g.
You would need an extensible mechanism to convert templates into actual instances. You could bake in support for the predefined types you want (TFJob, MPI, etc.). To make it extensible, you could do something like use an annotation to point at a lambda func(params, template) that returns a hydrated K8s object. |
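A reference-based Experiment along those lines might look something like this (trialTemplateRef and the annotation are illustrative, not existing fields):

```yaml
apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  name: tfjob-example
  annotations:
    # Hypothetical hook for an extensible template -> instance converter.
    katib.kubeflow.org/template-converter: tfjob-converter
spec:
  trialTemplateRef:                        # reference instead of an inlined template
    kind: TFJobTrialTemplate               # hypothetical template CR
    name: tfjob-mnist-template
    namespace: kubeflow
```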
Is it common in Kubernetes to use an API resource as a lambda function to support any kind of Operator/Job that follows the Katib Trial Template pattern?
Yes, because the Trial Template is part of the Experiment API. |
In this case, I am not sure how to define the spec of the TFJob. |
I think duck typing is a common pattern. Here's a doc from knative. I think the problem statement is a really good explanation of why it might be valuable for Katib. In particular, I would like to promote loose coupling between Katib and other bits of Kubeflow (e.g. the Job operators). I'm not aware of anyone using Lambdas. The model Tekton follows is for each resource to have 2 distinct resources:
We could follow the same pattern; e.g.
So Katib would create TFJobTrialRuns from TFJobTrialTemplate. This CR would in turn create TFJobs to execute the actual jobs and collect the status and outputs.
|
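A sketch of that Template/Run pair, with a hypothetical API group and CRDs:

```yaml
apiVersion: trial.katib.example/v1alpha1   # hypothetical API group
kind: TFJobTrialTemplate
metadata:
  name: tf-mnist-template
spec:
  params:                                  # declared inputs, per the spec.params convention above
    - name: learningRate
  tfReplicaSpecs: {}                       # the parameterized TFJob spec would go here
---
apiVersion: trial.katib.example/v1alpha1
kind: TFJobTrialRun
metadata:
  name: tf-mnist-trial-001
spec:
  templateRef:
    name: tf-mnist-template
  params:                                  # concrete values supplied by Katib per trial
    - name: learningRate
      value: "0.01"
```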
What are the advantages for Katib of creating a separate resource for each job, e.g. TFJobTrialRun for the TF Operator and PytorchJobTrialRun for the PyTorch Operator? Trial in Katib has a unique structure, and the difference between the various operators is only in RunSpec: https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/trials/v1alpha3/trial_types.go#L35. |
The advantage is to make Katib loosely coupled with the individual operators. Right now it is not loosely coupled; i.e. Katib needs to be explicitly updated to add support for each type of resource.
I'm not familiar enough with the specifics of your API to know where you separate things out, so maybe it's not Trial, maybe it's RunSpec. I'm a little confused because I don't see Experiment in trial_types.go, and I thought it was the Experiment, not the Trial, that specified the trial and the search space. |
So, looking at the current example: it looks like the custom resource is Experiment, which embeds TrialTemplate. If we follow that, I would suggest changing it to something like
The Experiment controller would then create TfJobTrialRuns from the TFJobTrialTemplate for each trial. I think this is more extensible because the Experiment controller can use duck typing to support whatever Templates/Runs a user has installed on their cluster. So, for example, if someone wants to use Argo Workflows, they could just implement ArgoTrialTemplate and ArgoTrialRun and install them on their cluster. |
That is because each Experiment CR includes the created Trial resources.
In that case, should these custom resources be installed in the cluster in advance, e.g. TfJobTrialRuns for TFJobs? Or am I misunderstanding something? |
@andreyvelich Do you want to add this to the agenda for an upcoming community meeting so we can discuss it in person?
Ok so TrialTemplate is a field in Experiment
Yes
The API for a TrialTemplate for TFJob looks like the following
The go type is
So the Spec is typed. I think what I'm proposing isn't that different from what you have today. Basically, I'm suggesting the following:
|
@jlewi Sure, let's discuss it at the Kubeflow meeting. |
I was checking the knative duck-typing implementation, and I think there can be one more problem with it. For example, to reach the container for a TFJob we need to go to
If you take a look at knative: https://github.com/knative/pkg/blob/master/apis/duck/v1/implements_test.go#L53, they use VerifyType only where we can directly apply changes to the field. The same story holds in Kubernetes (https://github.com/kubernetes/api/blob/e771f807/core/v1/types.go#L3179-L3190): PodTemplateSpec is the sub-resource for each k8s resource (ReplicaSet, Job, Deployment).
I was thinking about another way of supporting various resources. Maybe in the Experiment API we define:
In the Trial API we can just have ObjectMeta, instead of RunSpec (https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/trials/v1alpha3/trial_types.go#L35), like how it is done in the k8s API: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3590. Or we can have all
In k8s, Unstructured must have valid YAML and
Also, we will add TrialParamsSpec to the API, which @jlewi proposed here: #906 (comment). So we can:
I was testing Unstructured as an API type in a Kubernetes CRD; it works and we can parse it. For the Metrics Collector I believe it will also work, since we mutate the created Trial pod, not the Trial resource. This approach can help us support any type of resource in the Trial Template. This is an example of an Experiment:
What do you think @jlewi @johnugeorge @gaocegege ? |
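For reference, a sketch of what this Unstructured-based trial template could look like (this roughly matches the shape that was eventually merged, but the exact field and placeholder names here are recalled from memory and may differ):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example
spec:
  trialTemplate:
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr                      # name of the suggestion parameter to map in
    trialSpec:                             # arbitrary resource, stored as Unstructured
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist
                command:
                  - python3
                  - /opt/mxnet-mnist/mnist.py
                  - --lr=${trialParameters.learningRate}
            restartPolicy: Never
```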
LGTM. I prefer this approach. 🎉 |
Looks good |
Great to see the PR merged 🎉 !! This is so exciting. |
We can close this; we are using the new Trial Template now: #1208. |
/kind feature
Describe the solution you'd like
We need to marshal the TFJob to a JSON string and then use it to create Experiments if we are using the K8s client-go. That is not good. And the Go template is ugly, too.
Anything else you would like to add: