Add support for entire kubeflow pipelines as trial target (in addition to containers) #1914

Open
romeokienzler opened this issue Jul 8, 2022 · 18 comments

Comments

@romeokienzler

romeokienzler commented Jul 8, 2022

/kind feature

Describe the solution you'd like
Hyperparameters affect not only the training step but also upstream pipeline components such as feature transformation (e.g. the parameters of a normalization transformation). In addition, transformation and training steps should be able to make use of kfp's parallel components (e.g. SparkJob, TFJob, ...). It would be helpful to allow not only containers but also complete Kubeflow pipelines to be specified as trial targets. Since pipelines also expose parameters, these can either be set directly (non-hyperparameters) or added to the hyperparameter space.

Anything else you would like to add:
I've started to create a simple container image that can be used as a trial target. It acts as a proxy and triggers parameterized Kubeflow pipeline executions downstream with the respective hyperparameters. A Kubernetes Custom Resource can be created as well down the line.
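
The proxy idea can be sketched roughly as follows. This is a minimal illustration and not the actual image: the function names `parse_trial_args` and `run_proxy`, the `--name=value` argument convention, and the injected `submit` callable are all assumptions made for the example. In a real proxy, `submit` would be whatever starts a parameterized pipeline run.

```python
from typing import Callable, Dict, List


def parse_trial_args(argv: List[str]) -> Dict[str, str]:
    """Turn ['--lr=0.01', '--scale=minmax'] into {'lr': '0.01', 'scale': 'minmax'}."""
    params: Dict[str, str] = {}
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --name=value, got {arg!r}")
        name, value = arg[2:].split("=", 1)
        params[name] = value
    return params


def run_proxy(
    argv: List[str],
    fixed_params: Dict[str, str],
    submit: Callable[[Dict[str, str]], str],
) -> str:
    """Merge fixed (non-hyper) parameters with the trial's hyperparameters
    and hand the result to `submit` (hypothetical: starts one pipeline run)."""
    arguments = {**fixed_params, **parse_trial_args(argv)}
    return submit(arguments)
```

Here the hyperparameters suggested for one trial arrive as command-line arguments of the proxy container, while the non-hyperparameters are fixed once for the whole experiment.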


Love this feature? Give it a 👍 We prioritize the features with the most 👍

@romeokienzler
Author

Just dug in a little more. Actually, IMHO we need a new TrialTemplate. As Argo and Tekton are already supported, I'll go for one of those initially...

@skliarpawlo

skliarpawlo commented Sep 1, 2022

Hi @romeokienzler, I'm also very interested in this feature, and it's actually surprising that it isn't provided already, as it sounds natural. I have a question though: I see there is an example of integrating Argo Workflows with Katib here. Kubeflow pipelines are implemented via Argo Workflows, aren't they? If so, why can't we do this already? I gave it a try and I keep getting new errors, which makes me think they are not compatible.

@votti

votti commented Oct 9, 2022

I think I have Kubeflow pipelines with Katib - almost - working:
Unfortunately I do not have a minimal example yet. I set out to translate an existing ML example to run as a pipeline with Katib, just assuming that this would work, before discovering that it was not straightforward.

I followed the Argo/Katib installation instructions (https://github.com/kubeflow/katib/tree/master/examples/v1beta1/argo). Then, for now, I manually generated a manifest based on the example provided there, using a Kubeflow pipeline.

Adaptions

My example file is attached. To reproduce in your own env, also adapt the serviceAccountName and namespace.

I found that the reason this currently does not work is that when the pod with the metrics collector sidecar is set up, the run command is rewritten to be a single command.

In the Argo example, Katib re-writes:

            command:
            - python3
            - /opt/mxnet-mnist/mnist.py
            - --lr=${trialParameters.learningRate}
            - --num-examples={{inputs.parameters.num-examples}}

into

      /var/run/argo/argoexec emissary -- python3 /opt/mxnet-mnist/mnist.py --lr=0.025074661653484223 --num-examples=3157 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid

For pipelines these commands are much more complex, and I believe this rewriting breaks them.
For my example this leads to a syntax error:

sh: 2: cannot create : Directory nonexistent
/opt/conda/bin/python3: can't find '__main__' module in '/home/jovyan/'
sh: 4: Syntax error: "(" unexpected
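
Why flattening a command into one shell string can break a complex command can be shown with a small sketch. This is not Katib's actual implementation: `naive_wrap` and `quoted_wrap` are hypothetical helpers, and `complex_cmd` is an invented stand-in for a pipeline step whose arguments contain newlines, quotes and parentheses. The redirection suffix is copied from the rewritten command shown above.

```python
import shlex

# Suffix taken from the rewritten Argo example command above.
SUFFIX = " 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid"


def naive_wrap(command):
    """Join the arguments with spaces, without quoting (assumed failure mode)."""
    return " ".join(command) + SUFFIX


def quoted_wrap(command):
    """Join the arguments with shell quoting so each one survives as a single word."""
    return shlex.join(command) + SUFFIX


simple_cmd = ["python3", "/opt/mxnet-mnist/mnist.py", "--lr=0.025", "--num-examples=3157"]

# Invented example of a multi-line step command with embedded Python source.
complex_cmd = [
    "sh",
    "-ec",
    'program_path=$(mktemp)\nprintf "%s" "$0" > "$program_path"\npython3 -u "$program_path" "$@"\n',
    "def train(lr):\n    print(lr)\n",
]

# The simple command survives naive joining; the complex one does not.
simple_ok = shlex.split(naive_wrap(simple_cmd))[: len(simple_cmd)] == simple_cmd
complex_naive_ok = shlex.split(naive_wrap(complex_cmd))[: len(complex_cmd)] == complex_cmd
complex_quoted_ok = shlex.split(quoted_wrap(complex_cmd))[: len(complex_cmd)] == complex_cmd
```

With naive joining, the embedded Python source ends up as loose shell words, so the shell would try to interpret text such as `def train(lr):` itself. That would be consistent with an error like `Syntax error: "(" unexpected`, although the sketch does not prove that this is the exact cause here.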

Removing the katib.kubeflow.org/model-training: "true" label from the training step makes this step run correctly, but Katib then fails (as no metrics can be collected).

Anyway, I think that if this rewriting of the command for metrics collection were fixed, combining pipelines with Katib should work.

Run with:

  • Argo v3.3.8
  • Katib v0.14.0

Example pipeline: katib-pipeline.txt
Based on:
original_pipeline.txt

Sorry for making this hasty/sloppy report. I am working on another deadline but thought reporting this even in this state may be interesting.

@votti

votti commented Oct 17, 2022

Hey all,

I have now managed to get the Workflows running:
I worked around my issue with complex commands being broken by the adaptation for the metrics collector sidecar: writing the metrics out to a file and then having a separate, very simple pipeline step that just prints the file did the trick.

Two issues I ran into:

  • The container that prints the metrics was starting faster than the metrics collector sidecar, causing errors. Dirty workaround: using sleep before printing with cat.
  • Caching: the pipeline step printing the metrics needs to have caching turned off to be guaranteed to run every time.
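
The print-only step could look roughly like the sketch below. It is an illustration under assumptions, not the code from the attached workflow: the JSON file format, the function names and the `delay_seconds` parameter are invented here, and the `name=value` line format is assumed to be what the stdout-based metrics collector is configured to parse.

```python
import json
import time
from pathlib import Path


def metrics_lines(metrics):
    """Format a {'name': value} mapping as 'name=value' lines."""
    return [f"{name}={value}" for name, value in metrics.items()]


def print_metrics(path, delay_seconds=0.0):
    """Wait briefly (the 'sleep' workaround), then print the metrics file contents.

    Assumes the training step wrote its metrics as a JSON object to `path`.
    """
    time.sleep(delay_seconds)
    metrics = json.loads(Path(path).read_text())
    lines = metrics_lines(metrics)
    for line in lines:
        print(line)
    return lines
```

Keeping this step trivial means that its command stays simple enough to survive the command rewriting, while the training step itself can remain as complex as needed.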

A working example can be found here:
working_workflow.txt

What I found nice is that caching works as one would hope: only the steps affected by the parameter tuning, and the steps downstream of them, are re-run. E.g. initial data loading/parsing/... is only run once.
For prototyping this could come in handy.

For me the next steps will be:

  • compile my example in a python script (currently I directly modified the yaml to play around)
  • generate a minimal example that generates and runs a Katib parameter tuning on a pipeline
  • generate an issue that sidecar injection should not break complex commands -> also with minimal example

@johnugeorge johnugeorge mentioned this issue Nov 2, 2022
13 tasks
@votti

votti commented Nov 17, 2022

Hi there,
Just an update: I now have a working example that can be run from a notebook and that:

  1. Defines a Kubeflow Pipeline
  2. Runs Katib over the whole pipeline, tuning the pipeline parameters.

I am currently rewriting this as a simple MNIST example.
Would it make sense to directly prepare a branch with such a documented example in the Example folder, or would you prefer me to prepare the example in a separate repository?

I was also wondering whether it would make sense to have a metrics collector that works with Kubeflow pipeline metrics artifacts.
This would make it even more seamless to put a Katib job over an existing pipeline (will open a new issue).

@votti

votti commented Nov 17, 2022

I started a separate repo here, where I am developing the MNIST example:
https://github.com/votti/katib-exploration/blob/main/notebooks/mnist_pipeline_v1.ipynb

The general approach seems to work and the pipeline produces a yaml that can be submitted to Katib.

Unfortunately I still need to adapt things for the v2 Kubeflow pipeline syntax, so this is with v1.
Note that this requires the Argo integration setup as described: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/argo/README.md
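
The overall shape of such a manifest can be sketched as a plain Python dictionary. This is a simplified illustration and not the notebook's code: `build_experiment`, the experiment name, the namespace, the metric name and the search space are invented, and the `workflow` argument stands in for the compiled pipeline. Fields that the Argo example additionally sets (for instance success and failure conditions) are left out here.

```python
import json
from typing import Any, Dict


def build_experiment(name: str, namespace: str, workflow: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a workflow manifest into a Katib Experiment manifest (simplified sketch)."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Hypothetical objective and search space, chosen for illustration.
            "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
            "algorithm": {"algorithmName": "random"},
            "parameters": [
                {
                    "name": "lr",
                    "parameterType": "double",
                    "feasibleSpace": {"min": "0.001", "max": "0.1"},
                }
            ],
            "trialTemplate": {
                # Maps the search-space parameter 'lr' to the placeholder
                # ${trialParameters.learningRate} used inside the workflow.
                "trialParameters": [
                    {"name": "learningRate", "description": "Learning rate", "reference": "lr"}
                ],
                "trialSpec": workflow,
            },
        },
    }


# Placeholder standing in for the workflow compiled from the pipeline.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "spec": {
        "arguments": {
            "parameters": [{"name": "lr", "value": "${trialParameters.learningRate}"}]
        }
    },
}

experiment = build_experiment("kfp-mnist-demo", "my-namespace", workflow)
experiment_json = json.dumps(experiment, indent=2)
```

The key point is that the pipeline parameters to be tuned are replaced by `${trialParameters...}` placeholders inside the workflow, which Katib substitutes for each trial.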

@johnugeorge
Member

@votti This is a great start. Thank you for taking this on. We can add this MNIST example to https://github.com/kubeflow/katib/tree/master/examples/v1beta1/kubeflow-pipelines

A few items to be completed before merge:

  1. Have a clean solution for A metrics collector for Kubeflow Pipeline Metrics artifacts #2019
  2. v2 kfp support
  3. Submit using sdk instead of yaml

@votti

votti commented Nov 19, 2022

Cool, I am happy to push a bit more on this.
@1 As discussed in #2019, I think something analogous to the TFevent metrics collector could work
@2 I need to investigate how the parameters would be passed in v2
@3 This definitely also sounds feasible

@andreyvelich
Member

Thank you for these great examples @votti!
We recently made changes to the Katib SDK: #2075, so it should be easier to update your examples.
Do you have some bandwidth to contribute your example before the 0.15 Katib release (we will cut the first RC on Jan 25th)?
If not, we can postpone this to the next release.

@votti

votti commented Jan 18, 2023

Hi @andreyvelich, I will try to find time, but I am currently a bit stuck, as I lost access to my previous Kubeflow environment and am now moving to a local installation. Let's see.

@andreyvelich
Member

Sure, no rush @votti. We can contribute this example even after the release.

@votti

votti commented Feb 9, 2023

Currently I am struggling to get a setup where KFP V2 works properly, which I need in order to build compatibility with the new PipelineSpec format.
What I have also not found yet is whether, and how, the PipelineSpec format can be compiled to an Argo Workflow Template. My current solution would require this, as I rely on the Katib Controller to run Argo Workflows.
I have not yet understood, or seen documented, what kind of resource KFP V2 runs the pipelines on by default (I suspect this is still Argo).
Would somebody have some insights on this topic?

While the above point is really unclear to me and hard to explore because I lack a working KFP V2 setup, I am now moving on to polish the solution for KFP V1 and to build the metrics collector (#2019).

@AlexandreBrown

AlexandreBrown commented Feb 9, 2023

@votti Maybe @zijianjoy, @chensun, @connor-mccarthy or @Linchin have some insights on the KFP v2 setup?

votti added a commit to votti/katib-exploration that referenced this issue Feb 10, 2023
This uses the new custom KFP V1 metrics collector that can directly extract
metrics from Kubeflow Pipeline metrics.

With this collector, measuring the metrics of a Kubeflow pipeline
only requires to a) add the label marking which step is the `model-training` step,
b) disable caching for this step, and
c) configure the Katib metrics collector.

Also all the information is added now, such that the Katib pipeline
can be run via the KatibClient.

Addresses:
- kubeflow/katib#1914
- kubeflow/katib#2019

@votti

votti commented Feb 10, 2023

Hi everyone,

I now have a working implementation based on a custom metrics collector:
https://github.com/votti/katib-exploration/blob/main/notebooks/mnist_pipeline_v1.ipynb

Thus the following points are addressed:

Currently v2 kfp support is a bit of a moving target for me, as I have not managed to get a fully working installation of KFP=2.0.0a/b.
From what I have seen so far, I think it may by and large be possible to use a similar metrics collection strategy, but it is not clear to me how to produce the Argo CRD, or what kind of CRD could be produced instead.

I think that, as a first step, supporting V1 properly would already be good progress from the current situation.

Next I will integrate my example into the katib repository example together with my implementation of the KFP V1 Metrics collector (https://github.com/d-one/katib/tree/feature/kfpv1-metricscollector/cmd/metricscollector/v1beta1/kfpv1-metricscollector).

Given how sparse the documentation and examples for custom metrics collectors currently are, I hope to also add a bit of documentation on this process.

votti added a commit to d-one/katib that referenced this issue Feb 15, 2023
This example illustrates how a full kfp pipeline can
be tuned using Katib.

It is based on a metrics collector to collect kubeflow
pipeline metrics (kubeflow#2019). This is used as a Custom Collector.

Addresses: kubeflow#1914, kubeflow#2019
votti added a commit to d-one/katib that referenced this issue Jul 18, 2023
This example illustrates how a full kfp pipeline can
be tuned using Katib.

It is based on a metrics collector to collect kubeflow
pipeline metrics (kubeflow#2019). This is used as a Custom Collector.

Addresses: kubeflow#1914, kubeflow#2019
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

tenzen-y commented Sep 7, 2023

/remove-lifecycle stale


github-actions bot commented Dec 6, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

tenzen-y commented Dec 6, 2023

/lifecycle frozen
