Add support for entire kubeflow pipelines as trial target (in addition to containers) #1914
Comments
Just dug in a little more. Actually, IMHO we need a new TrialTemplate; as Argo and Tekton are already supported, I'll go for one of those initially.
Hi @romeokienzler, I'm also very interested in this feature, and it's actually surprising that it isn't provided already, as it sounds natural. I have a question though: I see there is an example of integrating Argo Workflows with Katib here. Kubeflow Pipelines are implemented via Argo Workflows, aren't they? If so, why can't we do this already? I gave it a try and I'm constantly getting new errors, which makes me think they are not compatible.
I think I have Kubeflow Pipelines with Katib (almost) working: I followed the Argo/Katib installation instructions (https://github.com/kubeflow/katib/tree/master/examples/v1beta1/argo). For now I manually generated a manifest based on the example provided there, adapted for Kubeflow. My example file is attached; to reproduce in your own env, adapt it to your setup.

I found that the reason this currently does not work is that when the pod with the metrics collector sidecar is set up, the run command is re-written to be a single command. As the commands for pipelines are much more complex than in the Argo example, I believe this re-writing breaks them. Anyways, I think if this re-writing of the command for metrics collection were fixed, combining pipelines with Katib should work.

Example pipeline: katib-pipeline.txt

Sorry for the hasty/sloppy report; I am working on another deadline but thought that reporting this even in this state may be interesting.
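The re-writing problem described above can be illustrated with a hypothetical sketch. The wrapper command, log path, and completion marker below are assumptions for illustration, not Katib's verbatim output; the point is only that a structured `command`/`args` pair gets collapsed into one shell invocation:

```yaml
# Hypothetical sketch of the sidecar injection; paths and wrapper
# are placeholders, not Katib's exact output.
# Original step container:
container:
  command: ["python"]
  args: ["train.py", "--lr", "0.01"]
---
# After metrics-collector injection, everything is collapsed into a
# single shell command so that stdout can be captured into a log file:
container:
  command: ["sh", "-c"]
  args:
    - >-
      python train.py --lr 0.01 1>/var/log/katib/metrics.log 2>&1
      && echo completed > /var/log/katib/done
```

For a plain training container this collapse is harmless, but Argo's pipeline step containers carry much more elaborate commands, which plausibly explains the breakage reported above.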
Hey all, I now managed to get the workflows running. I ran into two issues on the way.

A working example can be found here. What I found nice is that caching works as one would hope: only the steps affected by the tuned parameters, and steps downstream of them, are re-run; e.g. the initial data loading/parsing will only be run once. I will follow up on the next steps.
Hi there,

I am currently re-writing this into a simple MNIST example. Something I was wondering: wouldn't it make sense to have a metrics collector that works directly with Kubeflow Pipelines metrics?
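The idea of a KFP-aware collector can be sketched in a few lines. KFP v1 components write a metrics artifact (`mlpipeline-metrics.json`) containing a `metrics` list of `name`/`numberValue` entries; a collector could read that and re-emit the values as `name=value` lines, a format Katib's standard collectors can parse. A minimal sketch, assuming that artifact format and output convention:

```python
import json


def kfp_metrics_to_katib_lines(metrics_json: str) -> list[str]:
    """Convert a KFP v1 mlpipeline-metrics JSON document into
    Katib-style ``name=value`` lines."""
    doc = json.loads(metrics_json)
    lines = []
    for metric in doc.get("metrics", []):
        lines.append(f"{metric['name']}={metric['numberValue']}")
    return lines


if __name__ == "__main__":
    # Example payload in the KFP v1 metrics artifact format.
    payload = json.dumps({
        "metrics": [
            {"name": "accuracy", "numberValue": 0.95, "format": "PERCENTAGE"},
            {"name": "loss", "numberValue": 0.12, "format": "RAW"},
        ]
    })
    print("\n".join(kfp_metrics_to_katib_lines(payload)))
```

A real collector would additionally have to locate the metrics artifact of the right pipeline step, which is the harder part.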
I started a separate repo here where I am developing the MNIST example. The general approach seems to work, and the pipeline produces a YAML that can be submitted to Katib. Unfortunately I would need to adapt things for the v2 Kubeflow Pipelines syntax, thus this is with v1.
@votti This is a great start, thank you for taking this on. We can add this MNIST example to https://github.com/kubeflow/katib/tree/master/examples/v1beta1/kubeflow-pipelines. A few items remain to be completed before merge.
Thank you for these great examples @votti!
Hi @andreyvelich, I will try to find time, but I am currently a bit blocked, as I lost access to my previous Kubeflow environment and am now moving to a local installation. Let's see.
Sure, no rush @votti. We can contribute this example even after the release.
Currently I am struggling to get a setup where KFP V2 is properly working, in order to build compatibility with the new syntax. While the above point is really unclear to me and hard to explore due to my lack of a working KFP V2 setup, I am now moving on to polish the solution for V1.
@votti Maybe @zijianjoy, @chensun, @connor-mccarthy, or @Linchin have some insights on the KFP v2 setup?
This uses the new custom KFP V1 metrics collector that can directly extract metrics from Kubeflow Pipeline metrics. With this collector, measuring metrics of a Kubeflow pipeline only requires one to a) add a label indicating which step is the `model-training` step, b) disable caching for this step, and c) configure the Katib metrics collector. Also, all the information is now added such that the Katib pipeline can be run via the KatibClient. Addresses: - kubeflow/katib#1914 - kubeflow/katib#2019
Hi everyone, I now have a working implementation based on a custom metrics collector. Thus the following points are addressed:

Currently the v2 KFP support is a bit of a moving target for me, as I have not managed to get a fully working installation of KFP 2.0.0a/b. I think properly supporting V1 as a first step would already be good progress from the current situation. Next I will integrate my example into the Katib examples. Documentation and examples for writing a custom metrics collector are quite sparse.
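For readers following along, the custom-collector approach plugs into a Katib Experiment roughly as sketched below. Katib's v1beta1 API allows a custom collector container via `collector.kind: Custom`; the container name and image here are hypothetical placeholders, not a published collector:

```yaml
# Sketch of the relevant Experiment snippet; image and name are
# illustrative placeholders, not a real published collector.
metricsCollectorSpec:
  collector:
    kind: Custom
    customCollector:
      name: kfp-metrics-collector
      image: example.com/kfp-metrics-collector:latest  # hypothetical
```

The custom collector then takes over the job of finding and reporting the pipeline's metrics, instead of Katib's default sidecar command re-writing.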
This example illustrates how a full kfp pipeline can be tuned using Katib. It is based on a metrics collector to collect kubeflow pipeline metrics (kubeflow#2019). This is used as a Custom Collector. Addresses: kubeflow#1914, kubeflow#2019
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
/kind feature
Describe the solution you'd like
Hyperparameters affect not only the training step but also upstream pipeline components, such as feature transformation (e.g. the parameters of a normalization transformation). In addition, transformation and training steps should be able to make use of KFP's parallel components (e.g. SparkJob, TFJob, ...). It would be helpful to allow not only containers as trial targets but also complete Kubeflow pipelines. As the latter also expose parameters, these can either be set directly (non-hyperparameters) or added to the hyperparameter space.
Anything else you would like to add:
I've started to create a simple container image which can be used as a trial target; it acts as a proxy and triggers parameterized Kubeflow pipeline executions downstream with the respective hyperparameters. A Kubernetes Custom Resource could be created as well down the line.
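The proxy idea can be sketched as follows: the trial container reads hyperparameters from the command line (the way Katib passes them to a trial), triggers a parameterized pipeline run, waits for it, and prints the resulting metric as a `name=value` line for Katib's stdout collector. The pipeline-triggering part is stubbed out here with a dummy function; a real implementation could use the KFP SDK client instead. All names and the metric formula are illustrative:

```python
import argparse


def run_pipeline(params: dict) -> float:
    """Placeholder for triggering a parameterized Kubeflow pipeline run
    and waiting for its result. A real implementation might submit the
    pipeline via the KFP SDK client; here we just return a dummy metric
    (pretending a lower learning rate scores better)."""
    return 1.0 - params["lr"]


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(description="Katib trial proxy (sketch)")
    parser.add_argument("--lr", type=float, required=True)
    args = parser.parse_args(argv)
    accuracy = run_pipeline({"lr": args.lr})
    # Katib's stdout metrics collector can parse name=value lines.
    line = f"accuracy={accuracy}"
    print(line)
    return line


if __name__ == "__main__":
    main()
```

The nice property of this design is that Katib only ever sees an ordinary trial container, while the actual work runs as a regular pipeline with full caching and parallelism.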
Love this feature? Give it a 👍 We prioritize the features with the most 👍