support early stop feature #692
/0.7.0
/kind feature
/priority p1
Do we have any updates on supporting Early Stopping in Katib? According to the Google Vizier paper (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf), the algorithm should analyze metrics from running Trials and stop them early. Right now, the Metrics Collector parses metrics only once the training process is finished (https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1alpha3/file-metricscollector/main.go#L94). Any thoughts @hougangliu @gaocegege @johnugeorge?
There was some discussion regarding this before. I couldn't find the issue. @gaocegege
I have an idea how we can implement Early Stopping with the current Katib functionality. Maybe instead of creating an independent service for Early Stopping, we could mutate another container into the Trial Pod, like we are doing with the metrics collector, and stop the main training process when the Trial needs to be early stopped. This is an example for the median stopping rule (a minimal sketch follows this comment):
Once the python3 execution has failed, the container runs:
The main Training Job will be completed if
With this approach we don't break normal Kubernetes Job execution. If the training process fails because of a code error, the Training Job will fail as well. What do you think @gaocegege @johnugeorge? /cc @jlewi @richardsliu
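As an illustration of the decision logic such a sidecar could apply, here is a minimal, hypothetical sketch of the median stopping rule: the running Trial is stopped when its best metric so far is worse than the median of the other Trials' running averages at the same step. The function and variable names are assumptions for this sketch, not Katib APIs.

```python
from statistics import median
from typing import Dict, List


def should_stop_early(running_metrics: List[float],
                      completed_trials: Dict[str, List[float]],
                      step: int) -> bool:
    """Median stopping rule sketch (hypothetical helper, not a Katib API).

    Stops the running Trial if its best objective value so far is worse than
    the median of the other Trials' running averages at the same step.
    Assumes the objective is maximized.
    """
    if not completed_trials:
        return False

    # Running average of each completed Trial up to the current step.
    running_averages = [
        sum(metrics[: step + 1]) / len(metrics[: step + 1])
        for metrics in completed_trials.values()
        if len(metrics) > step
    ]
    if not running_averages:
        return False

    best_so_far = max(running_metrics)
    return best_so_far < median(running_averages)


# Example: the running Trial clearly underperforms the others at step 2.
completed = {
    "trial-1": [0.60, 0.70, 0.75],
    "trial-2": [0.55, 0.65, 0.72],
}
print(should_stop_early([0.30, 0.35, 0.40], completed, step=2))  # True
```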
I think it will be hard to implement more complex logic this way. There are many early stopping algorithms.
@gaocegege Are there any early stopping algorithms for which analyzing logs from the Training Container would not be enough?
Failing a Job might be wrong, as it conveys the wrong meaning to the user. From what you described, why don't we follow the same control flow as we have now: the Katib controller decides in each iteration whether the early stopping condition is met and, if so, marks the Experiment successful with a message saying that the early stopping condition was met.
In case of early stopping, the Job will succeed after we kill the process.
How can the Katib controller get information about each iteration? Currently, the metrics collector parses logs only once the Training Job is completed, and the Trial controller watches only for Training Job changes: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/trial/trial_controller.go#L97.
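To illustrate what per-iteration information could look like, here is a minimal sketch of tailing a metrics file while the training process is still running, instead of parsing it only after the Job completes. The file path and the `accuracy=<value>` line format are assumptions for this sketch, not the actual Katib metrics format.

```python
import re
import time

# Hypothetical metrics file written by the training container.
METRICS_FILE = "/var/log/katib/metrics.log"
METRIC_RE = re.compile(r"accuracy=([0-9.]+)")


def stream_metrics(path=METRICS_FILE, poll_seconds=5):
    """Yield metric values as new lines appear, while training is running."""
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                time.sleep(poll_seconds)  # wait for the trainer to write more
                continue
            match = METRIC_RE.search(line)
            if match:
                yield float(match.group(1))


# Example usage: react to every intermediate metric instead of the final one.
# for value in stream_metrics():
#     print("observed intermediate accuracy:", value)
```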
After the Katib meeting discussion we have a few thoughts about this issue:
/priority p0
In Optuna, the user reports intermediate values via the Python API and then stops training the model if the pruner is triggered.

```python
import optuna


def objective(trial):
    for epoch in range(10):
        # 1. Train a model
        ...
        # 2. Evaluate loss.
        ...
        # 3. Report an intermediate metric value on each epoch.
        trial.report(accuracy, step=epoch)
        # 4. Ask SuccessiveHalvingPruner() to tell whether this trial should stop or continue.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()
    ...
    return accuracy  # final metric score


if __name__ == '__main__':
    study = optuna.create_study(
        sampler=optuna.samplers.TPESampler(),  # corresponds to the suggestion service in Katib.
        pruner=optuna.pruners.SuccessiveHalvingPruner(),
    )
    study.optimize(objective, n_trials=100)
```

https://github.com/optuna/optuna/blob/master/examples/pytorch_simple.py
Thank you for this example! As I can see, they also have some sort of SDK for it. Do you know how they stop the training and mark the Trials as pruned?
Yes. They just raise the TrialPruned exception inside their objective function; Optuna then catches it and marks the trial:

```python
# Create a new trial.
trial = ...
try:
    # Call an objective function
    result = func(trial)
except exceptions.TrialPruned as e:
    # Mark the trial `PRUNED`
    ...
    self._storage.set_trial_state(trial_id, TrialState.PRUNED)
    ...
except Exception as e:
    # Mark the trial `FAILED`
    ...
    self._storage.set_trial_state(trial_id, TrialState.FAIL)
```
I was testing this approach: #692 (comment). Here: https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1beta1/file-metricscollector/main.go#L67-L83, we can analyze the required information for Early Stopping and kill the corresponding running training process. It is working for a simple Batch Job and a PyTorch Job. I believe Ray also uses an SDK for early stopping: https://docs.ray.io/en/ray-0.4.0/hyperband.html#median-stopping-rule. Do you have any other ideas or thoughts about Early Stopping in Katib @gaocegege @johnugeorge @c-bata @sperlingxx?
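For illustration, here is a minimal, hypothetical sketch of what the collector-side step could look like once the early-stopping decision fires: send SIGTERM to the main training process so the Job still completes cleanly. Selecting the process by command name via `pgrep` is an assumption for this sketch, not the actual Katib implementation.

```python
import os
import signal
import subprocess


def find_training_pid(pattern="python3"):
    """Find the main training process in the Pod (hypothetical helper).

    Assumes the metrics-collector sidecar shares the process namespace with
    the training container, so `pgrep` can see its processes.
    """
    out = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    pids = [int(p) for p in out.stdout.split()]
    return pids[0] if pids else None


def stop_training_early():
    """Send SIGTERM so the training process exits and the Job completes."""
    pid = find_training_pid()
    if pid is not None:
        os.kill(pid, signal.SIGTERM)


# stop_training_early() would be called once the early-stopping rule
# (e.g. the median stopping rule sketched above) returns True.
```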
Let's continue the discussion in #1330.
Early stopping should consider not only the goal metrics, but also the epoch, step, and so on.
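As a small illustration of that point, a stopping check could require a minimum number of steps and epochs (a warm-up) before any metric-based rule is applied. The function name and threshold values below are arbitrary assumptions, not a Katib API.

```python
def should_consider_stopping(step: int, epoch: int, metric: float,
                             min_steps: int = 100, min_epochs: int = 2,
                             goal: float = 0.99) -> bool:
    """Hypothetical check combining step/epoch warm-up with a goal metric."""
    # Never stop during the warm-up phase, even if the metric looks bad.
    if step < min_steps or epoch < min_epochs:
        return False
    # Stop once the goal is reached; a metric-based rule could be applied here instead.
    return metric >= goal


print(should_consider_stopping(step=50, epoch=1, metric=0.995))   # False (still warming up)
print(should_consider_stopping(step=500, epoch=5, metric=0.995))  # True
```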