support early stop feature #692
/0.7.0
/kind feature
/priority p1
Do we have any updates on supporting Early Stopping in Katib? According to the Google Vizier paper (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf), the algorithm should analyze metrics from running Trials and stop them early. Right now, the Metrics Collector parses metrics only once the training process is finished (https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1alpha3/file-metricscollector/main.go#L94). Any thoughts @hougangliu @gaocegege @johnugeorge?
There was some discussion regarding this before. I couldn't find the issue. @gaocegege
I have an idea how we can implement Early Stopping with the current Katib functionality. Maybe instead of creating an independent service for Early Stopping, we could mutate another container into the Trial Pod, like we are doing with the metrics collector, and stop the main training process when the Trial needs to be early stopped. This is an example for the median stopping rule (a minimal sketch follows this comment):
Once the python3 execution has failed, the container runs:
The main Training Job will be completed if
With this approach we don't break normal Kubernetes Job execution. If the training process fails because of a code error, the Training Job will fail as well. What do you think @gaocegege @johnugeorge? /cc @jlewi @richardsliu
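As an illustration of the decision logic such a sidecar could apply, here is a minimal, hypothetical sketch of the median stopping rule: the running Trial is stopped when its best metric so far is worse than the median of the other Trials' running averages at the same step. The function and variable names are assumptions for this sketch, not Katib APIs.

```python
from statistics import median
from typing import Dict, List


def should_stop_early(running_metrics: List[float],
                      completed_trials: Dict[str, List[float]],
                      step: int) -> bool:
    """Median stopping rule sketch (hypothetical helper, not a Katib API).

    Stops the running Trial if its best objective value so far is worse than
    the median of the other Trials' running averages at the same step.
    Assumes the objective is maximized.
    """
    if not completed_trials:
        return False

    # Running average of each completed Trial up to the current step.
    running_averages = [
        sum(metrics[: step + 1]) / len(metrics[: step + 1])
        for metrics in completed_trials.values()
        if len(metrics) > step
    ]
    if not running_averages:
        return False

    best_so_far = max(running_metrics)
    return best_so_far < median(running_averages)


# Example: the running Trial clearly underperforms the others at step 2.
completed = {
    "trial-1": [0.60, 0.70, 0.75],
    "trial-2": [0.55, 0.65, 0.72],
}
print(should_stop_early([0.30, 0.35, 0.40], completed, step=2))  # True
```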
I think it will be hard to implement more complex logic this way. There are many early stopping algorithms.
@gaocegege Are there any early stopping algorithms for which analyzing logs from the Training Container would not be enough?
Failing a Job might be wrong, as it conveys the wrong meaning to the user. From what you described, why don't we follow the same control flow as we have now: the Katib controller decides in each iteration whether the early stopping condition is met and, if so, marks the Experiment successful with a message saying that the early stopping condition was met.
In case of early stopping, the Job will succeed after we kill the process.
How can the Katib controller get information about each iteration? Currently, the metrics collector parses logs only once the Training Job is completed, and the Trial controller watches only for Training Job changes: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/trial/trial_controller.go#L97.
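To illustrate what per-iteration information could look like, here is a minimal sketch of tailing a metrics file while the training process is still running, instead of parsing it only after the Job completes. The file path and the `accuracy=<value>` line format are assumptions for this sketch, not the actual Katib metrics format.

```python
import re
import time

# Hypothetical metrics file written by the training container.
METRICS_FILE = "/var/log/katib/metrics.log"
METRIC_RE = re.compile(r"accuracy=([0-9.]+)")


def stream_metrics(path=METRICS_FILE, poll_seconds=5):
    """Yield metric values as new lines appear, while training is running."""
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                time.sleep(poll_seconds)  # wait for the trainer to write more
                continue
            match = METRIC_RE.search(line)
            if match:
                yield float(match.group(1))


# Example usage: react to every intermediate metric instead of the final one.
# for value in stream_metrics():
#     print("observed intermediate accuracy:", value)
```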
After the Katib meeting discussion we have a few thoughts about this issue:
/priority p0
In Optuna, the user reports intermediate values via the Python API and then stops training the model if the pruner is triggered.

```python
import optuna


def objective(trial):
    for epoch in range(10):
        # 1. Train a model
        ...
        # 2. Evaluate loss.
        ...
        # 3. Report an intermediate metric value on each epoch.
        trial.report(accuracy, step=epoch)
        # 4. Ask SuccessiveHalvingPruner() to tell whether this trial should stop or continue.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()
    ...
    return accuracy  # final metric score


if __name__ == '__main__':
    study = optuna.create_study(
        sampler=optuna.samplers.TPESampler(),  # corresponds to the suggestion service in Katib.
        pruner=optuna.pruners.SuccessiveHalvingPruner(),
    )
    study.optimize(objective, n_trials=100)
```

https://github.com/optuna/optuna/blob/master/examples/pytorch_simple.py
Thank you for this example! As I can see, they also have some sort of SDK for it. Do you know how they stop the training and mark the Trials as pruned?
Yes. They just raise the TrialPruned exception inside their objective function; Optuna then catches it and marks the trial:

```python
# Create a new trial.
trial = ...
try:
    # Call an objective function
    result = func(trial)
except exceptions.TrialPruned as e:
    # Mark the trial `PRUNED`
    ...
    self._storage.set_trial_state(trial_id, TrialState.PRUNED)
    ...
except Exception as e:
    # Mark the trial `FAILED`
    ...
    self._storage.set_trial_state(trial_id, TrialState.FAIL)
```
I was testing this approach: #692 (comment). Here: https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1beta1/file-metricscollector/main.go#L67-L83, we can analyze the required information for Early Stopping and kill the corresponding running training process. It is working for a simple Batch Job and a PyTorch Job. I believe Ray also uses an SDK for early stopping: https://docs.ray.io/en/ray-0.4.0/hyperband.html#median-stopping-rule. Do you have any other ideas or thoughts about Early Stopping in Katib @gaocegege @johnugeorge @c-bata @sperlingxx?
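For illustration, here is a minimal, hypothetical sketch of what the collector-side step could look like once the early-stopping decision fires: send SIGTERM to the main training process so the Job still completes cleanly. Selecting the process by command name via `pgrep` is an assumption for this sketch, not the actual Katib implementation.

```python
import os
import signal
import subprocess


def find_training_pid(pattern="python3"):
    """Find the main training process in the Pod (hypothetical helper).

    Assumes the metrics-collector sidecar shares the process namespace with
    the training container, so `pgrep` can see its processes.
    """
    out = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    pids = [int(p) for p in out.stdout.split()]
    return pids[0] if pids else None


def stop_training_early():
    """Send SIGTERM so the training process exits and the Job completes."""
    pid = find_training_pid()
    if pid is not None:
        os.kill(pid, signal.SIGTERM)


# stop_training_early() would be called once the early-stopping rule
# (e.g. the median stopping rule sketched above) returns True.
```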
Let's continue the discussion in #1330.
Early stopping should consider not only the goal metrics, but also the epoch, step, and so on.
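As a small illustration of that point, a stopping check could require a minimum number of steps and epochs (a warm-up) before any metric-based rule is applied. The function name and threshold values below are arbitrary assumptions, not a Katib API.

```python
def should_consider_stopping(step: int, epoch: int, metric: float,
                             min_steps: int = 100, min_epochs: int = 2,
                             goal: float = 0.99) -> bool:
    """Hypothetical check combining step/epoch warm-up with a goal metric."""
    # Never stop during the warm-up phase, even if the metric looks bad.
    if step < min_steps or epoch < min_epochs:
        return False
    # Stop once the goal is reached; a metric-based rule could be applied here instead.
    return metric >= goal


print(should_consider_stopping(step=50, epoch=1, metric=0.995))   # False (still warming up)
print(should_consider_stopping(step=500, epoch=5, metric=0.995))  # True
```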