exp: Checkpoints created during `dvc exp run --temp` run are lost after failure (e.g., `kill -9`) #8612

Comments
For comparison, if one did not use DVC's checkpoint feature built on top of Git, checkpoints would simply be file objects, possibly with a different filename for each checkpoint (e.g., including the epoch or global step number). Upon failure, those file objects would still exist and could be used to resume training after resolving the underlying issue (e.g., freeing up system resources, or after a reboot of the GPU server for other reasons). When switching to DVC checkpoints, the same should be possible. When executing an experiment directly in the workspace (i.e., without `--temp`), … Note that the actual …
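For illustration, a minimal sketch of the plain file-based approach described above (the function names and the `epoch_*.ckpt` naming scheme are only illustrative, not from the reporter's setup): each checkpoint is an ordinary file that survives a crash and can be picked up again to resume training.

```python
import glob
import os

import torch


def save_checkpoint(model, epoch: int, ckpt_dir: str = "checkpoints") -> None:
    """Write one checkpoint file per epoch (plain files, no Git involved)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(
        {"epoch": epoch, "state_dict": model.state_dict()},
        os.path.join(ckpt_dir, f"epoch_{epoch:04d}.ckpt"),
    )


def latest_checkpoint(ckpt_dir: str = "checkpoints"):
    """After a crash (OOM, kill -9, ...), the files written so far are still
    on disk; load the newest one to resume training."""
    paths = sorted(glob.glob(os.path.join(ckpt_dir, "epoch_*.ckpt")))
    return torch.load(paths[-1]) if paths else None
```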
I think this is the same as #8624. The issue here is that … You should get the expected/desired behavior if you did …
It would. However, in this issue, the signal is not triggered by me, the user. It may be triggered by the OS due to OOM, or because the user reached a system-wide limit on the number of threads, or because of …
@aschuh-hf This occurs for you with …?
I'm not sure, actually, and would have to test (but may not be able to soon). The reported issue was encountered when using …
I can reproduce with …
The problem is in `dvc/repo/experiments/queue/tasks.py`, lines 110 to 115 (at aa2e830).

If the …, while in `dvc/repo/experiments/queue/workspace.py`, lines 117 to 134 (at aa2e830), the …
First, for this issue:

So, both … We can fetch these checkpoints to the workspace, just like the live training-progress tracking in the VS Code extension. But it's better to complete the result info in the exec info file. However, I ran into a problem with this: how to mark the task state for an incomplete checkpoint training task. It leaves some results in the workspace, but doesn't finish the plan.

Another difficulty is the relationship between the experiment and the stage. Checkpoint training usually belongs to only one stage, while an experiment might be composed of a chain of different stages. What should we do with the following stages if a checkpoint stage is interrupted?

@pmrowla @dberenbaum Any thoughts about this?
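To make the open question about task state concrete, here is a purely hypothetical sketch (these states and the `classify` helper are illustrative only, not DVC's actual code) of what a status covering interrupted checkpoint runs might look like:

```python
from enum import Enum


class TaskStatus(Enum):
    # Hypothetical states for illustration; not DVC's real task states.
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"   # all stages completed
    PARTIAL = "partial"   # some checkpoints committed, plan not finished
    FAILED = "failed"     # nothing usable was produced


def classify(returncode: int, committed_checkpoints: int) -> TaskStatus:
    """Decide how to mark an interrupted checkpoint training task."""
    if returncode == 0:
        return TaskStatus.SUCCESS
    return TaskStatus.PARTIAL if committed_checkpoints > 0 else TaskStatus.FAILED
```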
fix: iterative#8612

For checkpoint experiments, in some cases users might want early stopping to reduce variance. But currently, if we interrupt/kill the experiment, it is marked as failed and all of the completed checkpoints are removed, since we clean everything up directly after the process fails.

1. Raise CheckpointKilledError instead of StageCmdFailedError if at least one checkpoint has been committed.
2. The temp-dir executor will continue collecting data if the checkpoint stage was interrupted.
3. Raise a warning if a checkpoint stage was incomplete and the subsequent stages were not run.
4. Add a new functional test for this.
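A rough sketch of the behavior points 1-2 describe. The exception classes below are local placeholders standing in for the DVC exceptions named in the PR text; this is not the actual executor code, just the decision it makes:

```python
class StageCmdFailedError(Exception):
    """Placeholder for DVC's exception of the same name."""


class CheckpointKilledError(StageCmdFailedError):
    """Placeholder; per the PR, this replaces StageCmdFailedError when at
    least one checkpoint has already been committed."""


def finish_checkpoint_stage(returncode: int, committed_checkpoints: int) -> None:
    """If an interrupted checkpoint run already committed checkpoints, report
    it as 'killed' (so results can still be collected) rather than as a plain
    command failure with nothing to salvage."""
    if returncode == 0:
        return
    if committed_checkpoints > 0:
        raise CheckpointKilledError("checkpoint stage interrupted")
    raise StageCmdFailedError("stage command failed")
```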
@karajan1001 Could you please clarify what you mean by fetching the result to the workspace? Do we merge with the existing changes? Does it happen automatically? What if there are multiple failed experiments? How does it solve the problems that you outlined?
I think there is some confusion over what the desired behavior is right now.

When using …

When using …

My understanding is that the desired behavior here is to make …

I think @karajan1001's latest question was regarding how to handle actually saving that failed/intermediate final state. In my opinion, this is not something we should be addressing right now; it would be better to handle it in the future if/when we are able to revisit checkpoint behavior in general. But for now, I think limiting the scope of this issue to "make …"
Thanks @pmrowla! I guess what I'm missing is this part: "and fetched into the main dvc repo/workspace". Is that always the case, or only when something fails? By workspace we mean the actual workspace, right?
Ah, what I should have said was maybe "fetched into git". Workspace in this case is really referring to the "main dvc/git repo" vs. the "tempdir dvc/git repo" (which is used to run …). Basically, the issue is that we are losing the successful checkpoint iterations' git commits + exp refs that get generated by …
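To make the "fetched into git" part concrete: the checkpoint commits live as Git refs in the tempdir repo and would have to be fetched back into the main repo to survive. Below is a hedged sketch of what that recovery amounts to, assuming the temp directory still existed (which is exactly what the bug prevents) and assuming DVC's experiment refs sit under a `refs/exps/*` namespace; the paths are illustrative.

```python
import subprocess


def fetch_exp_refs(main_repo: str, temp_repo: str) -> None:
    """Fetch experiment refs from the tempdir clone back into the main repo,
    so the checkpoint commits are not lost with the temp directory.

    Assumes the refs live under refs/exps/* (an implementation detail of DVC).
    """
    subprocess.run(
        ["git", "-C", main_repo, "fetch", temp_repo, "refs/exps/*:refs/exps/*"],
        check=True,
    )


# Example usage (paths are illustrative):
# fetch_exp_refs(".", ".dvc/tmp/exps/standalone/tmpXXX")
```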
@pmrowla Got it, thanks! As for the failed experiments: what happens if a regular queued or temp (non-checkpoint) exp fails? Do we show it in the table? Is there a way to extract it?
Failed queued experiments are now shown as failed in the table and through the … But this only applies to …
Bug Report
Description
I have a long-running training stage in my `dvc.yaml`, which uses DVCLive to track metrics and experiment checkpoints by specifying `checkpoint: true` for the PyTorch model `.ckpt` file created by PyTorch Lightning's `ModelCheckpoint` callback. When executing the training using `dvc exp run --temp`, it is run inside a temp folder created in `.dvc/tmp/exps/standalone/`. All checkpoint Git objects are stored under `.dvc/tmp/exps/standalone/tmpXXX/.git/objects/`. When the training process is interrupted (e.g., OOM, shared-memory issue, failure to create new threads due to OS limits), DVC reports the error `ERROR: failed to reproduce 'train': failed to run: ...` and exits. While doing so, it deletes the temp directory in `.dvc/tmp/exps/standalone/` and, along with it, all previously created checkpoints. I cannot find the same checkpoint objects in the `.git/objects` folder of the workspace and am unable to recover those checkpoints.
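This is not the reporter's actual script (which uses PyTorch Lightning and DVCLive), but a minimal sketch of the kind of checkpoint-enabled training loop involved, assuming a DVC 2.x environment where `dvc.api.make_checkpoint()` is available; `train_one_epoch` is a hypothetical placeholder.

```python
import torch
from dvc.api import make_checkpoint


def train(model, optimizer, epochs: int, ckpt_path: str = "model.ckpt") -> None:
    """Each epoch: save the checkpoint file declared with `checkpoint: true`
    in dvc.yaml, then signal DVC to record a checkpoint for it."""
    for epoch in range(epochs):
        train_one_epoch(model, optimizer)  # hypothetical training step
        torch.save({"epoch": epoch, "state_dict": model.state_dict()}, ckpt_path)
        make_checkpoint()  # DVC commits the checkpoint into the (temp) Git repo
```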
Reproduce

1. `dvc.yaml` with a `train` stage running a training script that uses DVCLive and checkpoints.
2. Run `dvc exp run --temp train`.
3. Kill the training process with `kill -9`.
4. The `.dvc/tmp/exps/standalone/tmpXXX` folder is gone. No checkpoint objects remain in the workspace (e.g., nothing shows up in `dvc exp show`).
).Expected
Checkpoints should be preserved so that one can recover from failures such as those mentioned in the description.
Environment information
Output of `dvc doctor`:

Additional Information (if any):
When interrupting the experiment with CTRL+C, the training script is set up to still return a zero exit code, so that DVC considers the experiment successfully executed. In this case, I expect the checkpoints to be preserved before the temp directory is deleted (but I haven't tested this yet).
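For reference, a minimal sketch (standard library only, not the reporter's actual code) of the kind of signal handling that makes CTRL+C produce a zero exit code:

```python
import signal
import sys


def handle_sigint(signum, frame):
    """Treat CTRL+C as a graceful stop: exit with code 0 so DVC sees the run
    as successful and keeps its results."""
    # ... save/flush any pending state here ...
    sys.exit(0)


signal.signal(signal.SIGINT, handle_sigint)
```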