
exp: Checkpoints created during dvc exp run --temp run are lost after failure (e.g., kill -9) #8612

Closed
aschuh-hf opened this issue Nov 23, 2022 · 15 comments · Fixed by #8668


Bug Report

Description

I have a long-running training stage in my dvc.yaml which uses DVCLive to track metrics and experiment checkpoints, with checkpoint: true set for the PyTorch model .ckpt file created by PyTorch Lightning's ModelCheckpoint callback. When the training is started with dvc exp run --temp, it runs inside a temp folder created in .dvc/tmp/exps/standalone/, and all checkpoint Git objects are stored under .dvc/tmp/exps/standalone/tmpXXX/.git/objects/. When the training process is interrupted (e.g., OOM, a shared memory issue, or a failure to create new threads due to OS limits), DVC reports ERROR: failed to reproduce 'train': failed to run: ... and exits. While doing so, it deletes the temp directory in .dvc/tmp/exps/standalone/ and, along with it, all previously created checkpoints. I cannot find the same checkpoint objects in the .git/objects folder of the workspace and am unable to recover those checkpoints.

Reproduce

  1. Create a dvc.yaml with a train stage running a training script that uses DVCLive and checkpoints (a minimal sketch follows below).
  2. Execute the stage with dvc exp run --temp train.
  3. Wait a number of epochs until a few checkpoints have been stored.
  4. Kill the training process with kill -9.
  5. Observe that the .dvc/tmp/exps/standalone/tmpXXX folder is gone and that no checkpoint objects appear in the workspace (e.g., in dvc exp show).
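
A minimal stand-in for step 1, assuming a dvc.yaml train stage whose model.ckpt output is marked checkpoint: true. The original setup uses PyTorch Lightning's ModelCheckpoint callback together with DVCLive; this sketch instead calls dvc.api.make_checkpoint() directly to keep it short, and the stage and file names are illustrative only:

    # train.py: simplified stand-in, not the reporter's actual training script.
    import json
    import time

    from dvc.api import make_checkpoint  # signals DVC to commit a checkpoint for this iteration


    def train_one_epoch(epoch):
        time.sleep(1)                 # placeholder for real training work
        return 1.0 / (epoch + 1)      # fake, decreasing loss


    for epoch in range(100):
        loss = train_one_epoch(epoch)
        # Overwrite the output marked with `checkpoint: true` in dvc.yaml.
        with open("model.ckpt", "w") as f:
            json.dump({"epoch": epoch, "loss": loss}, f)
        # Ask DVC to record a checkpoint (a Git commit on the experiment ref).
        make_checkpoint()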

Expected

Checkpoints should be preserved so that it is possible to recover from failures such as those mentioned in the description.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.31.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:

Supports:
        azure (adlfs = None, knack = 0.10.0, azure-identity = 1.11.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = None),
        hdfs (fsspec = None, pyarrow = 9.0.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = None, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7),
        webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git

Additional Information (if any):

When the experiment is interrupted with CTRL+C, the training script is set up to still return a zero exit code so that DVC considers the experiment successfully executed. In this case, I expect the checkpoints to be preserved before the temp directory is deleted (but I haven't tested this yet).
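
For illustration, a minimal sketch (not the reporter's actual script) of a training loop that treats SIGINT (CTRL+C, kill -2) as a request to stop early while still exiting with code 0; SIGKILL (kill -9) can never be caught this way, which is why the failure described above behaves differently:

    import signal
    import sys

    stop_requested = False


    def handle_sigint(signum, frame):
        # Finish the current epoch, then exit cleanly.
        global stop_requested
        stop_requested = True


    signal.signal(signal.SIGINT, handle_sigint)

    for epoch in range(100):
        if stop_requested:
            break
        ...  # one epoch of training, checkpointing as usual

    sys.exit(0)  # zero exit code, so DVC treats the run as successful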

@aschuh-hf (Author) commented Nov 23, 2022

For comparison: if one did not use DVC's checkpoint feature built on top of Git, checkpoints would simply be files, possibly with a different filename per checkpoint (e.g., including the epoch or global step number). Upon failure, those files would still exist and could be used to resume training after resolving the underlying issue (e.g., freeing up system resources, or after a reboot of the GPU server for unrelated reasons). When switching to DVC checkpoints, the same should be possible. When executing an experiment directly in the workspace (i.e., without the --temp flag), this should indeed be the case because the files in the .git folder are preserved.

Note that even with --temp the actual .ckpt file objects are still stored in the DVC cache, because the cache lives outside the .dvc/tmp/exps/standalone/tmpXXX temp folder. However, without the corresponding Git objects of the commits referencing these files in the cache (as well as the code diffs used to produce them), it is difficult to retrieve them.

@karajan1001 added the bug and A: experiments (Related to dvc exp) labels Nov 24, 2022
@karajan1001 self-assigned this Nov 24, 2022
@pmrowla (Contributor) commented Nov 25, 2022

I think this is the same as #8624. The issue here is that kill -9 sends SIGKILL and not SIGINT. The CTRL-C handler in your pipeline presumably only works for SIGINT.

You should get the expected/desired behavior if you use kill -2 (SIGINT) instead.
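
For illustration, a small sketch of the difference, using a throwaway sleep process as a stand-in for the training job:

    import signal
    import subprocess

    # Stand-in for the long-running training process.
    proc = subprocess.Popen(["sleep", "60"])

    proc.send_signal(signal.SIGINT)    # like `kill -2`: handlers may run, cleanup is possible
    # proc.send_signal(signal.SIGKILL) # like `kill -9`: terminated immediately, no cleanup
    proc.wait()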

@aschuh-hf (Author) commented Nov 25, 2022

It would. However, in the issue reported here, the signal is not triggered by me as the user. It may be triggered by the OS due to OOM, or because the user hit a system-wide limit on the number of threads, or because of a sudo shutdown issued by a system administrator rather than by the user running the long training job.

@karajan1001 added this to DVC Nov 25, 2022
@karajan1001 moved this to Backlog in DVC Nov 25, 2022
@dberenbaum (Collaborator)

@aschuh-hf This occurs for you with --temp but not --queue, correct?

@aschuh-hf (Author)

I'm not sure, actually, and would have to test (but may not be able to soon).

The reported issue was encountered when using --temp.

@dberenbaum (Collaborator)

I can reproduce with --temp but not with --queue. Can we make --temp behave like --queue?

@karajan1001 (Contributor)

In --queue, the relevant code is:

    try:
        cmd = ["dvc", "exp", "exec-run", "--infofile", infofile]
        proc_dict = queue.proc.run_signature(cmd, name=entry.stash_rev)()
        collect_exp.s(proc_dict, entry_dict)()
    finally:
        cleanup_exp.s(executor, infofile)()

If the command run via run_signature fails, collect_exp and cleanup_exp still run.

In --temp, by contrast:

    try:
        ...
        if exec_result.ref_info:
            results[rev].update(
                self.collect_executor(
                    self.repo.experiments, executor, exec_result
                )
            )
    except CheckpointKilledError:
        # Checkpoint errors have already been logged
        return {}
    except DvcException:
        raise
    except Exception as exc:
        raise DvcException(
            f"Failed to reproduce experiment '{rev[:7]}'"
        ) from exc
    finally:
        executor.cleanup(infofile)
    return results

Here collect_executor only runs if the training succeeds.
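
A self-contained toy (not DVC code, and not the actual patch) illustrating the control-flow change under discussion: the failure path should still collect whatever checkpoints completed, and cleanup should only run afterwards:

    class CheckpointKilledError(Exception):
        """Stand-in for DVC's CheckpointKilledError."""


    def run_and_collect(run, collect, cleanup):
        """Run the experiment, collect its results, and always clean up last."""
        results = {}
        try:
            results.update(collect(run()))
        except CheckpointKilledError as exc:
            # The run died mid-training, but the checkpoints completed so far
            # still exist; collect them instead of returning an empty result.
            results.update(collect(exc.args[0]))
        finally:
            cleanup()  # analogous to removing .dvc/tmp/exps/standalone/tmpXXX
        return results


    # Example: two checkpoints completed, then the process was "killed".
    def run():
        raise CheckpointKilledError(["ckpt-1", "ckpt-2"])


    print(run_and_collect(
        run,
        collect=lambda revs: {rev: "collected" for rev in revs},
        cleanup=lambda: print("temp dir removed"),
    ))
    # prints "temp dir removed", then {'ckpt-1': 'collected', 'ckpt-2': 'collected'}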

@karajan1001 (Contributor) commented Dec 6, 2022

First, the current behavior for this issue:

  1. If something goes wrong during a --temp run, nothing is left in the exp show table.
  2. If something goes wrong during a --queue run, only a failed experiment is left (no checkpoint results).

So, both --temp and --queue should push incomplete checkpoint results to the workspace.

We could fetch these checkpoints into the workspace the same way live training progress is tracked for the VS Code extension, but it would be better to record the result info in the executor info file. One problem I ran into here is how to mark the task state for an incomplete checkpoint training task: it leaves some results in the workspace but does not finish the planned run.

My suggestion is that we should write the incomplete result into the executor info file; otherwise we will have both incomplete results and a failed experiment in the exp show table.

Another difficulty is the relationship between the experiment and its stages. Checkpoint training usually belongs to a single stage, while an experiment may be composed of a chain of different stages. What should we do with the following stages if a checkpoint stage is interrupted?

I think we should stop after the checkpoint stage and write its result into the executor info file, but it would be better to clean the outputs of all stages that never started, or we might return misleading metric values to the user.

@pmrowla @dberenbaum Any thoughts on this?

karajan1001 added a commit to karajan1001/dvc that referenced this issue Dec 7, 2022
fix: iterative#8612
For checkpoint experiments, users may sometimes want to stop training early to reduce variance. But currently, if we interrupt/kill the experiment it is marked as failed, and all of the completed checkpoints are removed because we clean everything up directly after the process fails.

1. Raise CheckpointKilledError instead of StageCmdFailedError if at least one checkpoint has been committed.
2. The temp-dir executor continues collecting data if the checkpoint stage was interrupted.
3. Raise warnings if a checkpoint stage was incomplete and the following stages were not run.
4. Add a new functional test for this.
@shcheklein (Member)

"So, both --temp and --queue should push incomplete checkpoint results to the workspace."

@karajan1001 could you please clarify what you mean by fetching the result to the workspace? Do we merge it with the existing changes? Does it happen automatically? What if there are multiple failed experiments?

How does it solve the problems that you outlined?

@pmrowla (Contributor) commented Dec 8, 2022

I think there is some confusion over what the desired behavior is right now. When using --queue, the current behavior is that any successful checkpoints will be preserved (and fetched into the main dvc repo/workspace from the temp execution workspace). If a later checkpoint iteration fails, we do not save that failed state, but the previous successful checkpoint iterations will still be available in the exp show table.

When using --temp we do not preserve the successful checkpoint iterations at all, due to the bug that @karajan1001 described: #8612 (comment)

My understanding is that the desired behavior here is to make --temp behave the same way as --queue for checkpoints. So successfully run iterations will still be available, but we do not actually need to save the failed/intermediate repo state of the final (unsuccessful/cancelled) iteration.

I think @karajan1001's latest question was about how to handle actually saving that failed/intermediate final state. In my opinion, this is not something we should be addressing right now; it would be better to handle it in the future if/when we are able to revisit checkpoint behavior in general.

But for now, I think limiting the scope of this issue to "make --temp behave consistently with --queue" is what we should be focusing on.

@shcheklein (Member)

Thanks @pmrowla! What I'm missing, I guess, is this part: "and fetched into the main dvc repo/workspace". Is that always the case, or only when something fails? And by workspace we mean the actual workspace, right?

@pmrowla (Contributor) commented Dec 8, 2022

Ah, what I should have said was maybe "fetched into git". Workspace in this case really refers to the "main dvc/git repo" vs. the "tempdir dvc/git repo" (which is used to run --queue and --temp exps outside of the main repo). I'm not talking about the actual local workspace.

Basically, the issue is that we are losing the successful checkpoint iterations' Git commits and exp ref generated by exp run --temp, since we do not fetch them on failure (even though we do fetch them on failure for exp run --queue). This is not related to whether or not the changes actually get applied in the user's workspace directory afterwards.

@shcheklein (Member)

@pmrowla got it! thanks.

As for the failed experiments: what happens if a regular queued or temp (non-checkpoint) exp fails? Do we show it in the table, and is there a way to extract it?

@pmrowla (Contributor) commented Dec 8, 2022

Failed queued experiments are now shown as failed in the table and through the exp queue commands, but we are not saving any Git commits for those failed exps; you just get a row showing which run failed (and you can now use queue logs to see the error logs explaining why it failed).

But this only applies to --queue'd experiments.

@omesser added the bug (Did we break something?) label and removed the bug label Dec 14, 2022
karajan1001 added a commit to karajan1001/dvc that referenced this issue Dec 16, 2022 (fix: iterative#8612, same commit message as above)
@karajan1001 moved this from Backlog to Review In Progress in DVC Dec 20, 2022
pmrowla pushed a commit to karajan1001/dvc that referenced this issue Dec 26, 2022 (fix: iterative#8612, same commit message as above)
Repository owner moved this from Review In Progress to Done in DVC Dec 26, 2022
skshetry pushed a commit that referenced this issue Dec 26, 2022 (fix: #8612, same commit message as above)