exp: show failed runs with `--temp` #10616

gregstarr · 2024-11-08T16:53:52Z

Bug Report

Description

Experiments run with --temp that fail appear to go missing. It sounds like this was an issue previously and was fixed, but I am still having this problem. I am using a recent version of DVC, and the previous issue is from 2 years ago.

#8612

Reproduce

dvc exp run --temp
dvc exp show properly lists experiment
first stage success
second stage fail
dvc exp show doesn't show the failed experiment

Expected

dvc exp show shows the failed experiment
If for some reason it shouldn't show the failed experiment, then please note this in the docs, because it seems like a fairly significant behavior difference between running with the queue and running with --temp.

Environment information

Output of dvc doctor:

DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.11.10 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 3.16.6
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.8
Supports:
        http (aiohttp = 3.10.8, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.8, aiohttp-retry = 2.8.3)
Config:
        Global: /home/starrgw1/.config/dvc
        System: /etc/xdg/dvc


gregstarr · 2024-11-08T17:34:32Z

Here is the failed experiment that I can't find. This is the JSON from the tmp/exps/run directory:

{
  "git_url": "path/.dvc/tmp/exps/standalone/tmpqkp9v9u1",
  "baseline_rev": "ce7991feda5aabbfa18d9541c5d90191850b7b09",
  "location": "tempdir",
  "root_dir": "path/.dvc/tmp/exps/standalone/tmpqkp9v9u1",
  "dvc_dir": ".dvc",
  "name": "mean_depth_norm",
  "wdir": ".",
  "result_hash": null,
  "result_ref": null,
  "result_force": false,
  "status": 1
}
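A minimal sketch of how files like the one above could be scanned for runs that never recorded a result. It assumes the run-info files are JSON documents somewhere under the tmp/exps/run directory and that a null result_hash means no result was saved; the exact file layout and the meaning of the status codes are not confirmed here, so the status value is reported as-is rather than interpreted. The function name runs_without_result is hypothetical.

```python
import json
from pathlib import Path


def runs_without_result(run_dir):
    """Return (file name, experiment name, status) for run-info files with no result.

    Assumption: every run-info file is a JSON object with a "result_hash" key,
    as in the example above. Files that are not JSON are skipped.
    """
    found = []
    for path in sorted(Path(run_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            info = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or not JSON; skip it
        if not isinstance(info, dict) or "result_hash" not in info:
            continue  # JSON, but not shaped like the run info shown above
        if info["result_hash"] is None:
            found.append((path.name, info.get("name"), info.get("status")))
    return found
```

For example, runs_without_result(".dvc/tmp/exps/run") would list the file name, experiment name, and raw status of every run-info file whose result_hash is null.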

gregstarr · 2024-11-08T17:59:20Z

Wow I can git cat-file the sha from the name of the json file:

> tree a55df7458541c3aeecc816840830de454d66472d
parent ce7991feda5aabbfa18d9541c5d90191850b7b09
parent 1e3c6c8e652abbaf9fd1f47a63cdeadb4e8b5697
author starrgw1 email 1730829745 +0000
committer starrgw1 email 1730829745 +0000

dvc-exp:ce7991feda5aabbfa18d9541c5d90191850b7b09:ce7991feda5aabbfa18d9541c5d90191850b7b09:mean_depth_norm

And in fact I can apply the commit through either git or DVC. However, it doesn't look like I can get back the stage 1 results? dvc.lock seems unchanged, so nothing gets checked out.

shcheklein · 2024-11-08T22:39:07Z

@gregstarr could you share a script / project to reproduce this?

gregstarr · 2024-11-09T00:20:54Z

Here is a minimal example: https://github.com/gregstarr/minimum-dvc

clone
install a recent version of dvc
dvc exp run --temp --name "exp1"
dvc exp show -A

The third command should fail in the third stage, then the fourth command shouldn't show the experiment "exp1". I would expect dvc exp show -A to show the failed experiment. I would also think the output of stage 1, which finished successfully, should be in the cache.

shcheklein · 2024-11-11T01:04:50Z

Thanks @gregstarr, from what I see it was indeed discussed here: #8612 (comment). I see that we completely drop the directory with failed --temp experiments and we don't collect them in show. @pmrowla do you remember by chance: is it just not-yet-implemented functionality, or was it by design for some reason (e.g. we drop directories to save space, and thus we don't show them since users won't be able to restore / apply them)?

pmrowla · 2024-11-16T01:17:55Z

IIRC this is expected behavior for --temp runs. Failed experiments are only collected when run via the queue. The failure state for a queued experiment run is tied to the celery/dvc-task job for that run, but --temp runs don't touch celery at all.

In the codebase you can see that queue runs have an implementation for collect_failed_data

dvc/dvc/repo/experiments/queue/celery.py

Line 569 in 1e08cc5

def collect_failed_data(

but workspace runs do not (and --temp runs extend the workspace implementation)

dvc/dvc/repo/experiments/queue/workspace.py

Lines 250 to 255 in 1e08cc5

    
    def collect_failed_data(
        self,
        baseline_revs: Optional[Collection[str]],
        **kwargs,
    ) -> dict[str, list["ExpRange"]]:
        raise NotImplementedError

shcheklein · 2024-11-16T01:21:23Z

@pmrowla yep, thanks. But what was the reason for this? (if you remember :) )

pmrowla · 2024-11-16T02:50:43Z

It's leftover from --temp predating the queue. Workspace runs have never had any kind of saved execution state (other than the git commit on success), and --temp was originally just the extension of "do exactly what we do in a workspace run except in an isolated directory".

gregstarr · 2024-11-18T19:00:11Z

I have a shell script which calls dvc exp run --temp, which is nice because the command blocks until the pipeline completes. How would I replicate this behavior using the queue so that I can see when experiments fail and examine the failed state?

shcheklein · 2024-11-18T19:07:20Z

@gregstarr how about analyzing the results of the dvc queue status command?

(I wonder if we should just do dvc queue wait or something)

gregstarr · 2024-11-18T20:49:17Z

So have Python or bash call dvc queue status periodically and check whether the queue is empty?

shcheklein · 2024-11-18T21:05:32Z

yep, something like that
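A rough sketch of that polling loop, for anyone who wants a blocking wrapper around the queue. It assumes that dvc queue status prints one row per task and marks unfinished tasks with the words "Queued" or "Running"; check this against the output of your installed DVC version before relying on it. The helper names queue_has_pending and wait_for_queue are made up for this example.

```python
import subprocess
import time


def queue_has_pending(status_text):
    """True if any whitespace-separated token is 'Queued' or 'Running'.

    Assumption: these are the status words printed for unfinished tasks.
    An experiment literally named 'Queued' or 'Running' would also match.
    """
    tokens = status_text.split()
    return "Queued" in tokens or "Running" in tokens


def wait_for_queue(poll_seconds=30.0, get_status=None):
    """Block until the status output lists no queued or running tasks.

    Returns the last status text so the caller can inspect it for failures.
    """
    if get_status is None:
        def get_status():
            result = subprocess.run(
                ["dvc", "queue", "status"],
                capture_output=True, text=True, check=True,
            )
            return result.stdout
    while True:
        text = get_status()
        if not queue_has_pending(text):
            return text
        time.sleep(poll_seconds)
```

A shell script could queue its experiments, start the queue, and then call wait_for_queue() and inspect the returned text. The get_status parameter only exists so the loop can be exercised without a DVC repository.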

gregstarr · 2024-11-18T21:41:19Z

OK, I think I will just continue to use --temp. Even though failures aren't captured, I like the fact that it is simple and that output isn't captured by any intermediary, i.e. the queue.

Is this issue an easy fix or pretty complicated? If it's going to be a while, what do you think about adding a note in the docs mentioning the difference in behavior when using --temp?

dvc queue wait sounds great!

shcheklein added the "A: experiments" (Related to dvc exp) and "triage" (Needs to be triaged) labels Nov 8, 2024

shcheklein added the "feature request" (Requesting a new feature) and "p3-nice-to-have" (It should be done this or next sprint) labels and removed the "triage" (Needs to be triaged) label Nov 16, 2024

shcheklein changed the title ~~exp: failed runs with --temp go missing~~ exp: show failed runs with --temp Nov 16, 2024
