-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dvc dag --status : Showing the status of stages in mermaid graph #9288
Conversation
Now the mermaid dag tests are all passing. |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #9288 +/- ##
==========================================
- Coverage 92.85% 92.72% -0.13%
==========================================
Files 459 459
Lines 37144 37198 +54
Branches 5356 5368 +12
==========================================
+ Hits 34490 34492 +2
- Misses 2122 2172 +50
- Partials 532 534 +2
... and 30 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. β View full report in Codecov by Sentry. |
dvc/commands/dag.py
Outdated
output = f"flowchart {direction}" | ||
|
||
if status: | ||
output += _get_class_defs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output += _get_class_defs() | |
output += "\n" + _get_class_defs() |
Looks like this is missing a newline.
Looks really cool as a first step towards better pipelines statuses! Played around with it a bit and worked well on my toy example. One question I have is how to show stages downstream of red? Green/orange is informative to tell you that it is up to date as an independent stage, but there is a good chance it will be out of date once the upstream stage is fixed. |
Yes, right. They are shown in grey to indicate them as "invalidated". There was just a small bug. I forgot to reverse the graph before invalidation. So right now the graph is a dependency graph (stage B is depending on stage A, so we have a directed edge B -> A). Another way to look at it is as a data flow graph (stage A gives data to stage B, so we have a directed edge A -> B). This means we either need to invalidate "upstream stages" in a dependency graph, or we reverse the graph first to get a data flow graph and then invalidate "downstream stages". The latter I just did my last commit now :) |
if validated is False and all( | ||
[outs_clean, outs_pushed, deps_clean, deps_pushed, command_run] | ||
): | ||
data["status"] = "grey" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to use explicit status
names instead of color names and let the downstream representation map status
to color or whatever (i.e. _get_class_defs
in mermaid).
dvc/commands/dag.py
Outdated
dag_parser.add_argument( | ||
"--direction", | ||
choices=["LR", "TD"], | ||
default="TD", | ||
help=( | ||
"Direction of the rendered mermaid DAG. " | ||
"Can either be 'LR' for left-to-right or 'TD' for top-down'." | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to be opinionated on the representation.
It is hard to draw a boundary between what can be customized and what not.
In this case, I think it is not worth having an option because users could just:
dvc dag --md | sed 's/flowchart TD/flowchart LR/g'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, it's a simple thing to change. Probably shouldn't bloat the options of this CLI.
Just some real life observation as in our pipeline we have a rather shallow pipeline (like 5-6 stages deep) but a large breadth (training data comes in as 15+ archives, as it is too large to handle otherwise & training and deployment stages are adapted for many different target architectures). So our DAGs look much prettier in left-to-right ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So our DAGs look much prettier in left-to-right ;)
I have nothing against making LR the default, though π
dvc/commands/dag.py
Outdated
dag_parser.add_argument( | ||
"--status-import", | ||
action="store_true", | ||
default=False, | ||
help="Check the dependencies of import stages. (Only compatible with --status)", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need this option.
What I do think its needed is to make the cloud status
optional, to mimic dvc status {-c}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the naming of the option is perfect, and maybe the default behavior should be inverted. But I do think it is quite relevant to have it in one way or another.
In our case, we are importing large datasets from a data repo into a model repo. If we didn't skip the dependency check for these import stages, instead of ~2 seconds, we would need ~10 minutes to generate the DAG with stage statuses.
I think, actually, checking the status of imported deps should be sped up in the first place. But for now this options gives us at least a way around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I do think it is quite relevant to have it in one way or another.
What I meant is that it makes more sense to me to enable/disable the entire cloud_status = repo.status(targets=target, cloud=True)
and not make it specific about imports
, if that makes sense. Use that condition to also skip the import
check as you do know.
In your case, there might be something especially slow about imports, but in other cases cloud_status
would be the part that takes significant time and people might want to make it optional.
So, make it a more straightforward choice for the user: whether to check the remote storage (cloud) or not. Without getting into DVC implementation details like deps
/imports
.
instead of ~2 seconds, we would need ~10 minutes to generate the DAG with stage statuses.
I think, actually, checking the status of imported deps should be sped up in the first place.
Could you open an issue sharing some more details on your setup? Those numbers look like too much time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that it makes more sense to me to enable/disable the entire
cloud_status = repo.status(targets=target, cloud=True)
and not make it specific aboutimports
, if that makes sense. Use that condition to also skip theimport
check as you do know.
Yes, that totally makes sense, and I think it a -c
flag should be included here. For the reason you mentioned.
But, actually I think there are two things going on here.
As I say in #9304, calling dvc status
is bad, when having multiple large imports. But, what is even worse in this PR, is where we basically do this:
from dvc.repo import Repo
repo = Repo(".")
for stage in repo.index.stages:
for dep in stage.deps:
dep.status()
Calling the status on the deps of an import stage is horribly slow compared to non-import stages. I'm not sure what could be a smoothest way to get the best of both worlds. Being able to check the the remote storage status, but not having to wait for minutes because you are using imports. π€· With having both flags -c
and --skip-import-deps
(as I have now renamed it in 52503fd) it would be possible. But I totally understand, it's desirable to keep the option list short and not confuse users.
One alternative I could think of, is to keep the CmdDAG
class like it was before this PR and introduce a new class CmdDAGStatus
that can have all the extra fluff π
Thanks again for this PR @ostromann! I think it's basically a visualization of I'm wondering if we should consider making this (or maybe the whole What do you think? |
I completely agree!
Please don't move the But otherwise, I fully agree. This visualization could fit nicely into | Not included in this PR yet, but on the same data that is extracted in this PR, I had also made a tabular overview, which I think could fit into This is how it looks an entire pipeline: And this is how it looks for individual stages (could be suitable for Let me know if this would be of interest. |
Sounds good @ostromann! Maybe we could have a status format that looks similar but is a simpler CLI table. |
Thanks for the PR! We haven't been able to prioritize this on our end, and it looks like it's not passing checks and has no tests, so closing this, but feel free to reopen if you want to work on updates. |
β I have followed the Contributing to DVC checklist.
[] π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
This PR adds a feature to
dvc dag
to also show the status of stages in the DAG in mermaid formatting. It addresses #9287 and consequently #iterative/studio-support#81, #iterative/studio-support#59, #iterative/studio-support#40. #5369 is addressed partially, and could easily be reachable from here, in my perspective.This PR adds a few new options to
dvc dag
, namely being:--direction {TD, LR}
: Allows to print the mermaid top-down or left-right orientation.--status
: Checks the status of all stages in the DAG and adds boolean attributes to them indicating:deps clean
: deps are unmodified and present in the local repodeps pushed
: deps are pushed to the remoteouts clean
: outs are unmodified and present in the local repoouts pushed
: outs are pushed to the remotecommand run
: stage has an unmodified command, which has run beforevalidated
: stage is valid, i.e. it is not invalidated by upstream stages that contain any changes--status-import
: Also checks the status of deps of import stages. I decided to separate it, in order to avoid waiting times for checking out imports from other repos.The output adds several class definition that add styling to the mermaid graph and color the nodes accordingly. Supported colors are:
dvc push
dvc repro
The blue stages get an additional colored border if they are not clean.
Right now the tests are still failing, which is due to a different ordering of the nodes. I reorganized the the generation of the mermaid output a bit to improve readability. However, I can't seem to figure the sorting of nodes to make it compatible with the previous output right now. Maybe you guys want to take over from here. π
Example image: