Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff: clean up output for changed files #2982

Closed
7 tasks
dmpetrov opened this issue Dec 19, 2019 · 31 comments · Fixed by #3229
Closed
7 tasks

diff: clean up output for changed files #2982

dmpetrov opened this issue Dec 19, 2019 · 31 comments · Fixed by #3229
Assignees
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension research ui user interface / interaction

Comments

@dmpetrov
Copy link
Member

dmpetrov commented Dec 19, 2019

Context:

$ dvc -V
0.77.3+e03136
# all txt files are data files
$ ls
newfile.txt     small.txt       ttt.txt
newfile.txt.dvc small.txt.dvc   ttt.txt.dvc
$ echo qwert >> small.txt
$ dvc add small.txt
$ git add small.txt.dvc
$ git commit -m hello
$ dvc diff HEAD^
dvc diff from ada9d97 to 1a22314 .  # <<-- "diff ada9d97..1a22314" might be enough

diff for 'small.txt'
-small.txt with md5 5f73f210b6202aa07279cdff6776b64f
+small.txt with md5 b4301ef665d0343e028fcd5821654edc

added file with size 6 Bytes .  # <<-- File was NOT added, it was modified. 6 bytes is diff, not size.

diff for 'ttt.txt'
-ttt.txt with md5 92a47ab38589702d290647ce24386181
+ttt.txt with md5 92a47ab38589702d290647ce24386181

file size was not changed .  # <<-- what is a reason to output this then? Also, is it about the previous diff or the next one - hard to read?

diff for 'newfile.txt'
-newfile.txt with md5 99370a5386cac00c93c9fdc836076a7d
+newfile.txt with md5 99370a5386cac00c93c9fdc836076a7d

file size was not changed  # <<-- what is a reason to output it then?

Also, it would be great to simplify this output for parsing.

Possible "clean" output:

$ dvc diff HEAD^
diff ada9d97..1a22314
Changed 'small.txt' +6 bytes
md5 5f73f210b6202aa07279cdff6776b64f b4301ef665d0343e028fcd5821654edc
New 'fakefile.txt' +13728494 bytes
md5 92a47ab38589702d290647ce24386181 99370a5386cac00c93c9fdc836076a7d
Removed 'fakefile_2222.txt' +8326436 bytes
md5 e0313660cfc07a10417543b8e4d08bea9 f4daf3bacc9a9494df7d641f572e92b66

Actions:

  • Not show not-changed files (at least if it was not specified explicitly).
  • Clearly distinguish added files and modified files - with the sizes.
  • Improve output for reading. not empty lines within the context of a single file (preferably not empty lines at all). See diff: clean up output for changed files #2982 (comment).
  • Simplify the format for parsing. See --porcelain

EDIT 12/21/19:

The same requirements should apply to directories:

$ dvc diff -t dir1 HEAD^
dvc diff from 025a8b3 to 41d8e27

diff for 'dir1'
-dir1 with md5 7492f47e0a1908a80d942a01702e6b8b.dir
+dir1 with md5 8f8824fbd7b1107ef69c85ab5db82973.dir

4 files untouched, 0 files modified, 1 file added, 0 files deleted, size was increased by 37 Bytes

Actions:

  • Add an option --porcelain (like in Git) or --to-json to make output format easy-to-parse for scripts.

EDIT 12/24/19:

  • Do not print checksum by default. A separate option might be needed.
  • Support files inside data directories.
@dmpetrov dmpetrov added bug Did we break something? ui user interface / interaction labels Dec 19, 2019
@efiop efiop added the p2-medium Medium priority, should be done, but less important label Dec 19, 2019
@dmpetrov dmpetrov added product: VSCode Integration with VSCode extension and removed product: VSCode Integration with VSCode extension labels Dec 22, 2019
@dmpetrov
Copy link
Member Author

Another request just came #770 (comment) The request is related to data dirs - dvc diff does not go deeper in data dirs. This can be considered as a part or an enhancement of this issue.

Also, it seems like data file checksums should not be shown by default. Let's prioritize human readability like in Git. An option can add checksums.

The output could look like:

$ dvc diff HEAD^
diff ada9d97..1a22314

New:            'fakefile.csv' 751 Mb
New:            'fakefile.txt' 2 Kb

Removed:        'fakefile_2222.txt' 536Kb

Modified:       'small.txt' 36Kb (+6 bytes)
Modified:       'hello.txt' 351 Mb (+12 Mb)
Modified:       'hello.csv' 631 Mb (-3.4 Mb)
Modified dir:   'images'. 116 files untouched, 2 modified, 1 added, 1 deleted. 
      New:         'DM3SZ.jpg` 2.1 Mb
      Modified:    'B8PZM.jpg' 2.5 Mb (+158 Kb)
      Modified:    'T79JK.jpg' 1.7 Mb (-0.2 Mb)
      Removed:     '4USKW.jpg' 2.6 Mb

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 24, 2019

$ dvc diff HEAD^
diff ada9d97..1a22314
Changed 'small.txt' +6 bytes
md5 5f73f210b6202aa07279cdff6776b64f b4301ef665d0343e028fcd5821654edc
...

How about just using the first 7 digits, similar t a short SHA?

md5 5f73f21 b4301ef

or the rel. path in cache

path 5f/73f210b6202aa07279cdff6776b64f b4/301ef665d0343e028fcd5821654edc

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 24, 2019

or nothing (unless a special flag is used), like in the latest example:

$ dvc diff HEAD^
diff ada9d97..1a22314

New:            'fakefile.csv' 751 Mb

Removed:        'fakefile_2222.txt' 536Kb

Modified:       'small.txt' 36Kb (+6 bytes)
Modified dir:   'images'. 116 files untouched, 1 modified, 1 added, 1 deleted. 
      New:         'DM3SZ.jpg` 2.1 Mb
      Modified:    'B8PZM.jpg' 2.5 Mb (+158 Kb)
      Removed:     '4USKW.jpg' 2.6 Mb

Other questions about ^ that example:

  • Should there be new lines between New, Modified, and Removed inside directories (indented levels) or vice versa, no new lines in he root level? For consistency
  • Should there be a --recursive/-R option with default value of 1 to avoid crazy long diffs? In that case only "files untouched, modified, added, deleted" would be printed for dirs in the last level.
  • Should the root diff also print the "files untouched, modified, added, deleted" summary on top? (With this you could use --recursive=-1 to only print the summary, keeping the output super short.)
  • I would rec. GNU-style -h human readable file size units (e.g. with humanize.naturalsize(bytes, gnu=True) from this pak.
$ dvc diff -R 2 HEAD^
diff ada9d97..1a22314
0 files/dirs untouched, 2 modified, 1 added, 1 deleted. 
New:            'fakefile.csv' 751 M
Removed:        'fakefile_2222.txt' 536 K
Modified:       'small.txt' 36K (+6 bytes)
Modified dir:   'images'. 116 files/dirs untouched, 7 modified, 1 added, 1 deleted. 
    New:         'DM3SZ.jpg` 2.1 M
    Modified:    'B8PZM.jpg' 2.5 M (+158 K)
    Removed:     '4USKW.jpg' 2.6 M
    Modified dir:   'faces'. 0 files/dirs untouched, 1 modified, 1 added, 0 deleted. 
        New:         'pokerface.jpg` 3.7 M
        Modified dir:   'animals'. 0 files untouched, 0 modified, 100 added, 0 deleted. 
    Modified dir:   'faces'. 3 dirs untouched, 1 modified, 1 added, 1 deleted. 
    Modified dir: ...
    ...

$ dvc diff -R -1 HEAD^
diff ada9d97..1a22314
0 files or directories untouched, 2 modified, 1 added, 1 deleted. 

Also, do we plan to recognize when "1 added, 1 deleted" means that a file was moved, and count it as a single change instead? Like Git does some times. This is probably overkill.

@dmpetrov
Copy link
Member Author

@jorgeorpinel thank you for the questions - all are great points!

  • Should there be new lines between New, Modified, and Removed inside directories (indented levels) or vice versa, no new lines in he root level? For consistency

Well... possibly. But instead, I'd remove all the new lines for consistency. See git status might be an example.

  • Should there be a --recursive/-R option with default value of 1 to avoid crazy long diffs? In that case only "files untouched, modified, added, deleted" would be printed for dirs in the l ast level.

-R 1 does not look natural. It would be great to see examples from other diff-like commands.

Thank you for bringing this. The problem that I see now - all these spaces as prefixes won't look good at 3rd+ level. We should think about gettig rid of the space-prefixes. It might look like:

$ dvc diff HEAD^
diff ada9d97..1a22314
New:            'fakefile.csv' 751 Mb
Removed:        'fakefile_2222.txt' 536Kb
Modified:       'small.txt' 36Kb (+6 bytes)
Modified dir:   'images'. 116 files untouched, 1 modified, 1 added, 1 deleted. 
New:            'images/DM3SZ.jpg` 2.1 Mb
Modified:       'images/B8PZM.jpg' 2.5 Mb (+158 Kb)
Removed:        'images/4USKW.jpg' 2.6 Mb
New:            'images/outliers/NRD8E.jpg` 1.3 Mb
New:            'images/outliers/W84N2.jpg` 2.0 Mb
  • Should the root diff also print the "files untouched, modified, added, deleted" summary on top? (With this you could use --recursive=-1 to only print the summary, keeping the output super short.)

I was thinking about the summary only for data dirs. I don't think it is needed for a regular ones and for dirs inside data dirs (see files in images/outliers/ above - no summary). --recursive=-1 - might be an option - is there any examlpes?

  • I would rec. GNU-style -h human readable file size units (e.g. with humanize.naturalsize(bytes, gnu=True) from this pak.

👍 We might show no sizes by default and the numbers with -h.

@jorgeorpinel
Copy link
Contributor

--recursive=-1 - might be an option - is there any examlpes?

My example was

$ dvc diff -R -1 HEAD^
diff ada9d97..1a22314
0 files or directories untouched, 2 modified, 1 added, 1 deleted. 

But you also said

I was thinking about the summary only for data dirs.

Which makes sense, so not sure we want to provide this after all.

-R 1 does not look natural. It would be great to see examples from other diff-like commands.

git diff for example can print very, very long diffs when there's lots of changed files, and it does it without indentation like you suggest (with absolute paths). No recursion/level option is available. (It does have some other options to make the output smaller like --minimal and --name-only though.)

@dmpetrov
Copy link
Member Author

dmpetrov commented Jan 3, 2020

@jorgeorpinel I got your point and -R might be a good option. But I was asking about examples in some popular unix tools like diff, ls, grep, git or similar.

Do you have any examples like this? It's okay if not.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 3, 2020

Ah no. I don't think -R '-1' exists anywhere haha. I guess it's a little strange. We could also just shift it up to 0 and make the default 1. Just some thoughts about the recursive option but these refinements are probably secondary to the main behavior and could have their own separate issue (lower priority).

@dmpetrov dmpetrov added the product: VSCode Integration with VSCode extension label Jan 13, 2020
@ghost ghost self-assigned this Jan 17, 2020
@dmpetrov
Copy link
Member Author

dmpetrov commented Jan 25, 2020

I had a great chat with @MrOutis re this issue. I'm putting down a summary...

(1) Agreed to use this-like format structure:

Added:
firmograph.txt
src/dvc.ico
data/dir1/foo.txt
data/dir1/bar.csv

Modified:
source.csv
data/                   <<--- note, this is a data dir and this is why it is in output.
data/lorem.txt
data/ipsum.txt
data/dir1/somefile.txt
data/dir1/file2.txt

Deleted:
dir/                    <<--- note, this is a data dir and this is why it is in output.
dir/file1.txt

(2) Json output should reflect this structure with "added", "modified", "deleted" on the top and list of changes. For data directories Json also needs to include "is_dir": True signal or just / suffix in name "name": "dir1/" or both.

(3) File sizes are not needed now. To solve the file sizes problem holistically we need to include file sizes in dvc-files (when we compute checksums) and then reuse it in other commands including diff.

(4) Checksums are not needed by default. But it would be great to have an option like --show-checksum to include them into the output and json.

(5) It would be great to see a diff between HEAD and the current workspace by default (like in Git).

(6) [We haven't' discussed this but it might be still important] We probably need a header like in the first line of output diff 43168e0..3f7ad04 as well in json "diff" {"from": "43168e07a3f2d5ea86eca8e1b1c926f90229d9fb", "to": "3f7ad041875321fee33dead9b9b6b073b139b943"}.

@MrOutis thanks for the discussion and please comment if you have something to add.

@ghost
Copy link

ghost commented Jan 27, 2020

(6) [We haven't' discussed this but it might be still important] We probably need a header like in the first line of output diff 43168e0..3f7ad04 as well in json "diff" {"from": "43168e07a3f2d5ea86eca8e1b1c926f90229d9fb", "to": "3f7ad041875321fee33dead9b9b6b073b139b943"}.

@dmpetrov , what Git does is to show the different checksums from each file, not the trees that you are comparing.

Repeating the arguments at the output seems excessive, since you already know where the diff comes from.

Adding such information to the JSON output is a one-liner. Let's add it just if it is needed 🙂

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 27, 2020

note, this is a data dir and this is why it is in output.

Is this a note for us here or an actual part of the output? Not sure I get it either way.

I like this proposed structure and format but this new direction for dvc diff looks more like git status than git diff. Should we be talking about dvc status in fact?

see a diff between HEAD and the current workspace by default

Yes! But again, kind of like git status?

p.s. I see this latest format is consistent with #2995 (comment) too, which is nice. I like consistency 🙂

@dmpetrov
Copy link
Member Author

Yes! But again, kind of like git status?

@jorgeorpinel good point about dvc status. In Git, status and diff do similar things but diff is focused on file content and versions/checksums (HEAD^^) while status is about workspace/staging state. We are not showing the content diff in DVC, so the difference is blurring and there is a chance that we are mixing these commands a bit.

What we need in this functionality is checksum1 vs checksum2 difference. So, it is closer to git diff from this point of view.

If we decide to incorporate dvc diff in dvc status it will require command syntax/params change and we might lose the connection with the Git commands. I'd suggest keeping dvc diff as a separate command for now and later we can merge them we find a strong reason.

Is this a note for us here or an actual part of the output? Not sure I get it either way.

It is the actual output. We just don't want to lose the information about data/not-data directory that might be important for UI.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 29, 2020

What we need in this functionality is checksum1 vs checksum2 difference. So, it is closer to git diff from this point of view.

But actually I believe dvc diff is currently being reworked to behave like the new dvc metrics diff, where by default it compares HEAD vs the working tree (local changes) @dmpetrov. (Per #2999? Not sure)

I'd suggest keeping dvc diff as a separate command for now and later we can merge them we find a strong reason.

I agree totally! My suggestion is that maybe we should be reviewing the behavior of dvc status and not that of dvc diff here.

It is the actual output.

OK but I don't get the meaning of that note. Does it mean e.g. data/ was not added to DVC as a directory, but it's being shown because files inside were? If you can explain I can suggest another phrasing for these short notes. Cc @MrOutis

@dmpetrov
Copy link
Member Author

@jorgeorpinel I've created #3255 for discussing diff vs status and possible merge.

Does it mean e.g. data/ was not added to DVC as a directory, but it's being shown because files inside were? If you can explain I can suggest another phrasing for these short notes. Cc @MrOutis

You are right. If anything was change inside a data dir - the dir will change and we don't want to lose this information (from output).

@dmpetrov
Copy link
Member Author

Also created #3256 re the file sizes.

@efiop efiop added p1-important Important, aka current backlog of things to do and removed p0-critical labels Jul 13, 2020
@dmpetrov
Copy link
Member Author

In addition to this... it seems like missing cache files are not handled properly:

$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
$ dvc diff
Deleted:
    data/data.xml
    data/features/
    data/prepared/
    model.pkl

files summary: 0 added, 2 deleted, 0 modified

# But there are no deleted files. Just missed caches.
$ dvc pull
$ dvc diff HEAD
# no diff - an expected result

Ideally, DVC should get the difference without cache by using just metafiles. An exception is data dirs (from cache). Data dirs should be pulled automatically during dvc diff (if there are no special options provided).

@dmpetrov
Copy link
Member Author

dmpetrov commented Jul 13, 2020

Related to this issue - we should store file size in addition to hashes. It will give us an ability to show file sizes in dvc diff.

Actions for dvc diff:

@pmrowla
Copy link
Contributor

pmrowla commented Sep 3, 2020

Ideally, DVC should get the difference without cache by using just metafiles. An exception is data dirs (from cache). Data dirs should be pulled automatically during dvc diff (if there are no special options provided).

Just the clarify this, is the desired behavior here to fetch and checkout a data directory if the entire directory is missing from local cache (but only if the entire directory is uncached)?

The reason I'm asking is because if we get into behavior where diff checkout's files by default, it can become unclear when a file is missing (and should be pulled/checked out) vs when a file is actually deleted.

ex if diff always pulls data dirs by default, assuming we have a dvc-tracked data directory containing [dir/file1, dir/file2]

$ dvc diff
# pulls contents of dir/ by default
# no changes (expected)

$ rm dir/file1
$ dvc diff
# automatic pull would re-checkout dir/file1
# no changes (unexpected)

In this scenario presumably dir/file1 should be considered deleted and not missing.

In the second diff command, we could assume that dir/ is not supposed to be pulled/checked out since the local cache entries exist at that point (since they were fetched/pulled by the first diff). But diff pulling files at all seems a bit unintuitive to me.

@dmpetrov

@pmrowla
Copy link
Contributor

pmrowla commented Sep 4, 2020

After discussion with @dmpetrov, we decided that diff should be aware of an additional "not in cache" file state (like dvc status). So for the example-get-started scenario where nothing has been pulled yet, we would show that 0 files have been deleted, but several files are not in cache (and still need to be pulled). This state will be displayed by default, but can be hidden by a flag.

@pmrowla
Copy link
Contributor

pmrowla commented Sep 7, 2020

Bugs for this issue have been resolved, there is an existing separate issue for the file size related changes: #3256

@pmrowla pmrowla closed this as completed Sep 7, 2020
@jorgeorpinel
Copy link
Contributor

Hi!

additional "not in cache" file state (like dvc status)

So are the states the same for status and for diff? Is it "not in cache" or is it "missing" (neither in cache nor in remote when -rc are used)?

@pmrowla
Copy link
Contributor

pmrowla commented Sep 9, 2020

@jorgeorpinel The state for diff is "not in workspace and not in local cache". We don't check anything in remotes for diff other than trying to fetch dir cache (list of files that are supposed to be in a directory output) from a default remote (and that is only fetched if it is not already present in local cache).

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 9, 2020

Right! I was confused. I think let's try not to mix the concept of status states and diff "state" (wouldn't even call them that for diffs) to avoid these confusions. Won't be easy to explain that way in docs at least, as I commented in your docs PR 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension research ui user interface / interaction
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants