Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

worktree add: dvc add removes valid metadata for files which are already tracked #8293

Closed
pmrowla opened this issue Sep 15, 2022 · 4 comments · Fixed by #8595
Closed

worktree add: dvc add removes valid metadata for files which are already tracked #8293

pmrowla opened this issue Sep 15, 2022 · 4 comments · Fixed by #8595
Assignees
Labels
A: cloud-versioning Related to cloud-versioned remotes A: data-management Related to dvc add/checkout/commit/move/remove A: data-sync Related to dvc get/fetch/import/pull/push p1-important Important, aka current backlog of things to do

Comments

@pmrowla
Copy link
Contributor

pmrowla commented Sep 15, 2022

dvc add always generates new stages, overwrites any existing stages and dumps a new .dvc file. For regular DVC workflows this is not a problem, but for cloud versioning/worktree remotes this causes a loss of metadata.

consider this workflow where I add one file and then push it:

$ dvc add foo
$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  path: foo
$ dvc push
1 file pushed
$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  path: foo
  version_id: '2022-09-15T07:45:55.1712903Z'
  etag: '0x8DA96EE52317788'

If I were to run dvc pull now, DVC would pull that specific version of the file (2022-09-15T07:45:55.1712903Z) from the remote.

If I now run dvc add without modifying foo, I get a .dvc file that does not contain any version/etag metadata, even though foo has not changed, and the metadata for the version I already pushed should still be valid:

$ dvc add foo

To track the changes with git, run:

        git add foo.dvc

$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  path: foo

Now if I run dvc pull, DVC will just pull the latest version of foo from the remote, which may not necessarily be the version that I initially pushed (since the .dvc file no longer contains a known version_id to pull)

Note that if I had modified foo, this metadata-clearing behavior is expected, since the newly modified version has not been pushed (and we do not know what the versionid/etag for that new version will be until we push).

The same problem applies to directories, where individual file entries that have not changed still lose metadata on dvc add.


For the worktree workflow, dvc add cannot just remove existing stages and re-write them. dvc add actually needs to check for outputs that are already tracked, and preserve the old metadata for file entries that have not changed.

For individual files, if the output hash (oid) has not changed, we need to preserve the existing metadata for the entire output.

For dirs, we need to do an actual tree merge, where we check the oids for each entry in the new tree. If we have meta for that oid in the pre-existing tree, it needs to be merged into the new one.

@pmrowla pmrowla self-assigned this Sep 15, 2022
@pmrowla pmrowla added A: data-sync Related to dvc get/fetch/import/pull/push A: data-management Related to dvc add/checkout/commit/move/remove labels Sep 15, 2022
@pmrowla pmrowla added the bug label Sep 15, 2022
@dberenbaum
Copy link
Collaborator

dberenbaum commented Sep 15, 2022

Now if I run dvc pull, DVC will just pull the latest version of foo from the remote, which may not necessarily be the version that I initially pushed (since the .dvc file no longer contains a known version_id to pull)

In this scenario where there is a cloud-versioned remote and no version id in the .dvc file, should dvc pull actually pull the latest version by default or should it fail? I would expect that pulling the latest version should be either a separate flag or command. Failing seems more consistent with the non-cloud-versioned workflow (dvc add followed by dvc pull will fail since the added data hasn't been pushed).

If version_aware is not enabled, I'm not sure how much sense this whole workflow makes, but I would expect dvc add && dvc pull to fail because the cloud data doesn't match the version info in the .dvc file. This is similar to doing dvc import-url --no-download followed by dvc pull. If the cloud object has changed, dvc pull will fail.

Maybe worth discussing with @dmpetrov .

@pmrowla
Copy link
Contributor Author

pmrowla commented Sep 16, 2022

I'm not sure what the behavior should be for fetch/pull with no etag or version ID, but that's a separate question from this issue. dvc add should not be modifying metadata for unchanged files, whether or not we allow fetching with no version ID

@dberenbaum
Copy link
Collaborator

Sorry, I missed that this was without modifying foo.

@dberenbaum dberenbaum added the A: cloud-versioning Related to cloud-versioned remotes label Sep 27, 2022
@dberenbaum dberenbaum added the p3-nice-to-have It should be done this or next sprint label Oct 12, 2022
@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p3-nice-to-have It should be done this or next sprint labels Nov 17, 2022
@dberenbaum
Copy link
Collaborator

Related to #8354 and #8409. DVC shouldn't be modifying version_id for unchanged files.

@pmrowla pmrowla added this to DVC Nov 21, 2022
@pmrowla pmrowla moved this to Backlog in DVC Nov 21, 2022
@pmrowla pmrowla moved this from Backlog to In Progress in DVC Nov 21, 2022
Repository owner moved this from In Progress to Done in DVC Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: cloud-versioning Related to cloud-versioned remotes A: data-management Related to dvc add/checkout/commit/move/remove A: data-sync Related to dvc get/fetch/import/pull/push p1-important Important, aka current backlog of things to do
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants