cloud versioning: clarify config options #8409

Closed
dberenbaum opened this issue Oct 7, 2022 · 4 comments · Fixed by #8634

@dberenbaum
Collaborator

Following up on #8354 (comment), the worktree config option for remotes should not actually check out the latest version on the remote.

However, we still need a separate checkout remote config option to check out the latest version on the remote. For example, this enables a data registry where the consumers don't need DVC or Git. They can browse the latest version in the cloud knowing it is up to date, while the data registry producers can use Git and DVC to version and roll back datasets in the registry as needed.

The checkout option should make duplicates to restore old file versions, and it should add delete markers to reflect data that was deleted in the workspace. However, the checkout option should not change the version_id or other info in the .dvc files. Adding the version_id should happen separately with the worktree option enabled, and the checkout should optionally happen after, only making changes on the remote (nothing in the workspace or .dvc files).
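
As a rough illustration of the checkout behavior described above, here is a minimal sketch in Python. It is not DVC code: `plan_remote_checkout` and the plain path-to-version_id dicts are hypothetical stand-ins, and the operations are only planned, not sent to any cloud API. It shows that restoring an old version is a copy (a duplicate), that data deleted from the workspace becomes a delete marker, and that the recorded version IDs are only read, never rewritten.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical operation tuple: (kind, path, version_id to copy or None).
Op = Tuple[str, str, Optional[str]]


def plan_remote_checkout(
    wanted: Dict[str, str],         # path -> version_id recorded in the .dvc files
    remote_latest: Dict[str, str],  # path -> latest version_id currently on the remote
) -> List[Op]:
    """Plan remote-only changes so the latest remote state mirrors `wanted`."""
    ops: List[Op] = []
    for path, version_id in sorted(wanted.items()):
        if remote_latest.get(path) != version_id:
            # Restoring an old version means copying it on top of the
            # current one, which creates a duplicate with a new version ID.
            ops.append(("copy", path, version_id))
    for path in sorted(remote_latest.keys() - wanted.keys()):
        # Data deleted in the workspace is reflected with a delete marker.
        ops.append(("delete_marker", path, None))
    # `wanted` is never modified: the .dvc files keep their version IDs.
    return ops
```

For example, with `wanted = {"a": "v1", "b": "v2"}` and `remote_latest = {"a": "v3", "b": "v2", "c": "v9"}`, the plan copies version `v1` of `a` and adds a delete marker for `c`, leaving `b` alone.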

@dberenbaum dberenbaum added the A: cloud-versioning Related to cloud-versioned remotes label Oct 7, 2022
@dberenbaum dberenbaum changed the title cloud versioning: checkout config option cloud versioning: clarify config options Oct 10, 2022
@dberenbaum
Collaborator Author

After discussion with @efiop, we decided that instead of adding yet another option, we should modify what the existing options do:

version_aware - means remote with cloud versioning (aka just storage)
[+]worktree - means remote workspace (proper checkout, mirroring your local workspace)

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Oct 10, 2022
@dberenbaum
Collaborator Author

Let me better explain the purpose of this ticket: it's about clarifying what each option does and more importantly simplifying the workflow around .dvc files and what should be written to them. It's not about reducing duplicate file copies on the cloud.

Right now, when I try to use cloud versioning, the behavior seems unpredictable and breaks easily. Sometimes new version ids are added to my .dvc files unexpectedly, even for files I have not touched (see #8354).

I think we can make this simpler and more predictable:

  1. If there is already a version_id for a path, it is in the remote and pushing again should not change the .dvc file.
  2. If the path has been modified locally since the last push, the .dvc file will not have a version_id.
  3. If there is not a version_id, pushing will upload to the remote and add a version_id to the .dvc file.

Pushing a file that already has a version_id can either do nothing or it can restore that copy. From the last comment above, I suggest we use the worktree flag to specify whether to restore the copy, but we can consolidate or make it default. The important part is that restoring an old version does not update the .dvc file, which will simplify the Git workflow.
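
The three rules and the optional restore step could be sketched roughly as follows. This is only an illustration under assumptions, not how DVC implements it: `Entry`, `mark_modified`, `push`, and the `upload`/`restore` callbacks are hypothetical names standing in for a .dvc file entry and the remote operations.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for one path recorded in a .dvc file."""
    path: str
    md5: str
    version_id: Optional[str] = None


def mark_modified(entry: Entry, new_md5: str) -> None:
    # Rule 2: a local modification invalidates the recorded version_id.
    if new_md5 != entry.md5:
        entry.md5 = new_md5
        entry.version_id = None


def push(
    entry: Entry,
    upload: Callable[[str], str],
    restore: Optional[Callable[[str, str], None]] = None,
) -> bool:
    """Push one entry; return True only if the .dvc entry was changed."""
    if entry.version_id is not None:
        # Rule 1: already on the remote, so the .dvc entry stays untouched.
        # With a worktree-style flag, the recorded version may be restored
        # on the remote, still without changing the entry.
        if restore is not None:
            restore(entry.path, entry.version_id)
        return False
    # Rule 3: no version_id yet, so upload and record the new one.
    entry.version_id = upload(entry.path)
    return True
```

The point of the boolean return value is the Git workflow: a push that only restores an old copy reports no change, so there is nothing new to commit.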

@pmrowla pmrowla self-assigned this Nov 23, 2022
@pmrowla pmrowla added this to DVC Nov 23, 2022
@pmrowla pmrowla moved this from Backlog to Todo in DVC Nov 23, 2022
@pmrowla pmrowla moved this to Backlog in DVC Nov 23, 2022
@pmrowla pmrowla moved this from Todo to In Progress in DVC Nov 23, 2022
@pmrowla
Contributor

pmrowla commented Nov 24, 2022

@dberenbaum if we want worktree to reflect the latest push (including delete flags) but do not want to update existing version IDs in the .dvc file (to avoid merge conflicts), this will have the side effect of always pushing new duplicate copies of that file on dvc push for worktrees as soon as you get a situation where someone pushes from a different branch.

My .dvc file will only ever contain my branch's old version ID, and the latest version on S3 will never match. Even after I push to force the remote to contain my copy of the file (and not the other modified branch's copy), the latest version ID on S3 will now be different from what is in my .dvc file. So if I run dvc push again, even when nothing else has changed locally and nothing has changed on S3, the version IDs won't match, so DVC will have to assume that the file needs to be pushed.

I don't think it's possible to have worktree always reflect the latest push without updating version IDs in .dvc files (which will cause merge conflicts), at least until we have support for some kind of database mapping DVC MD5s to known version IDs (since at that point we will be able to check if we have an MD5 for the "latest" S3 version, and we can see that the latest MD5 matches the MD5 for our "old" version in the .dvc file).

I suppose we can try checking ETags of the latest vs. our version and skip pushing if the ETags match between our old version ID and the latest version ID, but ETags are not actually guaranteed to reliably work this way, especially for large files (where it depends on how the file was chunked for a multipart upload).
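
To illustrate why the duplicates appear, and how an MD5-to-version-ID mapping would avoid them, here is a small hypothetical sketch in Python. `needs_push` and `known_md5_by_version` are made-up names, and the mapping is the database that does not exist yet. ETag comparison is deliberately left out, since ETags are not reliable for this purpose.

```python
from typing import Dict, Optional


def needs_push(
    recorded_version_id: str,
    latest_version_id: Optional[str],
    recorded_md5: str,
    known_md5_by_version: Optional[Dict[str, str]] = None,
) -> bool:
    """Decide whether a worktree push must upload (or copy) the file again."""
    if recorded_version_id == latest_version_id:
        # The latest remote version is exactly the one in the .dvc file.
        return False
    if known_md5_by_version is not None and latest_version_id is not None:
        # Hypothetical database lookup: if the latest version is known to
        # hold the same content, the version ID mismatch is harmless.
        if known_md5_by_version.get(latest_version_id) == recorded_md5:
            return False
    # Without content information, a version ID mismatch forces a push.
    return True
```

Without the mapping, `needs_push("v1", "v2", "abc")` is true on every push, even when `v2` is a byte-for-byte copy of `v1`. With `known_md5_by_version={"v2": "abc"}` the same call returns false.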

@dberenbaum
Collaborator Author

@pmrowla It will create lots of duplicates, but I think that's considered acceptable for now. The main requirements are:

  1. Mirror the workspace so that other tools/users without DVC can use the dataset.
  2. Make it possible to recreate a version of the dataset from the .dvc files.

The proposed optimizations make sense to me, but I don't think they are required for now.

@pmrowla pmrowla moved this from In Progress to Review In Progress in DVC Nov 28, 2022
Repository owner moved this from Review In Progress to Done in DVC Nov 29, 2022