cloud versioning: clarify config options #8409
Comments
After discussion with @efiop, we decided that instead of adding yet another option, we should modify what the existing options do:
Let me better explain the purpose of this ticket: it's about clarifying what each option does and, more importantly, simplifying the workflow around .dvc files and what should be written to them. It's not about reducing duplicate file copies on the cloud. Right now, when I try to use cloud versioning, the behavior seems unpredictable and breaks easily. Sometimes new version IDs are added to my .dvc files unexpectedly, even for files I have not touched (see #8354). I think we can make this simpler and more predictable:
Pushing a file that already has a version_id can either do nothing or it can restore that copy. From the last comment above, I suggest we use the
@dberenbaum if we want My .dvc file will only ever contain my branch's old version ID, and the latest version on S3 will never match. Even after I push to force the remote to contain my copy of the file (and not the other modified branch's copy), the latest version ID on S3 will now be different from what is in my .dvc file. So if I run I don't think it's possible to have I suppose we can try comparing the ETag of the latest version with the ETag of our old version ID and skipping the push if they match, but ETags are not guaranteed to work reliably this way, especially for large files (where the ETag depends on how the file was chunked for a multipart upload).
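To illustrate the ETag caveat, here is a minimal sketch in Python. It assumes the commonly observed S3 convention (not a documented guarantee, and not applicable to every encryption mode) that a single-part upload's ETag is the MD5 of the content, while a multipart upload's ETag is the MD5 of the concatenated per-part MD5 digests followed by `-<number of parts>`. The function names and part sizes are hypothetical and chosen only for illustration.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    # Assumed convention: the ETag of a single-part upload is the MD5 hex digest.
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    # Assumed convention: MD5 over the concatenated binary MD5 digests of
    # each part, followed by "-<number of parts>".
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    return f"{hashlib.md5(b''.join(digests)).hexdigest()}-{len(digests)}"


MIB = 1024 * 1024
data = b"x" * (20 * MIB)  # identical content for every upload below

etag_single = single_part_etag(data)
etag_8mib_parts = multipart_etag(data, 8 * MIB)  # 3 parts: 8 + 8 + 4 MiB
etag_5mib_parts = multipart_etag(data, 5 * MIB)  # 4 parts: 5 + 5 + 5 + 5 MiB

print(etag_single)
print(etag_8mib_parts)
print(etag_5mib_parts)
```

Under this assumption the same bytes yield three different ETags, so an ETag mismatch between our old version and the latest version does not prove that the content differs.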
@pmrowla It will create lots of duplicates, but I think that's considered acceptable for now. The main requirements are:
The proposed optimizations make sense to me, but I don't think they are required for now.
Following up on #8354 (comment), the `worktree` config option for remotes should not actually check out the latest version on the remote.

However, we still need a separate `checkout` remote config option to check out the latest version on the remote. For example, this enables a data registry where the consumers don't need DVC or Git. They can browse the latest version in the cloud knowing it is updated, while the data registry producers can use Git and DVC to version and roll back datasets in the registry as needed.

The `checkout` option should make duplicates to restore old file versions, and it should add delete markers to reflect data that was deleted in the workspace. However, the `checkout` option should not change the `version_id` or other info in the `.dvc` files. Adding the `version_id` should happen separately with the `worktree` option enabled, and the checkout should optionally happen after, only making changes on the remote (nothing in the workspace or `.dvc` files).
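The proposed split can be sketched with an in-memory model. This is not DVC code: `VersionedBucket`, `push_with_version_ids`, and `sync_remote_latest` are hypothetical names, and a plain dict stands in for the `version_id` entries of the `.dvc` files. The first function models the `worktree`-style step that records version IDs; the second models the optional `checkout`-style step that only touches the remote.

```python
from typing import Dict, List, Optional, Tuple


class VersionedBucket:
    """Hypothetical in-memory stand-in for a bucket with versioning enabled."""

    def __init__(self) -> None:
        # key -> list of (version_id, content); content None is a delete marker
        self.history: Dict[str, List[Tuple[str, Optional[bytes]]]] = {}
        self._counter = 0

    def put(self, key: str, content: Optional[bytes]) -> str:
        self._counter += 1
        version_id = f"v{self._counter}"
        self.history.setdefault(key, []).append((version_id, content))
        return version_id

    def delete(self, key: str) -> str:
        return self.put(key, None)  # adds a delete marker, keeps old versions

    def get(self, key: str, version_id: str) -> Optional[bytes]:
        for vid, content in self.history[key]:
            if vid == version_id:
                return content
        raise KeyError((key, version_id))

    def latest(self, key: str) -> Tuple[str, Optional[bytes]]:
        return self.history[key][-1]


def push_with_version_ids(bucket, workspace, entries) -> None:
    # worktree-style step: upload untracked files and record their version_id.
    for key, content in workspace.items():
        if key not in entries:
            entries[key] = bucket.put(key, content)


def sync_remote_latest(bucket, entries) -> None:
    # checkout-style step: change only the remote, never `entries`.
    for key, version_id in entries.items():
        if bucket.latest(key)[0] != version_id:
            # Restore the recorded version by writing a duplicate copy.
            bucket.put(key, bucket.get(key, version_id))
    for key in list(bucket.history):
        if key not in entries and bucket.latest(key)[1] is not None:
            bucket.delete(key)


bucket = VersionedBucket()
entries: Dict[str, str] = {}
push_with_version_ids(bucket, {"data.csv": b"mine"}, entries)

bucket.put("data.csv", b"other branch")  # someone else overwrites the file
bucket.put("extra.csv", b"stray")        # and adds a file we do not track

sync_remote_latest(bucket, entries)
print(entries, bucket.latest("data.csv"), bucket.latest("extra.csv"))
```

Because the restored copy receives a new version ID that is never written back to `entries`, repeating `sync_remote_latest` would create another duplicate each time, which is the duplication trade-off discussed in the comments.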