-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fs: support version-aware filesystems #133
Conversation
b9d9c2b
to
4c95e87
Compare
src/dvc_objects/fs/base.py
Outdated
if "version_aware" in kwargs: | ||
self.fs_args["version_aware"] = kwargs.pop("version_aware") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do this in particular filesystems that use it? E.g. dvc-azure, dvc-s3, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we think versioning is going to end up as implementation specific in fsspec then it would make sense to do it in the individual dvc fs plugins
But given that we are probably going to end up being responsible for doing all of the underlying fsspec versioning implementations (other than s3fs) I think it might make more sense for us to try and standardize this, based on s3fs? (i.e all fsspec filesystems that support versioning should use version_aware
as the flag to indicate that it's a version aware FS)
And then on the DVC side we would only need to account for versioning in our base fsspec wrapper.
src/dvc_objects/fs/path.py
Outdated
def split_version(self, path) -> Tuple[str, Optional[str]]: | ||
if self.version_query_key: | ||
parts = list(urlsplit(path)) | ||
query = parse_qs(parts[3]) | ||
if self.version_query_key in query: | ||
version_id = first(query[self.version_query_key]) | ||
del query[self.version_query_key] | ||
parts[3] = urlencode(query) | ||
else: | ||
version_id = None | ||
return urlunsplit(parts), version_id | ||
return path, None | ||
|
||
def join_version(self, path: str, version_id: Optional[str]) -> str: | ||
if not self.version_query_key: | ||
raise ValueError("path does not support versioning") | ||
parts = list(urlsplit(path)) | ||
query = parse_qs(parts[3]) | ||
if self.version_query_key in query: | ||
raise ValueError("path already includes a version query") | ||
parts[3] = f"versionid={version_id}" if version_id else "" | ||
return urlunsplit(parts) | ||
|
||
def version_path(self, path: str, version_id: Optional[str]) -> str: | ||
path, _ = self.split_version(path) | ||
return self.join_version(path, version_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably create AzurePath
and add these methods there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This felt more like generic path methods to me. All of the filesystems that support versioning (s3, azure, gcs) use the querystring for specifying versions (and I'm not sure how else you would set a version in URL-equivalent paths?), so the only FS-specific part of versioning is the specific query param/key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and again, this is utils that I would expect to go in fsspec eventually, but for now it made more sense to me to have it in dvc-objects (shared across all of our fs's that support versioning) rather than fs specific path implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you plan to use this, @pmrowla? Will it use odb
or some APIs of dvc-objects
and dvc-data
that require version awareness?
If it's going to be a separate abstraction, I'd prefer it be implemented in individual plugins rather than here.
this is utils that I would expect to go in fsspec eventually
The Path
is not upstreamed yet, so we should not add any more of our expectations/requirements to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you plan to use this, @pmrowla? Will it use
odb
or some APIs ofdvc-objects
anddvc-data
that require version awareness?If it's going to be a separate abstraction, I'd prefer it be implemented in individual plugins rather than here.
dvc-data is going to need to be aware of versioning metadata at some point. version_id
is essentially just a per-file meta field, the same as isexec
.
And when dealing with version-aware filesystems, a complete path includes the versioning query string.
ddbe86c
to
b30874b
Compare
I moved the path versioning into dvc-s3 and dvc-azure (using S3Path/AzurePath subclasses). I still think we need a proper VersionedPath base in either dvc-objects.fs or fsspec but it's not a big deal either way right now. |
@cached_property | ||
def version_aware(self) -> bool: | ||
return getattr(self.fs, "version_aware", False) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our goal is moving towards fsspec, we need to either add it to the spec (probably not worth it) or just do the getattr manually. If we only have a handful of places - might want to go with the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: discussed that this indeed makes sense to have for now.
prereq for iterative/dvc#8164
Eventually these methods should go in fsspec but for short-term release purposes it's done here