Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: support version-aware filesystems #133

Merged
merged 1 commit into from
Aug 31, 2022
Merged

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Aug 24, 2022

prereq for iterative/dvc#8164

Eventually these methods should go in fsspec but for short-term release purposes it's done here

Comment on lines 80 to 81
if "version_aware" in kwargs:
self.fs_args["version_aware"] = kwargs.pop("version_aware")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this in particular filesystems that use it? E.g. dvc-azure, dvc-s3, etc

Copy link
Contributor Author

@pmrowla pmrowla Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we think versioning is going to end up as implementation specific in fsspec then it would make sense to do it in the individual dvc fs plugins

But given that we are probably going to end up being responsible for doing all of the underlying fsspec versioning implementations (other than s3fs) I think it might make more sense for us to try and standardize this, based on s3fs? (i.e all fsspec filesystems that support versioning should use version_aware as the flag to indicate that it's a version aware FS)

And then on the DVC side we would only need to account for versioning in our base fsspec wrapper.

Comment on lines 150 to 175
def split_version(self, path) -> Tuple[str, Optional[str]]:
if self.version_query_key:
parts = list(urlsplit(path))
query = parse_qs(parts[3])
if self.version_query_key in query:
version_id = first(query[self.version_query_key])
del query[self.version_query_key]
parts[3] = urlencode(query)
else:
version_id = None
return urlunsplit(parts), version_id
return path, None

def join_version(self, path: str, version_id: Optional[str]) -> str:
if not self.version_query_key:
raise ValueError("path does not support versioning")
parts = list(urlsplit(path))
query = parse_qs(parts[3])
if self.version_query_key in query:
raise ValueError("path already includes a version query")
parts[3] = f"versionid={version_id}" if version_id else ""
return urlunsplit(parts)

def version_path(self, path: str, version_id: Optional[str]) -> str:
path, _ = self.split_version(path)
return self.join_version(path, version_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably create AzurePath and add these methods there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This felt more like generic path methods to me. All of the filesystems that support versioning (s3, azure, gcs) use the querystring for specifying versions (and I'm not sure how else you would set a version in URL-equivalent paths?), so the only FS-specific part of versioning is the specific query param/key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and again, this is utils that I would expect to go in fsspec eventually, but for now it made more sense to me to have it in dvc-objects (shared across all of our fs's that support versioning) rather than fs specific path implementations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you plan to use this, @pmrowla? Will it use odb or some APIs of dvc-objects and dvc-data that require version awareness?

If it's going to be a separate abstraction, I'd prefer it be implemented in individual plugins rather than here.

this is utils that I would expect to go in fsspec eventually

The Path is not upstreamed yet, so we should not add any more of our expectations/requirements to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you plan to use this, @pmrowla? Will it use odb or some APIs of dvc-objects and dvc-data that require version awareness?

If it's going to be a separate abstraction, I'd prefer it be implemented in individual plugins rather than here.

dvc-data is going to need to be aware of versioning metadata at some point. version_id is essentially just a per-file meta field, the same as isexec.

And when dealing with version-aware filesystems, a complete path includes the versioning query string.

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 26, 2022

I moved the path versioning into dvc-s3 and dvc-azure (using S3Path/AzurePath subclasses). I still think we need a proper VersionedPath base in either dvc-objects.fs or fsspec but it's not a big deal either way right now.

@pmrowla pmrowla changed the title [WIP] fs.path: add versioning util methods [WIP] fs: support version-aware filesystems Aug 26, 2022
@pmrowla pmrowla marked this pull request as ready for review August 30, 2022 13:18
@pmrowla pmrowla changed the title [WIP] fs: support version-aware filesystems fs: support version-aware filesystems Aug 30, 2022
Comment on lines +112 to +115
@cached_property
def version_aware(self) -> bool:
return getattr(self.fs, "version_aware", False)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since our goal is moving towards fsspec, we need to either add it to the spec (probably not worth it) or just do the getattr manually. If we only have a handful of places - might want to go with the latter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: discussed that this indeed makes sense to have for now.

@efiop efiop merged commit 966d4e3 into iterative:main Aug 31, 2022
@pmrowla pmrowla deleted the path-versioning branch September 1, 2022 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants