Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] import-url: cloud versioned imports #8164

Closed
wants to merge 7 commits into from

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Aug 23, 2022

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • WIP proof of concept
  • requires other WIP dvc-data and dvc-objects changes which are not pushed yet

@dberenbaum
Copy link
Collaborator

Thanks, @pmrowla!

Some questions and my thoughts on them:

  • Does it work for directories? How are they stored?
  • Can it work without configuring a DVC remote? I think this needs to be supported.
  • When to push cloud-versioned files to the DVC remote? Can we push only if there is no cloud version ID? It might seem a bit inconsistent or unexpected, but I think it might be a reasonable default.
  • Should this behavior be controlled be the same flag that controls the human-readable cache? I think it needs to be separate. Any reason not to have this behavior by default?
  • Should the version ID be specified by --rev or a new flag? I slightly prefer reusing --rev and expanding its meaning since it brings us closer to fusing import and import-url and hints that the cloud data source acts like a remote, but I'm open to a new flag if there is a strong argument.

@dberenbaum
Copy link
Collaborator

@pmrowla Do you expect we can get this merged with the questions above resolved this sprint (Sep 6)?

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 25, 2022

  • Does it work for directories? How are they stored?

This currently only supports files and not directories (how to handle/store dirs requires more discussion).


  • Should this behavior be controlled be the same flag that controls the human-readable cache? I think it needs to be separate. Any reason not to have this behavior by default?

We cannot enable this by default right now, because we have to explicitly enable versioning when we connect with fsspec's API. (Currently fsspec does not support any way to detect whether or not an s3 or azure bucket has versioning support enabled, and it relies on the caller telling it that the bucket supports versioning).

So right now, in this PR we only use fsspec in cloud-version aware mode if we are using a DVC remote that has the human-readable/version-aware option set.

What we could do is assume that if the user passes an explicit versioned path (azure://container/file?versionid=...) that we should be using the cloud-versioning import feature. But for non-versioned paths (azure://container/file) there's no way for DVC to tell whether or not this user means "import container/file from a regular/non-versioned bucket", or "import the most recent version of container/file from a versioned bucket".

One option would be to add something like an optional --version flag, so either of these imports would tell DVC to use the versioning feature:

# import and pin version from explicitly versioned path
$ dvc import-url azure://container/file?versionid=1234...

# import and pin latest version from explicit flag and non-versioned path
$ dvc import-url --version-aware azure://container/file

  • Can it work without configuring a DVC remote? I think this needs to be supported.

This is not currently implemented in this PR (due to the fsspec explicit version-aware issue from the last question), but in theory this would work without configuring a DVC remote if either:

  • the bucket is public does not require authentication
  • the user has the correct environment for azure or s3 CLI authentication

i.e. for azure we could support the following without using the DVC remote:// syntax, assuming that my az login account and the env var connection string have the right permissions to access the test-container bucket:

$ az login
$ export AZURE_STORAGE_CONNECTION_STRING="..."
$ dvc import-url "azure://test-container/dir/file?versionid=2022-08-09T06:33:06.7654238Z"

If the user does not have the right s3/az CLI environment to access the bucket (and wants to specify some non-default AWS profile or azure connection string), they would have to use the DVC remote syntax to provide specific credentials.

All of this is the same as existing behavior/limitations for import-url with buckets requiring authentication, it's not specific to the versioning feature.


  • When to push cloud-versioned files to the DVC remote? Can we push only if there is no cloud version ID? It might seem a bit inconsistent or unexpected, but I think it might be a reasonable default.

I'm not sure that whether or not there is a cloud version ID matters here. This is basically the same as the pull question from #8172 (comment)

If we intend to support both push/pull to and from original source location and push/pull to and from DVC remotes, we need to decide which is the default and how to tell DVC when you don't want the default

One thing to remember is that we do not support push/pull to a DVC remote for DVC repo imports (#4527). DVC-import data always comes from the original source repo. I'm assuming that whatever we decide to do, ideally we want to move toward consistent/unified import/import-url behavior, so if we support both "original source" and DVC remotes for import-url, we would probably want a solution that addresses #4527 for regular import as well.


  • Should the version ID be specified by --rev or a new flag? I slightly prefer reusing --rev and expanding its meaning since it brings us closer to fusing import and import-url and hints that the cloud data source acts like a remote, but I'm open to a new flag if there is a strong argument.

This PR uses --rev right now, but I don't feel too strongly either way (whether we use --rev or something like --version/--version-id).

One thing to consider is that in the potential future "DVC without Git" scenario, there will probably be actual "DVC revisions/revs" (for an entire "cloud versioned DVC repo" state) which would be a separate concept from individual file version-IDs within the same bucket.

@dberenbaum
Copy link
Collaborator

We cannot enable this by default right now, because we have to explicitly enable versioning when we connect with fsspec's API. (Currently fsspec does not support any way to detect whether or not an s3 or azure bucket has versioning support enabled, and it relies on the caller telling it that the bucket supports versioning).

One option would be to add something like an optional --version flag

We can go with a --version flag for now. Eventually, I think we will need to determine whether versioning support is enabled. From a brief look, I think there are at least ways in the underlying SDKs to check if the contents have a version ID, although I understand that might require more fsspec changes.

This is not currently implemented in this PR (due to the fsspec explicit version-aware issue from the last question)

To me, it's another signal that this shouldn't depend on a remote-specific config option.

All of this is the same as existing behavior/limitations for import-url with buckets requiring authentication, it's not specific to the versioning feature.

If it works without a remote as long as the appropriate environment is set (like import-url today), that would be great.

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 26, 2022

With the current set of PR's this works without DVC remotes if the user provides a versioned path (azure://container/file?versionid=1234... or s3://bucket/file?versionId=1234...), I have not added the import-url --version-aware flag to enable using a path without versionid yet

@pmrowla pmrowla force-pushed the import-url-version branch 4 times, most recently from fbdc582 to e895699 Compare August 26, 2022 08:35
Comment on lines 76 to 78
current = self.version_id
fs_path = self.fs.path.version_path(self.fs_path, None)
updated = self.fs.info(fs_path)["version_id"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: discussed that we can capture this during object build() in meta. Also might want to add something like a no_hash=True flag to build, to avoid computing hashes.

@pmrowla pmrowla force-pushed the import-url-version branch from 6a55756 to 00c96fc Compare August 31, 2022 05:40
@pmrowla
Copy link
Contributor Author

pmrowla commented Sep 1, 2022

this PR is going to be split into separate PRs for the base output changes and import specific changes

@pmrowla pmrowla closed this Sep 1, 2022
@pmrowla pmrowla mentioned this pull request Sep 1, 2022
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push enhancement Enhances DVC
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants