Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsspec base Hook for filesystem operations #33578

Closed
wants to merge 2 commits into from

Conversation

AlexisBRENON
Copy link
Contributor

@AlexisBRENON AlexisBRENON commented Aug 21, 2023

Related: #8804, #3526

As explained in the related issue, it can be hard to cover all the <provider>To<provider> use cases. It could be nice to have a generic "file transfer" operator which can handle most of the storage provider (local filesystem as well as cloud blob storages). Such kind of storage wrapper already exists as the fsspec library which is already an Airflow dependency.

This PR aims at providing a base "class" (it's more like a mixin) to mix with blob storage provider (such as S3, GCS, etc.) to allow easy creation of fsspec.FileSystem from such hooks.

I am not very fluent with the Hook/Connection API, so I am not sure that this PR make sense.


@potiuk
Copy link
Member

potiuk commented Aug 22, 2023

We are about to unsuspend yandex provider (which was the main reason why the PR failed) . You should rebase after we merge it: #33574

To be honest, I am not sure if we need it. I think there is a very little benefit from having such "common" provider. We already have (and continue having) a big share of problems resulting from common.sql extraction - and there was a good reason for that extraction - i.e. Open Lineage integration might benefit from it.

With Filesstem, I am not sure what are the benefits. The key thing is -extracing code is cool (because of DRY) but it introduces coupling (i.e. all users of the common code become implicitly dependent). And it introduces all kinds of issues where change in the common code might lead to dubtle bugs in related providers.

I yet fail to see what would be the benefits of introducing such common code. But maybe I am short-sited?

@AlexisBRENON
Copy link
Contributor Author

To be honest, I am not sure if we need it. I think there is a very little benefit from having such "common" provider. [...]
[...] it introduces coupling (i.e. all users of the common code become implicitly dependent). And it introduces all kinds of issues where change in the common code might lead to dubtle bugs in related providers.

I understand your point of view and agree that having such generic class mixed with a lot of other ones make changes hard to do and to predict all the impacts.

What about a less "common" provider, more focused on fsspec.Filesystem creation ?
My actual use case is to move some data from one bucket to another. Currently, my operator take two URIs as arguments, as well as one connection ID (for GCS access in my case). But this requires some rework if I need:

  • to change one of the storage from GCS to S3 for exemple,
  • different credentials for each buckets.

Having a FileSystem connection type would allow me to provision a connection for each of my storage service and using only two connections ID as arguments of the operator.

I yet fail to see what would be the benefits of introducing such common code. But maybe I am short-sited?

Do you think that this use case would benefit from introducing a FileSystem connection, or do I mis-conceived my operator and/or workflow ?

@potiuk
Copy link
Member

potiuk commented Aug 23, 2023

To be honest, I am not sure if we need it. I think there is a very little benefit from having such "common" provider. [...]
[...] it introduces coupling (i.e. all users of the common code become implicitly dependent). And it introduces all kinds of issues where change in the common code might lead to dubtle bugs in related providers.

I understand your point of view and agree that having such generic class mixed with a lot of other ones make changes hard to do and to predict all the impacts.

What about a less "common" provider, more focused on fsspec.Filesystem creation ? My actual use case is to move some data from one bucket to another. Currently, my operator take two URIs as arguments, as well as one connection ID (for GCS access in my case). But this requires some rework if I need:

  • to change one of the storage from GCS to S3 for exemple,
  • different credentials for each buckets.

Having a FileSystem connection type would allow me to provision a connection for each of my storage service and using only two connections ID as arguments of the operator.

I yet fail to see what would be the benefits of introducing such common code. But maybe I am short-sited?

Do you think that this use case would benefit from introducing a FileSystem connection, or do I mis-conceived my operator and/or workflow ?

I suggest you to do a different thing. Write a taskflow operator for your case using hooks. I wrote a blog post about it some time ago - and IMHO using this kind of approach is way better than trying to make a "generic" transfer operator

https://medium.com/apache-airflow/generic-airflow-transfers-made-easy-5fe8e5e7d2c2

@@ -145,7 +146,7 @@ def wrapper(*args, **kwargs) -> T:
return cast(T, wrapper)


class S3Hook(AwsBaseHook):
class S3Hook(AwsBaseHook, FsApiHook):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, what kind of behaviour expected here?

@github-actions
Copy link

github-actions bot commented Oct 8, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Oct 8, 2023
@github-actions github-actions bot closed this Oct 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:alibaba provider:amazon AWS/Amazon - related issues provider:databricks provider:ftp provider:github provider:google Google (including GCP) related issues provider:http stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants