Add ability for all operators to interact with storages of AWS/GCP/AZURE #8804
Comments
You are suggesting to implement something like what we already have with `GenericTransfer`. I've thought of this too. We would need a common API for getting and sending files, preferably a shared interface that every storage hook implements. So, for example, having:

```python
class GCSHook(GcpBaseHook, FsApiHook):
    # ...
    def get_file_stream(self, bucket: str, file_path: str):
        return gcsfs.open(file_path)
    # ...


class S3Hook(AwsBaseHook, FsApiHook):
    # ...
    def write_file_stream(self, bucket: str, file_path: str):
        """Already implemented in https://github.com/apache/airflow/blob/master/airflow/providers/amazon/aws/hooks/s3.py#L582"""
        client.upload_fileobj(file_obj, bucket_name, key, ExtraArgs=extra_args)
    # ...
```

Having a common interface, I can then configure the hooks for each side, source and destination, such as:

```python
gcs_to_s3 = GenericFileTransfer(
    source_conn='gcs-conn-id',
    source_hook=GCSHook,
    source_bucket='my-gcs-bucket',
    source_file='my-file{{ ds }}.csv',
    dest_conn='s3-conn-id',
    dest_hook=S3Hook,
    dest_bucket='my-s3-bucket',
    dest_file='my-file{{ ds }}.csv',
)
```

And now you can remove all the specific copy operators. What do you think @turbaszek? I don't know how the core development is organized (I know there's a Jira and a mailing list, but I know nothing of the organization processes). This will be a core change, but if Airflow aims to thrive, it's necessary (IMHO).
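To make the intent concrete, here is a minimal sketch of how such a `GenericFileTransfer` operator might be implemented. This is an assumption-laden illustration, not existing Airflow code: it assumes the hypothetical `FsApiHook` interface above and extends `write_file_stream` to accept the source stream, which the original sketch left implicit.

```python
# Rough sketch only: a generic transfer operator built on the hypothetical
# FsApiHook interface (get_file_stream / write_file_stream).
from airflow.models import BaseOperator


class GenericFileTransfer(BaseOperator):
    template_fields = ("source_file", "dest_file")

    def __init__(self, source_conn, source_hook, source_bucket, source_file,
                 dest_conn, dest_hook, dest_bucket, dest_file, **kwargs):
        super().__init__(**kwargs)
        self.source_conn = source_conn
        self.source_hook = source_hook
        self.source_bucket = source_bucket
        self.source_file = source_file
        self.dest_conn = dest_conn
        self.dest_hook = dest_hook
        self.dest_bucket = dest_bucket
        self.dest_file = dest_file

    def execute(self, context):
        # Instantiate both hooks from their connection ids (exact constructor
        # arguments depend on the concrete hook), then stream the file from
        # source to destination through the common FS API.
        source = self.source_hook(self.source_conn)
        dest = self.dest_hook(self.dest_conn)
        with source.get_file_stream(self.source_bucket, self.source_file) as stream:
            dest.write_file_stream(self.dest_bucket, self.dest_file, stream)
```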
This issue may provide some alternative: #8059
See https://issues.apache.org/jira/browse/AIRFLOW-2651 for the previous discussion of this. There was a PR, #3526, but the interface/code wasn't quite right.
@turbaszek I see how this is related, and I face this question frequently. But I think #8059 depends on having a common FS interface like #3526 suggests. Right now we have (I might be missing some) the following file providers: S3, GCS, Azure, SFTP/SSH, FTP, Samba. If we want to allow all operations between them, it means writing a separate transfer operator for every pair of them.
It would be interesting to implement this using fsspec (https://filesystem-spec.readthedocs.io/en/latest/). Such an implementation would provide an interface to all filesystems supported by fsspec, which includes Azure Blob/Data Lake Gen2, S3 and GCS. You also get others like FTP and SFTP for free (https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations).
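As a rough illustration of that suggestion (not code from the issue; the bucket names and URLs below are placeholders), a single generic copy routine over fsspec could cover all of those backends:

```python
# Minimal sketch of a generic, streaming copy between any two
# fsspec-supported filesystems (s3://, gcs://, abfs://, sftp://, ftp://, ...).
import fsspec


def copy_between_filesystems(source_url: str, dest_url: str, chunk_size: int = 1024 * 1024) -> None:
    """Stream a file from one fsspec URL to another without loading it all into memory."""
    with fsspec.open(source_url, "rb") as src, fsspec.open(dest_url, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)


# Example (placeholder buckets):
# copy_between_filesystems("gcs://my-gcs-bucket/my-file.csv", "s3://my-s3-bucket/my-file.csv")
```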
Given there are already so many transfer operators (e.g. AWS transfers), I think implementing something like this would be more trouble than it's worth.
Description

Currently every operator is written as `XToY`, so if someone wrote `MySQLToGCS`, it doesn't help someone who needs `MySQLToS3`.

Use case / motivation

It would be great if, when a PR is raised, people only needed to handle the `X` part and provide a list for the `Y` part. Something like: people write `XToDataframe` or `XToFile`, and there is built-in integration in Airflow that handles `FileToS3`, `FileToGCS`, etc. So when a user submits a PR for `MySQLToFile`, Airflow will utilise this and auto-create `MySQLToGCS` and `MySQLToS3`. The idea is to build the infrastructure layer once so it is automated for all.
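A minimal sketch of that idea (all names hypothetical, not an existing Airflow API): the contributor writes only the `MySQLToFile` extraction, and built-in file-to-storage uploaders provide every destination:

```python
# Hypothetical sketch of the proposal: write the "X to file" part once and
# let built-in "file to Y" uploaders supply every storage destination.
from typing import Callable, Dict

# Built-in uploaders Airflow would ship (placeholder implementations).
UPLOADERS: Dict[str, Callable[[str, str], None]] = {
    "S3": lambda local_path, dest: print(f"upload {local_path} -> s3://{dest}"),
    "GCS": lambda local_path, dest: print(f"upload {local_path} -> gs://{dest}"),
}


def mysql_to_file(query: str, local_path: str) -> str:
    """The only part a contributor writes: dump query results to a local file."""
    with open(local_path, "w") as f:
        f.write(f"-- results of: {query}\n")  # placeholder for a real MySQL dump
    return local_path


def transfer(query: str, local_path: str, dest_backend: str, dest: str) -> None:
    """Auto-derived 'MySQLTo<backend>' behaviour: extract once, then upload."""
    UPLOADERS[dest_backend](mysql_to_file(query, local_path), dest)


# transfer("SELECT * FROM my_table", "/tmp/out.csv", "S3", "my-s3-bucket/out.csv")
```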