[SPIKE/RESEARCH] Async RedshiftDataOperator #417

Closed
phanikumv opened this issue Jun 6, 2022 · 3 comments
Labels: area/async (Deferrable/async operators), research (Requires research or investigation)

Comments

@phanikumv
Collaborator

phanikumv commented Jun 6, 2022

Acceptance Criteria:
Find a Python library that supports an asynchronous implementation for RedshiftDataOperator if the official library does not support it. Document the possible options and the reasons for selecting a particular library in this GitHub issue via a Summary comment.

Ensure that connection is set up and working.

phanikumv changed the title from "[SPIKE/RESEARCH] RedshiftDataOperator" to "[SPIKE/RESEARCH] Async RedshiftDataOperator" on Jun 6, 2022
phanikumv added the area/async (Deferrable/async operators) and research (Requires research or investigation) labels on Jun 6, 2022
@pankajkoti
Collaborator

The RedshiftDataOperator has the same objective as the RedshiftSQLOperator: to submit a SQL statement to the Redshift cluster for execution. The difference between the two is that the RedshiftSQLOperator needs a connection to the cluster's Postgres endpoint (the default connection name being redshift_default), whereas the RedshiftDataOperator needs no such additional connection. It uses the AWS connection (default connection name aws_default) together with the boto library to connect to the Redshift cluster.
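To make the connection difference concrete, here is a minimal sketch of submitting SQL through the Redshift Data API using only AWS credentials. It assumes `client` is a boto3 `redshift-data` client; `submit_sql` is a hypothetical helper for illustration, not part of Airflow, and the cluster, database, and user values are placeholders.

```python
from typing import Any


def submit_sql(client: Any, sql: str, cluster_id: str, database: str, db_user: str) -> str:
    """Submit SQL through the Redshift Data API and return the statement ID.

    `client` is assumed to be boto3.client("redshift-data"). No Postgres
    endpoint connection is involved, only AWS credentials.
    """
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    # The call returns right away; the statement keeps running on the cluster.
    return response["Id"]
```

With real credentials this could be called as `submit_sql(boto3.client("redshift-data"), "SELECT 1", "my-cluster", "dev", "awsuser")`, where all three names are made up.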

The execution time of the RedshiftDataOperator depends on the SQL statement submitted: it takes as long as the Redshift cluster needs to run the statement. The same is true of the RedshiftSQLOperator. In my opinion, this qualifies it for an async version. We already have an async version of RedshiftSQLOperator, so I believe we should implement one for RedshiftDataOperator as well. The RedshiftDataOperator returns the query ID for the submitted SQL, and we can poll the status of this query ID asynchronously.
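The polling idea could look roughly like the sketch below. It assumes `client` is a boto3 `redshift-data` client whose `describe_statement` call reports a `Status` field; `wait_for_statement` and its `poll_interval` parameter are hypothetical names, and this is not the eventual operator or trigger implementation.

```python
import asyncio
from typing import Any

# States in which the Data API reports that the statement did not succeed.
FAILED_STATES = {"FAILED", "ABORTED"}


async def wait_for_statement(client: Any, statement_id: str, poll_interval: float = 5.0) -> str:
    """Poll describe_statement until the query finishes, without blocking the event loop."""
    while True:
        # boto3 calls are blocking, so run each one in a worker thread.
        response = await asyncio.to_thread(client.describe_statement, Id=statement_id)
        status = response["Status"]
        if status == "FINISHED":
            return status
        if status in FAILED_STATES:
            raise RuntimeError(
                f"Statement {statement_id} ended with {status}: {response.get('Error')}"
            )
        # Yield control so other coroutines can run between polls.
        await asyncio.sleep(poll_interval)
```

Because the coroutine only sleeps between status checks, many such waits can share one event loop, which is the main reason to prefer this over a blocking loop inside a worker slot.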

@pankajkoti
Collaborator

Airflow reference PR where the RedshiftDataOperator was added: apache/airflow#19137

This PR includes the full discussion of why this operator was added and what the fundamental difference is between RedshiftDataOperator and RedshiftSQLOperator.

@pankajkoti
Collaborator

Conclusion: Implement the RedshiftDataOperator async operator.
