Add RedshiftDataHook #19137
Conversation
Why can't we use RedshiftSQLHook or RedshiftHook? Why do we need a new hook? It's not clear to me.
I suggest we add these methods to the existing RedshiftHook and create operator(s) that would do what we want
RedshiftSQLHook extends DbApiHook, whereas RedshiftDataHook is based on AwsBaseHook. While they have similar capabilities, the underlying methods are completely different. If we changed RedshiftSQLHook, it would break anything based on a Postgres connection.
RedshiftHook is based on the boto3 redshift client, which is a set of functions for managing the Redshift cluster and does not provide any data access. RedshiftDataHook is based on the boto3 redshift-data client, which is a set of functions for accessing Redshift data. I don't believe we have any other examples where a single AWS hook wraps two different boto3 clients, so that would dictate having two different hooks. Further, the Redshift Data API has different IAM policy definitions, so someone with access to data may not have access to cluster management, and vice versa.
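The IAM split described above can be illustrated with two minimal policy documents. The action names (`redshift:*` for cluster management, `redshift-data:*` for the Data API) are real IAM action namespaces; the policy contents themselves are just an illustrative sketch, not anything from the PR.

```python
# Illustrative only: the two boto3 clients map to different IAM action
# namespaces, so a principal can be granted one without the other.
# The documents below follow the standard IAM JSON policy format.

cluster_management_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Actions served by the "redshift" client (cluster management)
            "Action": ["redshift:DescribeClusters", "redshift:ModifyCluster"],
            "Resource": "*",
        }
    ],
}

data_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Actions served by the "redshift-data" client (data access)
            "Action": [
                "redshift-data:ExecuteStatement",
                "redshift-data:GetStatementResult",
            ],
            "Resource": "*",
        }
    ],
}
```

A user attached only to `data_access_policy` could run queries but not resize or describe the cluster, which is the separation the comment points out.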
In that case, to avoid confusion, we should separate this into a different file, e.g. let's create operators for each of those uses where we used the Python operator, and also add documentation on how to use the hooks/operators. This will improve the user experience.
OK, let me get working on that. Thanks!
Another thing that's required here is a system test. That would help us to verify that the example works as expected.
See https://github.com/apache/airflow/blob/main/tests/providers/amazon/aws/operators/test_ecs_system.py for an example system test.
Here too is more information about a system test: https://github.com/apache/airflow/blob/main/TESTING.rst#airflow-system-tests
    :type with_event: bool
    """
    """only provide parameter argument if it is valid"""
"""only provide parameter argument if it is valid""" | |
# only provide parameter argument if it is valid |
We may not need this comment though
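The "only provide parameter argument if it is valid" logic discussed above can be sketched independently of the PR's actual code: boto3's `execute_statement` rejects explicit `None` values for optional fields, so the optional kwargs are included only when set. All names below are illustrative, not the hook's real signature.

```python
from typing import List, Optional

def build_execute_statement_kwargs(
    cluster_identifier: str,
    database: str,
    sql: str,
    db_user: Optional[str] = None,
    parameters: Optional[List[dict]] = None,
) -> dict:
    """Assemble ExecuteStatement kwargs, skipping unset optional arguments."""
    kwargs = {
        "ClusterIdentifier": cluster_identifier,
        "Database": database,
        "Sql": sql,
    }
    # Only provide the optional arguments if they are valid (non-None),
    # since boto3 rejects explicit None values for these fields.
    if db_user is not None:
        kwargs["DbUser"] = db_user
    if parameters is not None:
        kwargs["Parameters"] = parameters
    return kwargs
```

With this shape, the call site stays a single `client.execute_statement(**kwargs)` regardless of which optional arguments were supplied.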
Done
Looks not done yet
Yeah, likely missed during rebase.
Is this something that you are still working on @john-jac ?
Yes it’s still on my list. There were a number of additions requested. Some I’ve addressed in my comments; the rest I hope to add this week.
    :type autocommit: bool
    """

    template_fields = ('sql',)
Any other fields to add here that users might want to have flexibility to dynamically generate a value for? Maybe parameters, cluster_identifier, database, and/or db_user. The latter 3 could maybe be added to a Connection Extra and accessed with Jinja now that the Connection object is accessible in the template context like "{{ conn.conn_id.extra_dejson.<attr> }}" when calling the operator. No strong opinions though, just something to think about.
Yeah. The more template_fields the better. It has pretty much no side effects but makes your operators much more flexible and allows you to build much more sophisticated cases.
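The template_fields idea above can be sketched without importing Airflow: an operator-like class declares which attributes are templated, and a renderer substitutes context values into exactly those fields. The class name, the naive string-replace renderer, and the field list (mirroring the suggested parameters, cluster_identifier, database, db_user) are all illustrative; Airflow's real renderer uses Jinja2.

```python
class RedshiftDataOperatorSketch:
    # Hypothetical field list following the review suggestion; the real
    # operator's template_fields may differ.
    template_fields = ("sql", "parameters", "cluster_identifier", "database", "db_user")

    def __init__(self, **kwargs):
        for name in self.template_fields:
            setattr(self, name, kwargs.get(name))

    def render_template_fields(self, context: dict) -> None:
        # Very simplified stand-in for Airflow's Jinja2 rendering:
        # replace "{{ key }}" tokens in string-valued template fields only.
        for name in self.template_fields:
            value = getattr(self, name)
            if isinstance(value, str):
                for key, replacement in context.items():
                    value = value.replace("{{ " + key + " }}", str(replacement))
                setattr(self, name, value)

op = RedshiftDataOperatorSketch(sql="SELECT * FROM {{ table }}", database="dev")
op.render_template_fields({"table": "users"})
```

The point of widening `template_fields` is exactly this: any field listed there can carry a Jinja expression at DAG-definition time and be resolved per task instance.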
@ephraimbuddy I don't see other similar operators, such as sagemaker or glue, using system tests. Are we sure this is necessary for redshift-data?
Agree that some other operators don't have it, but this will really help us to validate your example DAG and the operator. No strong opinion though.
OK, as a best practice I will add it. Thanks!
After #20276
It is different. See the description in the comments above. Very sorry for the delay. The additional tests requested are still to be coded.
I think this one needs to be rebased. It's 73 commits behind.
Should be up to date now.
FYI - Now that #20951 is merged, you don't need to specify the Sphinx
before this is merged can we get an agreement on authorship? it was opened by @john-jac but @vincbeck is taking over the work. only one can be author and one must be co-author. i am thinking that author should be @vincbeck and co-author @john-jac because it seems that probably more than 50% of the effort will have landed on @vincbeck by the time this is merged (and seeing it through to the end can be somewhat harder than getting it started). if you both agree, @vincbeck please rebase / squash as necessary to ensure that you have the first commit in this branch this way when we finally squash and merge we can ensure appropriate attribution. unless you don't care in which case you may leave it and you'll end up as co-author. thanks
I agree @vincbeck should be the author as he is pushing it to completion.
Force-pushed from 3a3c542 to 64b9b64
I squashed all the commits into one commit and added @john-jac as co-author. Everything should be fine.
There are some static check failures to fix. |
Force-pushed from c821485 to 8afcea5
Force-pushed from b5bed88 to 5f10eba
Use the AWS Boto3 library to query and retrieve data via the AWS API and avoid the necessity of a Postgres connection. Co-authored-by: john-jac <[email protected]>
My bad. It's now fixed.
😄 thanks @dstandish !
spoke too soon apparently, had to make one small rename (done it, don't worry) |
Airflow users wishing to trigger Amazon Redshift queries today have to use a Postgres connection. While effective, this has the limitation that the Redshift cluster must expose a Postgres endpoint that is accessible to the Airflow environment.
As an alternative, the AWS Boto3 library exposes a client called redshift-data that allows users to query and retrieve data via the AWS API and avoid the necessity of a Postgres connection.
This PR adds a RedshiftDataHook to the existing Redshift provider, along with tests and an example, which wraps the "redshift-data" Boto3 client and allows users to run Redshift queries without a Postgres connection.
In the future, this capability could be added to any existing Redshift transfer operator, with an option to auto-detect whether to use Postgres or Boto3 based upon the connection type.
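The call flow the new hook wraps can be sketched with a stub standing in for `boto3.client("redshift-data")`, so it runs without AWS credentials. The three operations (ExecuteStatement, DescribeStatement, GetStatementResult) are real redshift-data API operations; the stub's canned responses and the helper function are invented for illustration and are not the hook's actual implementation.

```python
class StubRedshiftDataClient:
    """Stand-in for boto3.client("redshift-data") with canned responses."""

    def execute_statement(self, **kwargs):
        return {"Id": "stmt-1"}

    def describe_statement(self, Id):
        return {"Id": Id, "Status": "FINISHED"}

    def get_statement_result(self, Id):
        return {"Records": [[{"longValue": 1}]]}

def run_redshift_query(client, cluster_identifier, database, sql):
    """Submit a statement, check its status, and fetch the result."""
    statement_id = client.execute_statement(
        ClusterIdentifier=cluster_identifier, Database=database, Sql=sql
    )["Id"]
    # Real code would poll describe_statement with backoff until the
    # statement leaves the SUBMITTED/STARTED states; the stub finishes
    # immediately.
    status = client.describe_statement(Id=statement_id)["Status"]
    if status != "FINISHED":
        raise RuntimeError(
            "statement %s ended with status %s" % (statement_id, status)
        )
    return client.get_statement_result(Id=statement_id)["Records"]

records = run_redshift_query(StubRedshiftDataClient(), "my-cluster", "dev", "SELECT 1")
```

Because every call goes through the AWS API rather than a database socket, the Airflow environment only needs network access to the AWS endpoints and the right IAM permissions, which is exactly the limitation of the Postgres-connection approach that this PR removes.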