Add deferrable big query operators and sensors #26156

Merged: 8 commits into apache:main, Sep 8, 2022

phanikumv
Contributor

@phanikumv phanikumv commented Sep 5, 2022

This PR donates the following deferrable BigQuery operators and sensors, developed in the astronomer-providers repo, to Apache Airflow.

  • BigQueryInsertJobAsyncOperator
  • BigQueryCheckAsyncOperator
  • BigQueryGetDataAsyncOperator
  • BigQueryIntervalCheckAsyncOperator
  • BigQueryValueCheckAsyncOperator
  • BigQueryTableExistenceAsyncSensor

cc @kaxil



@lwyszomi
Contributor

lwyszomi commented Sep 7, 2022

Can we merge these operators with the existing ones and add a parameter to run them in deferrable mode? We already did this for the Cloud Composer and Dataproc operators. In my opinion we should have one standard for adding deferrable operators. Most of the logic in the new operators is the same as in the non-deferrable ones, which will be a nightmare to maintain: if there is a bug in one, there is a high probability that the other has it too.

Also, the non-deferrable operators already have links to the resources in the GCP platform, so it would be good to have them in the deferrable operators as well.

@bhirsz
Contributor

bhirsz commented Sep 7, 2022

Thanks for the PR! Apologies for the commotion caused (there were several PRs on the topic: one a blatant copy, the other an updated version); we were not fully aware of it. I had commented on both PRs, so I re-commented there since the comments also apply (the code is similar or the same, after all).

@kaxil
Member

kaxil commented Sep 7, 2022

> Can we merge these operators with the existing ones and add a parameter to run them in deferrable mode? We already did this for the Cloud Composer and Dataproc operators. In my opinion we should have one standard for adding deferrable operators. Most of the logic in the new operators is the same as in the non-deferrable ones, which will be a nightmare to maintain: if there is a bug in one, there is a high probability that the other has it too.
>
> Also, the non-deferrable operators already have links to the resources in the GCP platform, so it would be good to have them in the deferrable operators as well.

We don't need to merge them into one. We can inherit from the parent operator and reuse its logic where needed:

  1. Merging them makes things difficult for sensors, where we already have "poke" and "reschedule" modes. Does that mean we add a new "deferrable" mode for sensors, while operators just get "deferrable" as a kwarg?
  2. Not all async operators will be drop-in replacements; some will have limitations compared to their sync counterparts. For example, SimpleHTTPOperator supports a callback, whereas the async operators can't support that.
  3. It makes tracking the adoption of async operators a little trickier.
  4. Debugging becomes more difficult.
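
As an illustration of the two designs being debated, here is a minimal sketch. It deliberately does not use Airflow: `BaseJobOperator`, `JobOperatorWithFlag`, `JobAsyncOperator`, and the string return values are hypothetical stand-ins, chosen only to show the shape of each approach, not the code in this PR.

```python
class BaseJobOperator:
    """Hypothetical stand-in for an existing synchronous operator."""

    def __init__(self, job_id: str):
        self.job_id = job_id

    def submit(self) -> str:
        # Shared logic that both designs want to reuse.
        return f"submitted:{self.job_id}"

    def execute(self) -> str:
        # Synchronous path: submit, then block in the worker until done.
        return f"{self.submit()}|waited-in-worker"


class JobOperatorWithFlag(BaseJobOperator):
    """Design A: a single operator with a `deferrable` keyword argument.

    Modelled as a subclass here only to keep the sketch short; in practice
    the flag would live on the existing operator itself.
    """

    def __init__(self, job_id: str, deferrable: bool = False):
        super().__init__(job_id)
        self.deferrable = deferrable

    def execute(self) -> str:
        if self.deferrable:
            return f"{self.submit()}|deferred-to-trigger"
        return super().execute()


class JobAsyncOperator(BaseJobOperator):
    """Design B: a separate async class that inherits the shared logic."""

    def execute(self) -> str:
        return f"{self.submit()}|deferred-to-trigger"
```

Both designs reuse `submit`; they differ in whether the user opts in through an argument or through a different class name.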

@lwyszomi
Contributor

lwyszomi commented Sep 7, 2022

@kaxil In my opinion, we should agree on one structure for the operators and keep it consistent in the future, for a better user experience and a smoother development process.

Ad 1: Was the point of adding deferrable operators to stop using sensors? The async sensors added in this PR look like normal operators that confirm something was created correctly, but I think that check should be part of the async operator. Maybe I'm missing something.

For example, BigQueryTableExistenceAsyncSensor: do we need that kind of sensor? In my opinion we should consider extending the operator that creates a table and adding a deferrable mode there.

Ad 2: I'm still learning what operators we have and I didn't know about this, but we can probably work out a good solution for those operators.

Ad 3: Why? If an operator has deferrable in its kwargs, we know that it supports this mode. It is simple to track, and simple for customers to migrate to this approach.

Ad 4: For some operators I agree it can be a little tricky; a good example is BigQueryCheckAsyncOperator versus BigQueryCheckOperator.

@phanikumv phanikumv force-pushed the astro_bigquery branch 6 times, most recently from 8fc7cf6 to 398067f, September 8, 2022 11:55
@kaxil
Member

kaxil commented Sep 8, 2022

> Ad 1: Was the point of adding deferrable operators to stop using sensors? The async sensors added in this PR look like normal operators that confirm something was created correctly, but I think that check should be part of the async operator. Maybe I'm missing something.

We added deferrable operators for two cases: 1) sensors (e.g. waiting for a file to become available on S3) and 2) operators that do some kind of polling (e.g. BigqueryOperator, which just submits an API request and waits until the query has finished executing). They were not built to replace sensors but to make them more efficient. So we can say the new async sensors and async operators are better than their counterparts.
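
To make the "submit, then wait" pattern concrete, here is a minimal sketch using plain `asyncio`. It assumes nothing about Airflow's trigger API: `FakeJob`, `wait_for_job`, and the status strings are hypothetical, and `asyncio.sleep` stands in for a non-blocking call to a remote service.

```python
import asyncio


class FakeJob:
    """Hypothetical job that reports DONE after a fixed number of status checks."""

    def __init__(self, checks_until_done: int):
        self._remaining = checks_until_done

    async def status(self) -> str:
        await asyncio.sleep(0)  # stands in for a non-blocking API call
        self._remaining -= 1
        return "DONE" if self._remaining <= 0 else "RUNNING"


async def wait_for_job(job: FakeJob, poll_interval: float = 0.01) -> int:
    """Poll until the job is done; return how many status checks were made."""
    checks = 0
    while True:
        checks += 1
        if await job.status() == "DONE":
            return checks
        # Yielding here lets the event loop serve other waiting tasks.
        await asyncio.sleep(poll_interval)


async def main() -> list:
    # Several waits share one event loop instead of each holding a worker.
    jobs = [FakeJob(n) for n in (1, 2, 3)]
    return await asyncio.gather(*(wait_for_job(job) for job in jobs))


results = asyncio.run(main())
```

The efficiency argument is that the waiting happens in one shared event loop, so an idle wait does not hold a worker slot.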

> For example, BigQueryTableExistenceAsyncSensor: do we need that kind of sensor? In my opinion we should consider extending the operator that creates a table and adding a deferrable mode there.

Sensors are a special kind of operator, used when you want to perform a task after a certain condition is met, so we can't remove that sensor, as I am not sure we can identify everything users might be using it for.

> Ad 2: I'm still learning what operators we have and I didn't know about this, but we can probably work out a good solution for those operators.

Yeah, not all client libraries officially support async either, so we need to use third-party libraries that don't have full support. So while the idea is that they are ideally directly replaceable, that might not be the case every time. For example, for SimpleHTTPOperator, the requests library doesn't support async, so we need to use a different library like aiohttp.
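
A minimal sketch of why a blocking client is a problem inside an event loop. No HTTP is performed here: `time.sleep` stands in for a blocking client call and `asyncio.sleep` for an async one, and `ticker` and `ticks_during` are hypothetical helpers that count how often other work gets to run in the meantime.

```python
import asyncio
import time


async def ticker(counter: dict) -> None:
    """Background task standing in for other work sharing the event loop."""
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(0.001)


async def ticks_during(call) -> int:
    """Count how often the ticker runs while `call` is in progress."""
    counter = {"ticks": 0}
    task = asyncio.create_task(ticker(counter))
    await asyncio.sleep(0)  # give the ticker a chance to start
    before = counter["ticks"]
    await call()
    after = counter["ticks"]
    task.cancel()
    return after - before


async def blocking_call() -> None:
    time.sleep(0.05)  # blocks the whole event loop, like a sync client would


async def non_blocking_call() -> None:
    await asyncio.sleep(0.05)  # yields to the loop, like an async client would


async def main() -> tuple:
    blocked = await ticks_during(blocking_call)
    free = await ticks_during(non_blocking_call)
    return blocked, free


blocked_ticks, free_ticks = asyncio.run(main())
```

While the blocking call runs, the ticker makes no progress; during the awaitable call it keeps running.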

@kaxil kaxil merged commit f938cd4 into apache:main Sep 8, 2022
@kaxil kaxil deleted the astro_bigquery branch September 8, 2022 21:17
@kaxil
Member

kaxil commented Sep 8, 2022

Let's make any changes in a follow-up PR
