-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify DbApiHook.run() method with the methods which override it #23971
Conversation
Check #23112 maybe cooporating a fix to that as well can give a more complete solution The goal is to avoid all the custom logic by each provider |
1a41803
to
3f9697c
Compare
Note about changes to DbapiHook. |
6694bb5
to
f1cc33f
Compare
d4f5218
to
d653abd
Compare
What is the current blocker to move forward with this PR? |
d653abd
to
d0fa202
Compare
d0fa202
to
3a073ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tink this work should be connected with #24422 #24476 and any unification should be done there.
This will have the added benefit that "Core SQL" provider might be released independently from Airlfow and newer versions of it can be made prerequisites for the updated "SQL" providers - that will have the "core sql" provider as dependency. I think we should simply deprecate the DBApi and move all the unification effort there.
bb832f3
to
c410b5f
Compare
First - general remarrk. I am not debating with your staments and reminders above. Yeah. Things were happening and we were thiking and discussing various ideas while others were being implemented. And from those discussions, it seems that the "being implemented" change is better to be delayed and reworked and likely incorporated in the other work. Unfortunately, this happpens. Sometimes there are ideas that comes after and result in much better outcome that the first one. And it's pretty normal to redo parts of all of the change when we find a better one. This is normal. We aalso should think about the code as liability rather than asset. Sometimes iterating and improving on a PR leads to coming to better idea, which is result of the learnign during a process. There is much bigger value in an idea done right rather than one that implements a worse solution, even if is already wrritten. BTW. It happened many time for my and other's changes, and if only that results in something that is better for the community - this is perfectly ok, normal and even good. There is also a fallacy of "sunken cost" that you are falling in. The fact that a lot of effort has been invested in something, does not make it any more valuable than alternative solution. If anyone spent a lot of time on something, It does not mean that wee should furher invest in it, if we know we have better solution. The cost is "sunk" already. If the solution is better - we should go for it. If alternative is better - we should go for the alternative. And often that cost is not "totally sunk" - more often than not even tf the code cannot be directly "moved", the learnings from it can and reimplementting same thing using another approach and base takes a fraction of the originally invested time, because the learning/discovery has been already done and the original code can be used as a "blueprint" for reapplying it. If we know better approach the cost that has been done alredy for this PR should not matter. We should simply make better decisions based on the current understanding of a problem. Time spent on doing any solution already shoudl not matter at all - in that decision. Only whether which of the solution is better. Re - 'core sql': We had already a number of problems with DBApi being in core of Airlfow and especially with any efforts there which impacted multiple providers implementing it. We have learned that with the current approach where we have providers released separately from the core, keeping DBApi in "airflow" package has more costs than benefits. The thinking (that was born after you started working on a change and which you change sparked actually - seeing the consequences it cause) is that this approach is not sustainable. For any unification / change that uses current DB Api we will only start reaping any real benefits of in ~ 14 months - i..e about 12 months after 2.4.0 of airflow is released, because only then, any of the code in providers that will only work with the current version of DBApi (2.3 one) can be dropped. And for all 14 months we willl have to live with having to keep backwards compatibility and it will mean that we will not be able to fully utilise the benefits of such unification for more than a year. On the other hand (and this what was the actual driver behind #24422 and #24476 is) - we have a much bigger need to unify the DB API even further. At Airflow Summit we spoke a lot about Lineage and in order to allow column based lineage and generally SQL-lineage, we need to find a better way to make sure that all community-based providers can utilise more "unified" and "fully-featured" DBApi equivalent, that will be able to evolve together with the providers. Thus the idea of 'core.sql' was born. When 'core.sql' is implemented, we will be able to release any DB-API related unifications and changes independently from Airflow releases. This means than 3 months from now we will be able to release all DB providers with new features that anyone will be able to use in Airflow 2.2, 2.3. and that will provide much better lineage integration (and we can deprecate teh DBApi inside Airflow). We will not have to wait 14 months. This is a major win for the community - even if it means that you will have to redo all or parts of your PR. Would you want @kazanzhy to implement your change knowing that it's going to be deprecated almost immediately after and will never be used in reality because we will replace the DB Api with Core.sql provider? Do you think it makes sense? Why would you want to do it? |
And @kazanzhy -> the core.sql provider is already merged yesterday - so what is REALLY the best you can do now is to rebase your changes on top of it and move the DBApiHook there and deprecate the DBApiHook in core airflow - this is absolutely the best we can do now I think and it will make it sooooooo much easier to add new features to the DB related providers. This will not even be a huge change for you I believe. |
There was a bug in an incoming change to common-sql provider introduced in apache#23971 where `;-less` statements were removed when "split_statements" flag was used. Since this flag is used by default in Databricks statement, it introduced backwards incompatible change.
There was a bug in an incoming change to common-sql provider introduced in #23971 where `;-less` statements were removed when "split_statements" flag was used. Since this flag is used by default in Databricks statement, it introduced backwards incompatible change.
After this update, every statement using OracleOperator now fails,
OracleOperator using only single query works, but all operators using block BEGIN now fails with similar message like this:
|
Update with
In the Dag
Still, I am not able to run multiple SQL statements in a single file using RedshiftSQLOperator
|
@hewerthomn @KarthikRajashekaran -> can you please open separate issues for that with all details there? |
Hi @hewerthomn. Sorry for the inconvenience. |
@KarthikRajashekaran it's correct. By this PR we added the ability to
|
@hewerthomn - can you please install common-sql 1.1.0 provider and check if it is fixed ? |
Hi @potiuk and @kazanzhy , I added the package apache-airflow-providers-common-sql==1.1.0, but same errors continue, Log
|
This looks like a PL/SQL not SQL. I guess if you want it to work with Sql then you should wrap your statments with store procedure and then invoke it by SELECT statment. |
I see... So it will stop working with pl/SQL now? If I need to write procedures for all my operators I will use previous version of package instead for a while Thx |
I think this is a bug really @eladkal @kazanzhy - especially if it worked before. I don't think we have any limitation and whether we care if the statement we send is SQL or not in most cases. If it can be sent through DB API and server processes it, it should be fine. Why would we care if it is a select query or PL/SQL? I think the problem in this case is that we strip the |
@hewerthomn can you please create a separate issue and publish the query that worked before? |
@kazanzhy yes, I can post the issue. I was using OracleOperator |
I use sql="DECLARE ..." |
Because PL/SQL has complex multipule statments that can not be splitted. I also think we adressed this already? |
@hewerthomn |
@eladkal ,
|
Yeah. I am not talking about "split_statement". I know it's hard to split such PL/SQL, but I believe the problem is it does not work also when we set split_statement = False. I think the problem is that we currently ALWAYS remove the closing
strip_sql_string:
The result is that If you pass "run" method a statement with ";" at the end it will be converted into one-element array of statements and the Now the question is @kazanzhy - why do we remove that So I think we should have flexibility on whether to remove the semicolon or not. |
So I need to wait for #25717 to get the redshift to execute multiple statements? |
|
Now we have DbApiHook.run() method that is used by many other hooks.
Also there are following hooks that overrides this method:
I did all possible to make them as similar as possible, but some of them have peculiarities:
pyexasol
which doesn't have a reusable cursorhql
parameter and runs_strip_sql(sql)
hql
parameter and runs_strip_sql(sql)