feat: Add CLL to OpenLineage in BigQueryInsertJobOperator #44872
Merged
mobuchowski
approved these changes
Dec 16, 2024
Looks good - one question that would be nice to answer.
providers/src/airflow/providers/google/cloud/openlineage/mixins.py
The PR is ready to be merged; an unrelated flaky test is causing TestLocalExecutor to time out.
Signed-off-by: Kacper Muda <[email protected]>
mobuchowski
approved these changes
Dec 30, 2024
LefterisXefteris
pushed a commit
to LefterisXefteris/airflow
that referenced
this pull request
Jan 5, 2025
agupta01
pushed a commit
to agupta01/airflow
that referenced
this pull request
Jan 6, 2025
HariGS-DB
pushed a commit
to HariGS-DB/airflow
that referenced
this pull request
Jan 16, 2025
got686-yandex
pushed a commit
to got686-yandex/airflow
that referenced
this pull request
Jan 30, 2025
BigQueryInsertJobOperator already supports OpenLineage for QUERY type jobs, but it lacks Column-Level Lineage (CLL).
This PR adds CLL to this operator based on SQL parsing, which can be useful in straightforward scenarios. However, SQL parsing alone does not always provide all the details (e.g. a SQL query can reference a table by table name only, or as dataset.table without the project_id), so checks have been implemented to ensure the lineage is accurate. As a result, CLL may be omitted when there is uncertainty about its correctness.
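The kind of check described above can be illustrated with a minimal sketch. This is not the code from this PR: the helper names `qualify_table_name` and `cll_is_trustworthy`, the default project/dataset arguments, and the idea of comparing parser output against a list of tables known from the job are all assumptions made for illustration only.

```python
from __future__ import annotations


def qualify_table_name(
    name: str, default_project: str | None, default_dataset: str | None
) -> str | None:
    """Return 'project.dataset.table', or None if the name cannot be resolved.

    Hypothetical helper. Splitting on '.' is a simplification that ignores
    quoting and unusual project IDs.
    """
    parts = name.replace("`", "").split(".")
    if len(parts) == 3:
        return ".".join(parts)
    if len(parts) == 2 and default_project:
        return f"{default_project}.{parts[0]}.{parts[1]}"
    if len(parts) == 1 and default_project and default_dataset:
        return f"{default_project}.{default_dataset}.{parts[0]}"
    return None


def cll_is_trustworthy(
    parsed_tables: list[str],
    known_tables: list[str],
    default_project: str | None,
    default_dataset: str | None,
) -> bool:
    """Accept CLL only if every parsed table resolves to a known table."""
    known = set(known_tables)
    for name in parsed_tables:
        qualified = qualify_table_name(name, default_project, default_dataset)
        if qualified is None or qualified not in known:
            # Any ambiguity means the column lineage could be wrong: skip it.
            return False
    return True
```

With this sketch, a parsed reference such as `b.c` is accepted only if a default project is available and the resulting name matches a table already known for the job; a bare `c` with no default dataset leads to CLL being skipped.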
There is another change unrelated to CLL: currently the output table is duplicated in the input tables. We build the list of input tables from the `referencedTables` property provided by Google, and as it turns out, this also includes the destination table. For example, the query `INSERT INTO a.b.c VALUES (1, "a", 23)` would report `a.b.c` as both an input table and an output table.
This PR fixes that by removing the output table from the input tables. I am not sure this is the correct approach, since users may sometimes write a query that moves data from a table back into the same table, but I think this is rare, and this kind of lineage information (from A to A) does not provide much value. Please let me know if you think I'm wrong.
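The deduplication described here could look roughly like the sketch below. It is an illustration rather than the mixin's actual code: the function names `table_id` and `split_inputs_and_output` are hypothetical, and the input dictionaries are assumed to mirror the `statistics.query` and `configuration.query` parts of a BigQuery job resource.

```python
from __future__ import annotations


def table_id(ref: dict) -> str:
    """Build 'project.dataset.table' from a BigQuery table reference dict."""
    return f"{ref['projectId']}.{ref['datasetId']}.{ref['tableId']}"


def split_inputs_and_output(
    query_stats: dict, query_config: dict
) -> tuple[list[str], str | None]:
    """Return (input tables, output table), dropping the output from the inputs.

    Hypothetical helper: `query_stats` stands in for the job's query statistics
    and `query_config` for the job's query configuration.
    """
    destination = query_config.get("destinationTable")
    output = table_id(destination) if destination else None

    inputs: list[str] = []
    for ref in query_stats.get("referencedTables", []):
        name = table_id(ref)
        # Skip the destination table and any repeated references.
        if name != output and name not in inputs:
            inputs.append(name)
    return inputs, output
```

One consequence of this approach, as noted above, is that a query reading from and writing to the same table reports no self-referencing input.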
I also refactored the mixin a bit to make it clearer and to prepare for adding support for job types other than QUERY. I also changed the class name: it was originally meant to be a general mixin, but BigQueryInsertJobOperator is so complex that this mixin will only be used with that class.