-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: automatically inject OL transport info into spark jobs #45326
Merged
mobuchowski
merged 1 commit into
apache:main
from
kacpermuda:feat-ol-inject-transport-dataproc-job
Jan 9, 2025
Merged
feat: automatically inject OL transport info into spark jobs #45326
mobuchowski
merged 1 commit into
apache:main
from
kacpermuda:feat-ol-inject-transport-dataproc-job
Jan 9, 2025
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
d9a52b6
to
78b57e4
Compare
78b57e4
to
9f45650
Compare
9f45650
to
6a85faa
Compare
Hey @kacpermuda -> I rebased your PR -> we found and issue with @jscheffl with the new caching scheme - fixed in #45347 that would run "main" version of the tests-> so I am rebasing all PRs affected :) |
1523cd9
to
4f46713
Compare
mobuchowski
approved these changes
Jan 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nit, looking good otherwise.
providers/src/airflow/providers/google/cloud/openlineage/utils.py
Outdated
Show resolved
Hide resolved
4362604
to
a3a62c4
Compare
Signed-off-by: Kacper Muda <[email protected]>
a3a62c4
to
f281ddf
Compare
agupta01
pushed a commit
to agupta01/airflow
that referenced
this pull request
Jan 13, 2025
…45326) Signed-off-by: Kacper Muda <[email protected]>
karenbraganz
pushed a commit
to karenbraganz/airflow
that referenced
this pull request
Jan 13, 2025
…45326) Signed-off-by: Kacper Muda <[email protected]>
HariGS-DB
pushed a commit
to HariGS-DB/airflow
that referenced
this pull request
Jan 16, 2025
…45326) Signed-off-by: Kacper Muda <[email protected]>
got686-yandex
pushed a commit
to got686-yandex/airflow
that referenced
this pull request
Jan 30, 2025
…45326) Signed-off-by: Kacper Muda <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area:providers
kind:documentation
provider:common-compat
provider:google
Google (including GCP) related issues
provider:openlineage
AIP-53
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Similar to #44477 , this PR introduces a new feature to OpenLineage integration. It will NOT impact users that are not using OpenLineage or have not explicitly enabled this feature (False by default).
TLDR;
When explicitly enabled by the user for supported operators, we will automatically inject transport information into the Spark job properties. For example, when submitting a Spark job using the DataprocSubmitJobOperator, we will configure Spark/OpenLineage integration to use the same transport configuration that Airflow integration uses.
Why ?
Currently, this process requires manual configuration by the user, as described here. E.g.:
Understanding how various Airflow operators configure Spark allows us to automatically inject transport information.
Controlling the Behavior
We provide users with a flexible control mechanism to manage this injection, combining per-operator enablement with a global fallback configuration. This design is inspired by the
deferrable
argument in Airflow.Each supported operator will include an argument like
ol_inject_transport_info
, which defaults to the global configuration value ofopenlineage.spark_inject_transport_info
. This approach allows users to:This design ensures both flexibility and ease of use, enabling users to fine-tune their workflows while minimizing repetitive configuration. I am aware that adding an OpenLineage-related argument to the operator will affect all users, even those not using OpenLineage, but since it defaults to False and can be ignored, I hope this will not pose any issues.
How?
The implementation is divided into three parts for better organization and clarity:
Operator's Code (including the
execute
method):Contains minimal logic to avoid overwhelming users who are not actively working with OpenLineage.
Google's Provider OpenLineage Utils File:
Handles the logic for accessing Spark properties specific to a given operator or job.
OpenLineage Provider's Utils:
Responsible for creating / extracting all necessary information in a format compatible with the OpenLineage Spark integration. We are also performing modifications to the Spark properties here.
For some operators parts 1 and 2 may be in the operator's code. In general, the specific operator / provider will know how to get the spark properties and the OL will know what to inject and do the injection itself.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.