
Extract and send events with raw sql from spark. #2913

Merged

Merged 1 commit into OpenLineage:main on Aug 20, 2024

Conversation


@Imbruced Imbruced commented Aug 5, 2024

Problem

Currently, for Spark applications we don't send any information about the SQL query that was used. For example,

spark.sql(
    "SELECT id FROM table"
).save(...)

will track that we used table and saved it somewhere, but the raw SQL

"SELECT id FROM table"

won't be recorded in any event.

Closes: #ISSUE-NUMBER

Solution

We can look into the QueryExecution children until we find the sqlQuery field. The downside of using a BFS algorithm for this is that we keep only one query. So in a case like

spark.sql("SELECT 1 as id").createOrReplaceTempView("tab1")
spark.sql("SELECT 2 as id").createOrReplaceTempView("tab2")

spark.sql("select * from tab1 union all select * from tab2")

we store only the query select * from tab1 union all select * from tab2.

That's a limitation of the SQL query facet, where we can store only one value.

The field we are looking for is available only in Spark 3.3 and above.

One-line summary:

Traverse QueryExecution to get the sql field with a BFS algorithm.
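The traversal described above can be sketched as a plain breadth-first search over plan nodes. This is a minimal, hypothetical sketch (the `PlanNode` class and its `sql` attribute are illustrative stand-ins, not the actual Spark `QueryExecution` API):

```python
from collections import deque

class PlanNode:
    """Hypothetical stand-in for a Spark QueryExecution/LogicalPlan node."""
    def __init__(self, sql=None, children=()):
        self.sql = sql
        self.children = list(children)

def find_sql(root):
    """Breadth-first search for the first node carrying a raw SQL string.

    Mirrors the limitation described above: the first match wins, so only
    one query is ever returned even if several nodes carry SQL.
    """
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.sql is not None:
            return node.sql
        queue.extend(node.children)
    return None
```

The first-match behavior is exactly why, in the union-all example above, only the outermost query survives.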

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@boring-cyborg boring-cyborg bot added area:integration/spark area:tests Testing code language:java Uses Java programming language labels Aug 5, 2024
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch 2 times, most recently from 69d029b to cbb5e15 Compare August 7, 2024 08:56
@Imbruced Imbruced marked this pull request as ready for review August 7, 2024 10:59
@Imbruced Imbruced requested a review from a team as a code owner August 7, 2024 10:59
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch from 246e79a to 36d8e5d Compare August 7, 2024 14:44
@dolfinus
Contributor

dolfinus commented Aug 9, 2024

Please update PR description and title

@Imbruced Imbruced changed the title Feature/spark/sql extract Extract and send events with raw sql from spark. Aug 12, 2024
@mobuchowski mobuchowski (Member) left a comment


I only did a high-level review, with two items, because IMO we have to rethink those before moving forward with the PR.

@@ -131,6 +131,28 @@ public void end(SparkListenerApplicationEnd applicationEnd) {
eventEmitter.emit(event);
}

@Override
public void sqlQuery(String query) {

I don't think it makes sense to put that here.

First, there can be multiple Spark SQL queries tied to a single "application" (script). Think of just

df = spark.sql("select * from a");
df2 = spark.sql("select * from b");
df.join(df2, ...).write() ...

In this case, you'd get two events here with the application run id, without the ability to tie them to particular Spark actions.

I think the better idea is to check whether we can get something like an executionId from the Spark session, and store the SQL in a map keyed by this executionId. Then, attach the SqlJobFacet to the next (all) COMPLETE and RUNNING events sent with that executionId.
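The reviewer's suggestion above can be sketched as a small cache keyed by executionId. This is a hypothetical illustration, not the OpenLineage implementation; the function names, the event dict shape, and the "sql" facet key are assumptions:

```python
# Hypothetical sketch: remember raw SQL per executionId, then attach it
# to every event later emitted for that same execution.
sql_by_execution_id = {}

def on_sql_query(execution_id, query):
    # Called when a raw SQL string is observed; cache it per execution.
    sql_by_execution_id[execution_id] = query

def enrich_event(execution_id, event):
    # Attach the cached SQL (as an illustrative "sql" job facet) to any
    # RUNNING/COMPLETE event carrying the same executionId.
    query = sql_by_execution_id.get(execution_id)
    if query is not None:
        event.setdefault("job_facets", {})["sql"] = {"query": query}
    return event
```

Keying by executionId is what lets the two `spark.sql(...)` calls in the reviewer's example stay distinguishable, instead of producing two events tied only to the application run id.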

@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch 2 times, most recently from e4853c0 to a3cdc67 Compare August 13, 2024 11:03
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch 2 times, most recently from 93c91a5 to e04c9e3 Compare August 19, 2024 08:01
@mobuchowski mobuchowski self-requested a review August 20, 2024 09:41
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch from e04c9e3 to 1443347 Compare August 20, 2024 10:22
@mobuchowski mobuchowski merged commit 9046ff1 into OpenLineage:main Aug 20, 2024
45 checks passed