
Extract and send events with raw sql from spark. #2913

Merged

Merged 1 commit into OpenLineage:main on Aug 20, 2024

Conversation


@Imbruced Imbruced commented Aug 5, 2024

Problem

Currently, for Spark applications we don't send any information about the SQL query that was used. For example,

spark.sql(
    "SELECT id FROM table"
).save(...)

will track that we used table and saved it somewhere, but the raw SQL

"SELECT id FROM table"

won't be recorded in any event.

Closes: #ISSUE-NUMBER

Solution

We can look into the QueryExecution children until we find the sqlQuery field. The downside of using a BFS algorithm for this is that we keep only one query. So in a case like

spark.sql("SELECT 1 as id").createOrReplaceTempView("tab1")
spark.sql("SELECT 2 as id").createOrReplaceTempView("tab2")

spark.sql("select * from tab1 union all select * from tab2")

we store only the query select * from tab1 union all select * from tab2.

That's a limitation of the SQL query facet, where we can store only one value.

The field we are looking for is available only in Spark 3.3 and above.

One-line summary:

Traverse QueryExecution to get the sql field with a BFS algorithm.
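The traversal described above can be sketched as a plain breadth-first search over plan nodes. This is a minimal, hypothetical sketch (the `PlanNode` class and its `sql` attribute are illustrative stand-ins, not the actual Spark `QueryExecution` API):

```python
from collections import deque

class PlanNode:
    """Hypothetical stand-in for a Spark QueryExecution/LogicalPlan node."""
    def __init__(self, sql=None, children=()):
        self.sql = sql
        self.children = list(children)

def find_sql(root):
    """Breadth-first search for the first node carrying a raw SQL string.

    Mirrors the limitation described above: the first match wins, so only
    one query is ever returned even if several nodes carry SQL.
    """
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.sql is not None:
            return node.sql
        queue.extend(node.children)
    return None
```

The first-match behavior is exactly why, in the union-all example above, only the outermost query survives.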

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2024 contributors to the OpenLineage project

@boring-cyborg boring-cyborg bot added area:integration/spark area:tests Testing code language:java Uses Java programming language labels Aug 5, 2024
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch 2 times, most recently from 69d029b to cbb5e15 Compare August 7, 2024 08:56
@Imbruced Imbruced marked this pull request as ready for review August 7, 2024 10:59
@Imbruced Imbruced requested a review from a team as a code owner August 7, 2024 10:59
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch from 246e79a to 36d8e5d Compare August 7, 2024 14:44
@dolfinus
Contributor

dolfinus commented Aug 9, 2024

Please update PR description and title

@Imbruced Imbruced changed the title Feature/spark/sql extract Extract and send events with raw sql from spark. Aug 12, 2024
@mobuchowski mobuchowski (Member) left a comment


I only did a high-level review, with two items, because IMO we have to rethink those before moving forward with the PR.

@@ -131,6 +131,28 @@ public void end(SparkListenerApplicationEnd applicationEnd) {
eventEmitter.emit(event);
}

@Override
public void sqlQuery(String query) {

I don't think it makes sense to put that here.

First, there can be multiple Spark SQL queries tied to a single "application" (script). Think of just

df = spark.sql("select * from a");
df2 = spark.sql("select * from b");
df.join(df2, ...).write() ...

In this case, you'd get two events here with the application run id, without the ability to tie them to particular Spark actions.

I think the better idea is to check whether we can get something like an executionId from the Spark session, and store the SQL in a map keyed by this executionId. Then, attach the SqlJobFacet to the next (all) COMPLETE and RUNNING events sent with that executionId.
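The reviewer's suggestion above can be sketched as a small cache keyed by executionId. This is a hypothetical illustration, not the OpenLineage implementation; the function names, the event dict shape, and the "sql" facet key are assumptions:

```python
# Hypothetical sketch: remember raw SQL per executionId, then attach it
# to every event later emitted for that same execution.
sql_by_execution_id = {}

def on_sql_query(execution_id, query):
    # Called when a raw SQL string is observed; cache it per execution.
    sql_by_execution_id[execution_id] = query

def enrich_event(execution_id, event):
    # Attach the cached SQL (as an illustrative "sql" job facet) to any
    # RUNNING/COMPLETE event carrying the same executionId.
    query = sql_by_execution_id.get(execution_id)
    if query is not None:
        event.setdefault("job_facets", {})["sql"] = {"query": query}
    return event
```

Keying by executionId is what lets the two `spark.sql(...)` calls in the reviewer's example stay distinguishable, instead of producing two events tied only to the application run id.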

@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch 2 times, most recently from e4853c0 to a3cdc67 Compare August 13, 2024 11:03
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch 2 times, most recently from 93c91a5 to e04c9e3 Compare August 19, 2024 08:01
@mobuchowski mobuchowski self-requested a review August 20, 2024 09:41
@Imbruced Imbruced force-pushed the feature/spark/sql-extract branch from e04c9e3 to 1443347 Compare August 20, 2024 10:22
@mobuchowski mobuchowski merged commit 9046ff1 into OpenLineage:main Aug 20, 2024
45 checks passed