You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current implementation of column lineage extraction does not address real world applications as far as I can tell.
The two most common applications I can think of are:
Impact analysis when changing definition of columns or considering deletion
Understanding definition of columns
Both of these applications require extracting indirect column dependencies from e.g. WHERE, GROUP BY, IF and CASE WHEN clauses. As far as I can tell this is not possible at the moment (unless I'm missing a configuration option).
Example 1) (this is actually from the ExtractColumnLevelLineage.java examples):
UPDATE `bigquery-public-data.samples.wikipedia` W
SET title = S.corpus, comment = S.word
FROM (SELECT corpus, UPPER(word) AS word FROM `bigquery-public-data.samples.shakespeare`) S
WHERE W.title = S.corpus;
The * indicates an indirect dependency (of course the output is generated by the user, so zetasql would have to provide some attribute for indirect relationships). Both columns indirectly depend on corpus and title. The definition of comment changes, if we make changes to either corpus or title in upstream tables. I would like to be aware of that when I make changes to those columns.
Another two made up examples:
Example 2) GROUP BY:
Query:
CREATE TABLE foo.bar AS
SELECT
SUM(word_count) words,
corpus,
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
Lineage:
foo.bar.corpus
<- bigquery-public-data.samples.shakespeare.corpus
foo.bar.words
<- bigquery-public-data.samples.shakespeare.word_count
Please add configuration options for detecting these indirect dependencies, otherwise I don't see anyone adapting column level lineage which could be extremely powerful.
The text was updated successfully, but these errors were encountered:
You bring up a very good point, thank you for reporting this!
I originally envisioned the column lineage feature to focus only on columns that were used to write to others directly. But now that you bring this discussion up; I agree that we should have visibility on filters, groups, etc.
I'll need to look into it a bit to give you a timeline. My first impression is that it'd require decent refactor of how lineage is currently implemented. But I'd like to have this as well!
The current implementation of column lineage extraction does not address real world applications as far as I can tell.
The two most common applications I can think of are:
Both of these applications require extracting indirect column dependencies from e.g.
WHERE
,GROUP BY
,IF
andCASE WHEN
clauses. As far as I can tell this is not possible at the moment (unless I'm missing a configuration option).Example 1) (this is actually from the
ExtractColumnLevelLineage.java
examples):extracts:
when I would like to see:
The * indicates an indirect dependency (of course the output is generated by the user, so zetasql would have to provide some attribute for indirect relationships). Both columns indirectly depend on
corpus
andtitle
. The definition ofcomment
changes, if we make changes to eithercorpus
ortitle
in upstream tables. I would like to be aware of that when I make changes to those columns.Another two made up examples:
Example 2)
GROUP BY
:where I would like to see:
Example 3)
IF
:where I would like to see:
Please add configuration options for detecting these indirect dependencies, otherwise I don't see anyone adapting column level lineage which could be extremely powerful.
The text was updated successfully, but these errors were encountered: