Extract column lineage from indirect dependencies #27

Chr96er · 2023-11-09T19:07:28Z

The current implementation of column lineage extraction does not address real world applications as far as I can tell.

The two most common applications I can think of are:

Impact analysis when changing definition of columns or considering deletion
Understanding definition of columns

Both of these applications require extracting indirect column dependencies from e.g. WHERE, GROUP BY, IF and CASE WHEN clauses. As far as I can tell this is not possible at the moment (unless I'm missing a configuration option).

Example 1) (this is actually from the ExtractColumnLevelLineage.java examples):

UPDATE `bigquery-public-data.samples.wikipedia` W
   SET title = S.corpus, comment = S.word
FROM (SELECT corpus, UPPER(word) AS word FROM `bigquery-public-data.samples.shakespeare`) S
WHERE W.title = S.corpus;

extracts:

Lineage:
bigquery-public-data.samples.wikipedia.comment
                <- bigquery-public-data.samples.shakespeare.word
bigquery-public-data.samples.wikipedia.title
                <- bigquery-public-data.samples.shakespeare.corpus

when I would like to see:

Lineage:
bigquery-public-data.samples.wikipedia.comment
                <- bigquery-public-data.samples.shakespeare.word
                <- bigquery-public-data.samples.shakespeare.corpus*
                <- bigquery-public-data.samples.wikipedia.title*
bigquery-public-data.samples.wikipedia.title
                <- bigquery-public-data.samples.shakespeare.corpus
                <- bigquery-public-data.samples.wikipedia.title*

The * indicates an indirect dependency (of course the output is generated by the user, so zetasql would have to provide some attribute for indirect relationships). Both columns indirectly depend on corpus and title. The definition of comment changes, if we make changes to either corpus or title in upstream tables. I would like to be aware of that when I make changes to those columns.

Another two made up examples:
Example 2) GROUP BY:

Query:
CREATE TABLE foo.bar AS
SELECT
    SUM(word_count) words,
    corpus,
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

Lineage:
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus
foo.bar.words
                <- bigquery-public-data.samples.shakespeare.word_count

where I would like to see:

Lineage:
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus
foo.bar.words
                <- bigquery-public-data.samples.shakespeare.word_count
                <- bigquery-public-data.samples.shakespeare.corpus*

Example 3) IF:

Query:
CREATE TABLE foo.bar AS
SELECT
    SUM(IF(word_count > 10, 1, 0)) word_count_gt_ten,
    corpus,
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

Lineage:
foo.bar.word_count_gt_ten
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus

where I would like to see:

Lineage:
foo.bar.word_count_gt_ten
                <- bigquery-public-data.samples.shakespeare.word_count*
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus

Please add configuration options for detecting these indirect dependencies, otherwise I don't see anyone adapting column level lineage which could be extremely powerful.

The text was updated successfully, but these errors were encountered:

ppaglilla · 2023-11-10T03:08:57Z

You bring up a very good point, thank you for reporting this!

I originally envisioned the column lineage feature to focus only on columns that were used to write to others directly. But now that you bring this discussion up; I agree that we should have visibility on filters, groups, etc.

I'll need to look into it a bit to give you a timeline. My first impression is that it'd require decent refactor of how lineage is currently implemented. But I'd like to have this as well!

Chr96er · 2023-11-10T21:36:43Z

Thanks, would be amazing to have that feature!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract column lineage from indirect dependencies #27

Extract column lineage from indirect dependencies #27

Chr96er commented Nov 9, 2023

ppaglilla commented Nov 10, 2023 •

edited

Loading

Chr96er commented Nov 10, 2023

Extract column lineage from indirect dependencies #27

Extract column lineage from indirect dependencies #27

Comments

Chr96er commented Nov 9, 2023

ppaglilla commented Nov 10, 2023 • edited Loading

Chr96er commented Nov 10, 2023

ppaglilla commented Nov 10, 2023 •

edited

Loading