Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract column lineage from indirect dependencies #27

Open
Chr96er opened this issue Nov 9, 2023 · 2 comments
Open

Extract column lineage from indirect dependencies #27

Chr96er opened this issue Nov 9, 2023 · 2 comments

Comments

@Chr96er
Copy link

Chr96er commented Nov 9, 2023

The current implementation of column lineage extraction does not address real world applications as far as I can tell.

The two most common applications I can think of are:

  1. Impact analysis when changing definition of columns or considering deletion
  2. Understanding definition of columns

Both of these applications require extracting indirect column dependencies from e.g. WHERE, GROUP BY, IF and CASE WHEN clauses. As far as I can tell this is not possible at the moment (unless I'm missing a configuration option).

Example 1) (this is actually from the ExtractColumnLevelLineage.java examples):

UPDATE `bigquery-public-data.samples.wikipedia` W
   SET title = S.corpus, comment = S.word
FROM (SELECT corpus, UPPER(word) AS word FROM `bigquery-public-data.samples.shakespeare`) S
WHERE W.title = S.corpus;

extracts:

Lineage:
bigquery-public-data.samples.wikipedia.comment
                <- bigquery-public-data.samples.shakespeare.word
bigquery-public-data.samples.wikipedia.title
                <- bigquery-public-data.samples.shakespeare.corpus

when I would like to see:

Lineage:
bigquery-public-data.samples.wikipedia.comment
                <- bigquery-public-data.samples.shakespeare.word
                <- bigquery-public-data.samples.shakespeare.corpus*
                <- bigquery-public-data.samples.wikipedia.title*
bigquery-public-data.samples.wikipedia.title
                <- bigquery-public-data.samples.shakespeare.corpus
                <- bigquery-public-data.samples.wikipedia.title*

The * indicates an indirect dependency (of course the output is generated by the user, so zetasql would have to provide some attribute for indirect relationships). Both columns indirectly depend on corpus and title. The definition of comment changes, if we make changes to either corpus or title in upstream tables. I would like to be aware of that when I make changes to those columns.

Another two made up examples:
Example 2) GROUP BY:

Query:
CREATE TABLE foo.bar AS
SELECT
    SUM(word_count) words,
    corpus,
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

Lineage:
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus
foo.bar.words
                <- bigquery-public-data.samples.shakespeare.word_count

where I would like to see:

Lineage:
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus
foo.bar.words
                <- bigquery-public-data.samples.shakespeare.word_count
                <- bigquery-public-data.samples.shakespeare.corpus*

Example 3) IF:

Query:
CREATE TABLE foo.bar AS
SELECT
    SUM(IF(word_count > 10, 1, 0)) word_count_gt_ten,
    corpus,
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

Lineage:
foo.bar.word_count_gt_ten
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus

where I would like to see:

Lineage:
foo.bar.word_count_gt_ten
                <- bigquery-public-data.samples.shakespeare.word_count*
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus

Please add configuration options for detecting these indirect dependencies, otherwise I don't see anyone adapting column level lineage which could be extremely powerful.

@ppaglilla
Copy link
Collaborator

ppaglilla commented Nov 10, 2023

You bring up a very good point, thank you for reporting this!

I originally envisioned the column lineage feature to focus only on columns that were used to write to others directly. But now that you bring this discussion up; I agree that we should have visibility on filters, groups, etc.

I'll need to look into it a bit to give you a timeline. My first impression is that it'd require decent refactor of how lineage is currently implemented. But I'd like to have this as well!

@Chr96er
Copy link
Author

Chr96er commented Nov 10, 2023

Thanks, would be amazing to have that feature!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants