Introducing SQL Data Lineage, a powerful package designed to simplify SQL query analysis. This versatile tool parses data lineage from individual SQL queries or builds comprehensive lineage from multiple queries. It offers both an interactive command-line interface and programmatic integration, making it easy to incorporate into your Python projects.
SQL Data Lineage performs detailed column-level analysis, tracing data flows step-by-step through tables, CTEs, subqueries, and more. It generates user-friendly lineage graphs that clearly show how columns move and transform across SQL components.
You can easily enhance your lineage insights by retrieving and customizing metadata to fit your specific requirements.
We welcome and encourage contributions to the SQL Data Lineage project!
pip install sql-datalineage
Show help of CLI commands.
datalineage --help
Generate data lineage of a sql file, output type is mermaid.
$> datalineage -i docs/example/test-query.sql --schema-path docs/example/test-schema.json -r mermaid
%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
graph LR
subgraph 2420861448752 ["Table: catalog.schema1.customer AS customer"]
2420861447168["id"]
2420861446976["name"]
2420861448464["phone"]
2420860590112["address"]
2420861446304["location"]
end
subgraph 2420861448224 ["CTE: cte1 AS t1"]
2420861448848["id"]
2420861449040["name"]
2420861448272["phone"]
2420861449184["address"]
end
2420861447168 --> 2420861448848
2420861446976 --> 2420861449040
2420861448464 --> 2420861448272
2420860590112 --> 2420861449184
....
Tip
The output of the above command is truncated. You
can optionally save the command's result to a file
using the -o
option.
datalineage -i /docs/example/test-query.sql --schema-path docs/example/test-schema.json -o docs/example/output.mermaid -r mermaid
You can preview the above result using Mermaid Live Editor, here is the result:
You can import datalineage into your project and generate the lineage tree directly.
>>> from datalineage.lineage import lineage
>>> sql = """select
id, name, phone, address
from (
select id, name, phone, address,
row_number() over(partition by phone order by name) as rn
from `catalog.schema1.customer`) data
where data.rn = 1
"""
>>> schema = None # we will infer the schema of table when no schema are provided
>>> dialect = "bigquery"
>>> tree = lineage(sql, dialect, schema)
>>> tree
Node<{"name": "myroot", "expression": "ANCHOR",...
You can traversal and print out the lineage tree in this way:
>>> def print_node(node):
>>> print("Node:", node.name)
>>> list(map(lambda c: print("Column:", c.name), node.children))
>>> for node in tree.walk():
... print_node(node)
...
Node: myroot
Node: _output_
Column: id
Column: name
Column: phone
Column: address
Node: data
Column: id
Column: name
Column: phone
Column: address
Column: rn
Node: "catalog"."schema1"."customer" AS "customer"
Column: id
Column: name
Column: phone
Column: address
Or you can render the tree to a format you like, for example, mermaid.
>>> from datalineage.renderer import MermaidRenderer
>>> renderer = MermaidRenderer()
>>> print(renderer.render(tree))
%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
graph LR
subgraph 1434247920720 ["Table: catalog.schema1.customer AS customer"]
1434247920624["id"]
1434247921104["name"]
1434247919568["phone"]
1434247921200["address"]
end
subgraph 1434247919280 ["Subquery: data"]
1434247920816["id"]
1434247919856["name"]
1434247917696["phone"]
1434247917744["address"]
1434247918224["rn"]
end
1434247920624 --> 1434247920816
1434247921104 --> 1434247919856
1434247919568 --> 1434247917696
1434247921200 --> 1434247917744
1434247919568 --> 1434247918224
1434247921104 --> 1434247918224
subgraph 1434247918032 ["Select: _output_"]
1434247921392["id"]
1434247921344["name"]
1434247921152["phone"]
1434247920912["address"]
end
1434247920816 --> 1434247921392
1434247919856 --> 1434247921344
1434247917696 --> 1434247921152
1434247917744 --> 1434247920912
>>>
Tip
You can render to json format using datalineage.renderer.JsonRenderer
class, or customize your own renderer.
If you are in enviroment which support Ipython (for example, jupyter notebook), you can render the mermaid graph directly:
from datalineage.renderer import MermaidRenderer, MermaidType
html_renderer = MermaidRenderer(output_type=MermaidType.HTML)
html_output = html_renderer.render(tree)
We use uv to manage the project. Please follow the official document to install uv to your environment.
Tip
Your environment does not require any python or pip installed, but if you already have pip, you can quickly install uv like this:
pip install uv
Install pre-commit to your local git hooks.
make install-pre-commit
make style
make test
make check
uv run --no-project -- datalineage --help