Inconsistent URN casing in DBT ingestion #7377

alex-magno · 2023-02-20T10:16:19Z

Describe the bug
DBT ingestion deals with three main URNs: the model name, the table name in the source platform (bigquery, snowflake, etc.), and the column name. During ingestion, the table identifier gets lowercased and the others do not.

model_name preserves the casing used in the manifest:

datahub/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_core.py

Lines 138 to 148 in 1df806d

    
    name = manifest_node["name"]
    if use_identifiers and manifest_node.get("identifier"):
        name = manifest_node["identifier"]
    if (
        manifest_node.get("alias") is not None
        and manifest_node.get("resource_type")
        != "test"  # tests have non-human-friendly aliases, so we don't want to use it for tests
    ):
        name = manifest_node["alias"]

table identifier (db_fqn) is being lowercased:

datahub/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py

Line 400 in aa388f0

    db_fqn = db_fqn.lower()

column URN preserves the casing used in the catalog:

datahub/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_core.py

Line 113 in 1df806d

    name=catalog_column["name"],

I'm using DBT with Snowflake and this is causing two bug scenarios:

1. When I run Snowflake ingestion with convert_urns_to_lowercase=False: the table identifier URN coming from Snowflake (not lowercased) does not match the table identifier from DBT (db_fqn), which is lowercased. So the nodes do not match.
2. When I run Snowflake ingestion with convert_urns_to_lowercase=True: the table identifiers now match (both URNs are lowercased), but the column identifiers do not, because the Snowflake ingestion converts the column URNs to lowercase while DBT preserves column casing. The result is a schema view with duplicated columns (lowercase and uppercase).
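
To make the two scenarios concrete, here is a minimal sketch that models the casing behavior described above with plain string helpers. It does not call the DataHub library; the helper functions and the example table and column names are made up for illustration, and the Snowflake side is modeled only as this issue describes it.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN (illustrative helper, not the library API)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def field_urn(dataset: str, column: str) -> str:
    """Build a DataHub-style schema field URN for one column of a dataset."""
    return f"urn:li:schemaField:({dataset},{column})"


def snowflake_urns(fqn: str, column: str, convert_urns_to_lowercase: bool):
    """Assumed Snowflake behavior: the flag lowercases both table and column."""
    if convert_urns_to_lowercase:
        fqn, column = fqn.lower(), column.lower()
    ds = dataset_urn("snowflake", fqn)
    return ds, field_urn(ds, column)


def dbt_urns(db_fqn: str, column: str):
    """DBT behavior from this report: table always lowercased, column preserved."""
    ds = dataset_urn("snowflake", db_fqn.lower())
    return ds, field_urn(ds, column)


# Made-up example identifiers, uppercase as Snowflake typically reports them.
FQN, COLUMN = "ANALYTICS.PUBLIC.ORDERS", "ORDER_ID"

dbt_table, dbt_column = dbt_urns(FQN, COLUMN)

# Scenario 1: convert_urns_to_lowercase=False, so the table URNs differ.
sf_table, sf_column = snowflake_urns(FQN, COLUMN, convert_urns_to_lowercase=False)
print(sf_table == dbt_table)    # False

# Scenario 2: convert_urns_to_lowercase=True, so tables match but columns differ.
sf_table, sf_column = snowflake_urns(FQN, COLUMN, convert_urns_to_lowercase=True)
print(sf_table == dbt_table)    # True
print(sf_column == dbt_column)  # False
```

In neither setting do both the table and the column URNs line up, which is why the flag on the Snowflake side alone cannot fix the mismatch.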

To Reproduce
Steps to reproduce the behavior:

1. Ingest DBT using snowflake as the target_platform.
2. Ingest snowflake metadata with convert_urns_to_lowercase=False. Observe the first bug described.
3. Ingest snowflake metadata with convert_urns_to_lowercase=True. Observe the second bug described.

Expected behavior
Column and table identifiers should match between DBT and the source platform. Ideally, casing in DBT should be consistent: either lowercase everything or preserve case in every identifier.

As a suggestion, I can submit a PR to introduce a convert_urns_to_lowercase flag to the DBT recipe as well, so users can decide whether or not to lowercase every identifier. That would at least make the behavior consistent.
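
A rough sketch of what the suggested flag could look like, assuming a single option applied uniformly to all three identifiers. The `DbtCasingConfig` class and both functions are hypothetical names used only for illustration; they are not existing DataHub code, and an actual implementation may look different.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DbtCasingConfig:
    # Hypothetical option mirroring the Snowflake source's flag of the same name.
    convert_urns_to_lowercase: bool = False


def normalize(identifier: str, config: DbtCasingConfig) -> str:
    """Lowercase the identifier only when the flag is enabled."""
    return identifier.lower() if config.convert_urns_to_lowercase else identifier


def dbt_identifiers(
    model_name: str, db_fqn: str, column: str, config: DbtCasingConfig
) -> Tuple[str, str, str]:
    """Apply the same casing rule to the model name, table identifier and column."""
    return (
        normalize(model_name, config),
        normalize(db_fqn, config),
        normalize(column, config),
    )


# Made-up example identifiers.
print(dbt_identifiers("Orders", "DB.SCHEMA.ORDERS", "ORDER_ID", DbtCasingConfig(True)))
print(dbt_identifiers("Orders", "DB.SCHEMA.ORDERS", "ORDER_ID", DbtCasingConfig(False)))
```

The point of routing every identifier through one function is that the table name can no longer be lowercased on its own, so the DBT side would either match a lowercased Snowflake ingestion or a case-preserving one.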

Screenshots

Column names being duplicated when using convert_urns_to_lowercase=True in Snowflake ingestion.

remisalmon · 2023-03-06T22:05:56Z

I am having the same issue starting with v0.10.0: dbt and Snowflake columns show up twice, in upper and lower case (with convert_urns_to_lowercase=true in the Snowflake config). This looks like a regression introduced by #7063? I am not having this issue with v0.9.5.

alex-magno added the bug Bug report label Feb 20, 2023

alex-magno mentioned this issue Feb 23, 2023

fix(ingest/dbt): introduce lowercase column urn option #7418

Merged

hsheth2 closed this as completed in #7418 Mar 20, 2023

viplazylmht mentioned this issue Apr 19, 2023

dbt ingestion always lowercasing the table identifier urns #7853

Closed
