Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate that dbt-teradata works with dbt-expectations package #27

Open
adamtwo opened this issue Feb 4, 2022 · 3 comments
Open

Validate that dbt-teradata works with dbt-expectations package #27

adamtwo opened this issue Feb 4, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@adamtwo
Copy link
Contributor

adamtwo commented Feb 4, 2022

Describe the feature

dbt users frequently rely on https://github.com/calogica/dbt-expectations. We need to ensure that dbt-teradata works well with this dbt package.

Describe alternatives you've considered

Don't depend on existing dbt libraries and always create your own.

Additional context

This came up in a discussion with a large Telco customer who is using dbt with Teradata.

Who will this benefit?

Users who want to leverage https://github.com/calogica/dbt-expectations in their dbt projects.

Are you interested in contributing this feature?

Yes.

@adamtwo adamtwo added the enhancement New feature or request label Feb 4, 2022
@sambloom92
Copy link

Has this issue been looked at?

@adamtwo
Copy link
Contributor Author

adamtwo commented Jun 7, 2023

Hi @sambloom92 , we have not investigated it yet. Right now, we prioritize catching up with dbt releases. If you have tried it and have feedback it would be very helpful.

@sambloom92
Copy link

sambloom92 commented Jun 8, 2023

Ok I have been writing some tests using dbt-teradata and dbt-expectations for a work project, so I will post all the issues I find and the code I've used to fix them.

These are the tests I've tried so far (I won't try them all as there are lots, but these are the ones I've needed so far):

  1. expect_column_values_to_be_between
  2. expect_column_values_to_be_of_type
  3. expect_column_values_to_be_of_type_list
  4. expect_column_values_lengths_to_equal
  5. expect_values_to_match_regex

The main fix that I needed to implement to make most of these work was to create a teradata__truth_expression macro, as the default__truth_expression macro relies on a boolean type existing, which of course teradata does not have, so I've wrapped it in a CASE WHEN and cast it to a BYTEINT type (luckily the truth expression uses adapter.dispatch, so it's a very simple fix):

{% macro teradata__truth_expression(expression) %}
  case when {{ expression }} then CAST(1 AS BYTEINT) else CAST(0 AS BYTEINT) end as expression
{% endmacro %}

This then requires that each test that calls dbt_expectations.expression_is_true must be changed so that it uses the optional test_condition argument to use the BYTEINT value instead of letting it default to '= true':

{{ dbt_expectations.expression_is_true(model,
                                        expression=expression_min_max,
                                        group_by_columns=group_by_columns,
                                        row_condition=row_condition,
                                        test_condition="= 1")
                                        }}

Unfortunately there are a lot of tests which use dbt_expectations.expression_is_true, but most of them don't use adapter.dispatch, so it's not possible to just write a teradata__ prefixed implementation and call it a day- you have to completely override them by re-writing the test locally.

Ideally all of the tests in dbt_expectations would be implemented such that the test calls a macro of the same name, and this macro then uses adapter.dispatch, with a default__ prefixed implementation. Then you wouldn't have to override anything, you could just write a teradata__ prefixed implementation, without having to re-write any existing code, and adapter.dispatch will call the correct version for the given adapter.

I may suggest this as an enhancement to the dbt-expectations package, as it won't change anything functionally, but should make it easier for other dbt adapaters to maintain compatibility. Here is an example of what I mean:

{% test expect_column_values_to_match_regex(model, column_name,
                                                    regex,
                                                    row_condition=None,
                                                    is_raw=False,
                                                    flags=""
                                                    ) %}

{% set expression %}
{{ dbt_expectations.regexp_instr(column_name, regex, is_raw=is_raw, flags=flags) }} > 0
{% endset %}

{{ dbt_expectations.expression_is_true(model,
                                        expression=expression,
                                        group_by_columns=None,
                                        row_condition=row_condition
                                        )
                                        }}

{% endtest %}

would ideally be refactored to:

{% test expect_column_values_to_match_regex(model, column_name,
                                                    regex,
                                                    row_condition=None,
                                                    is_raw=False,
                                                    flags=""
                                                    ) %}
{{ expect_column_values_to_match_regex(model, column_name, regex, row_condition, is_raw,flags) }}
{% endtest %}

{% macro expect_column_values_to_match_regex(model, column_name,
                                                    regex,
                                                    row_condition=None,
                                                    is_raw=False,
                                                    flags=""
                                                    ) %}
{{adapter.dispatch('expect_column_values_to_match_regex', 'dbt_expectations')(model, column_name, regex, row_condition, is_raw,flags)}}
{% endmacro %}

{% macro default__expect_column_values_to_match_regex(model, column_name,
                                                    regex,
                                                    row_condition=None,
                                                    is_raw=False,
                                                    flags=""
                                                    ) %}
{% set expression %}
{{ dbt_expectations.regexp_instr(column_name, regex, is_raw=is_raw, flags=flags) }} > 0
{% endset %}
{{ dbt_expectations.expression_is_true(model,
                                        expression=expression,
                                        group_by_columns=None,
                                        row_condition=row_condition,
                                        )
                                        }}
{% endmacro %}

So then all you'd need to do to make it teradata compatible is provide this macro in the dbt-teradata adapater:

{% macro teradata__expect_column_values_to_match_regex(model, column_name,
                                                    regex,
                                                    row_condition=None,
                                                    is_raw=False,
                                                    flags=""
                                                    ) %}
{% set expression %}
{{ dbt_expectations.regexp_instr(column_name, regex, is_raw=is_raw, flags=flags) }} > 0
{% endset %}
{{ dbt_expectations.expression_is_true(model,
                                        expression=expression,
                                        group_by_columns=None,
                                        row_condition=row_condition,
                                        test_condition="= 1"
                                        )
                                        }}       
{% endmacro %}

@adamtwo do you think this change would be a reasonable request for the dbt-expectations maintainers? Or is there an easier way that I'm missing which would not require them to change anything on their side?

Finally, expect_column_values_to_be_in_type_list also has an incompatibility issue to do with teradata requiring SELECT statements in UNIONs to select from a table. My fix for this was to just add an empty 'dummy' table and select from this. There was also an issue with trying to cast things as TEXT, so I just changed it to VARCHAR(50) (hopefully that should be long enough for all the data types).
Here is the default implementation:

{% macro default__expect_column_values_to_be_in_type_list(model, column_name, column_type_list) %}
{%- if execute -%}

    {%- set column_name = column_name | upper -%}
    {%- set columns_in_relation = adapter.get_columns_in_relation(model) -%}
    {%- set column_type_list = column_type_list| map("upper") | list -%}
    with relation_columns as (

        {% for column in columns_in_relation %}
        select
            cast('{{ escape_single_quotes(column.name | upper) }}' as {{ dbt.type_string() }}) as relation_column,
            cast('{{ column.dtype | upper }}' as {{ dbt.type_string() }}) as relation_column_type
        {% if not loop.last %}union all{% endif %}
        {% endfor %}
    ),
    test_data as (

        select
            *
        from
            relation_columns
        where
            relation_column = '{{ column_name }}'
            and
            relation_column_type not in ('{{ column_type_list | join("', '") }}')

    )
    select *
    from test_data

{%- endif -%}
{% endmacro %}

And the teradata implementation:

{% macro teradata__expect_column_values_to_be_in_type_list(model, column_name, column_type_list) %}     
{%- if execute -%}

    {%- set column_name = column_name | upper -%}
    {%- set columns_in_relation = adapter.get_columns_in_relation(model) -%}
    {%- set column_type_list = column_type_list| map("upper") | list -%}
    with dummy_table as (select null as dummy_column),
    relation_columns as (

        {% for column in columns_in_relation %}
        select
            '{{ escape_single_quotes(column.name | upper) }}' as relation_column,
           CAST('{{ column.dtype | upper }}' AS VARCHAR(50)) as relation_column_type
            from dummy_table 
        {% if not loop.last %}union all{% endif %}
        {% endfor %}
    ),
    test_data as (

        select
            *
        from
            relation_columns
        where
            relation_column = '{{ column_name }}'
            and
            relation_column_type not in ('{{ column_type_list | join("', '") }}')

    )
    select *
    from test_data

{%- endif -%}
{% endmacro %}

TLDR;

The main issues i've discovered are:

  • No boolean type in teradata, so anywhere with expression = true needs to be adapted to use a byteint type instead
  • Most tests don't use adapter.dispatch so making them teradata compatible requires completely overriding them, rather than just providing a teradata__ prefixed implementation of the specific part that causes the incompatibility. I am suggesting dbt-expectations change this on their side, to make it easier for adapaters to maintain compatibility
  • teradata requires SELECT statements in UNIONs to select from a table, so add a dummy table.
  • teradata has no TEXT type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants