
Specify Columns to update on incremental Models #1862

Closed
richmintz opened this issue Oct 24, 2019 · 19 comments · Fixed by #3100
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Comments

@richmintz

Describe the feature

Allow model developers to choose the set of columns that should be updated on incremental loads.

Who will this benefit?

This will benefit all users performing incremental updates of wide tables where only a few columns are mutable.

@richmintz richmintz added enhancement New feature or request triage labels Oct 24, 2019
@drewbanin drewbanin removed the triage label Oct 28, 2019
@drewbanin
Contributor

hey @richmintz! Can you say more about this? I think it's a good idea - I'm curious if you're interested in this feature for performance reasons, or something else.

@richmintz
Author

Hi @drewbanin!
My primary goal is performance: minimizing writes and index rebuilds. However, I also think it will generate more concise SQL for large tables when a merge is required.

@drewbanin
Contributor

Got it! Sure, this makes a lot of sense in merge statements. Is this on Snowflake/BigQuery? I think a similar paradigm would apply to Redshift/Postgres, but we'd modify the update statement instead of the merge statement.

I can imagine a config like update_columns which could be supplied like:

{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}}

If update_columns is not provided, or if it is null, dbt will default to updating all of the columns in the model.

You buy all of that?
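For illustration only, with that config the merge on Snowflake might compile to something like the following. The table and column names here are hypothetical, assuming a unique_key of order_id:

```sql
merge into analytics.orders as DBT_INTERNAL_DEST
using analytics.orders__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.order_id = DBT_INTERNAL_DEST.order_id

-- only the columns listed in update_columns appear here,
-- instead of every column in the model
when matched then update set
    order_total = DBT_INTERNAL_SOURCE.order_total,
    count_orders = DBT_INTERNAL_SOURCE.count_orders

-- inserts would still include all columns
when not matched then insert (order_id, order_total, count_orders)
values (order_id, order_total, count_orders)
```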

@bfil

bfil commented Jan 17, 2020

As part of this proposed change, it would also be nice to be able to exclude the when matched then update set part of the merge altogether. In some of my models I'm only interested in adding new rows, since the source data is never updated (event-based data, for example, or other append-only tables), and skipping the update makes the model execution faster (at least in BigQuery).

It could be a separate config no_update = True or just a convention that doesn't include the SQL block when update_columns is empty.

Please note I tested this and it works for BigQuery; other databases might need different syntax to support a no-op for updates.

Any thoughts?
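To make the idea concrete, here is a sketch of the insert-only merge this could produce on BigQuery (relation and column names are hypothetical):

```sql
merge into analytics.events as DBT_INTERNAL_DEST
using analytics.events__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.event_id = DBT_INTERNAL_DEST.event_id
-- no "when matched then update set ..." clause at all:
-- existing rows are left untouched, only new rows are added
when not matched then insert (event_id, event_type, created_at)
values (event_id, event_type, created_at)
```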

@drewbanin
Contributor

hey @bfil - if you do not supply a unique_key for your incremental model's config, then dbt should not inject a when matched then update set.... clause to the merge statement.

Check out the implementation here:
https://github.com/fishtown-analytics/dbt/blob/7a07017b96ac332c872725343833a94b49129c68/core/dbt/include/global_project/macros/materializations/common/merge.sql#L23-L48
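For readers who don't want to follow the link, the relevant logic in that macro looks roughly like this (a paraphrase, not the exact source — the macro name and whitespace are illustrative):

```sql
{% macro common_get_merge_sql(target, source, unique_key, dest_columns) -%}
    {%- set dest_cols_csv = dest_columns | map(attribute='quoted') | join(', ') -%}

    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE

    {% if unique_key %}
        on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}

        {#-- the update clause is only emitted when a unique_key is set --#}
        when matched then update set
        {% for column in dest_columns -%}
            {{ column.quoted }} = DBT_INTERNAL_SOURCE.{{ column.quoted }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% else %}
        on FALSE
    {% endif %}

    when not matched then insert ({{ dest_cols_csv }})
    values ({{ dest_cols_csv }})
{%- endmacro %}
```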

@bfil

bfil commented Jan 21, 2020

@drewbanin I know, but I still want to compare and insert only the new data based on that unique key, rather than merging on an always-FALSE condition. Does that make sense?

@drewbanin
Contributor

ah! Sure, that makes a lot of sense. What do you think about pushing this logic down from the materialization layer into the modeling layer? Could you do something like:

{{ config(materialized='incremental') }}

select id, col1, col2 from {{ ref('some_table') }}
{% if is_incremental() %}
where id not in (select id from {{ this }})
{% endif %}

This is a kind of funny adaptation of the typical incremental modeling approach, but it should serve to only select the new records from your source table, inserting them directly into the destination table.
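In other words, on an incremental run the model above would compile to something along these lines (relation names hypothetical):

```sql
-- dbt wraps the model SQL in an insert on incremental runs
insert into analytics.my_model (id, col1, col2)
select id, col1, col2 from analytics.some_table
where id not in (select id from analytics.my_model)
```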

@ee07dazn

ee07dazn commented Jun 8, 2020

@drewbanin: any update on where we are with this feature? I have a wide table [Snowflake] but only need to update a few columns in the incremental model. Any ETA on the feature?

@drewbanin
Contributor

hey @ee07dazn - no movement on this issue on our end to report. I think the spec I mentioned above (pasted below) is still a good idea:

{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}}

I think that this feature should only be supported on databases that support merge statements, as dbt's delete+insert approach doesn't really lend itself to this pattern. If anyone is interested in working on this one, I'd be happy to try to point you in the right direction.

The key change here will be to replace the call to adapter.get_columns_in_relation with config.get('update_columns'). We'll want to implement this for both the Snowflake and BigQuery incremental materializations when the merge strategy is used.
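A minimal sketch of that change inside the Snowflake incremental materialization, assuming we fall back to all columns when update_columns isn't configured (variable names are illustrative):

```sql
{%- set update_columns = config.get('update_columns') -%}
{% if update_columns %}
    {#-- user-specified subset drives the "when matched" clause --#}
    {% set dest_columns = update_columns %}
{% else %}
    {#-- default behavior: update every column in the target --#}
    {% set dest_columns = adapter.get_columns_in_relation(target_relation) %}
{% endif %}
```

One quirk to watch: update_columns would be a list of strings, while get_columns_in_relation returns column objects, so the merge macro would need to handle both shapes.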

@drewbanin drewbanin added the good_first_issue Straightforward + self-contained changes, good for new contributors! label Jun 8, 2020
@ee07dazn

ee07dazn commented Jun 8, 2020

Thanks @drewbanin for the information and the pointer. Apart from what you suggested, I had to make a change to the default__get_merge_sql macro to make that approach work. It works now, but I'm just not happy making a change to something as low-level as default__get_merge_sql. I think I can update the snowflake_get_merge_sql to do this job instead. Thinking out loud, but thanks for the help.

@ee07dazn

ee07dazn commented Jun 9, 2020

Hi @drewbanin: I have another question, and please bear with me if it's a silly one, as I am new to the tool and still exploring it through a POC.

So I need to update the incremental materialization to make sure I can use update_columns in the config, but the rest of the materialization does not need to change.

{% materialization incremental, adapter='snowflake' -%}

  {%- set unique_key = config.get('unique_key') -%}
  {%- set full_refresh_mode = config.get('should_full_refresh') -%}

  {% set target_relation = this %}
  {% set existing_relation = load_relation(this) %}
  {% set tmp_relation = make_temp_relation(this) %}

  {#-- Validate early so we don't run SQL if the strategy is invalid --#}
  {% set strategy = dbt_snowflake_validate_get_incremental_strategy(config) -%}

  -- setup
  {{ run_hooks(pre_hooks, inside_transaction=False) }}

  -- `BEGIN` happens here:
  {{ run_hooks(pre_hooks, inside_transaction=True) }}

  {% if existing_relation is none %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% elif existing_relation.is_view %}
    {#-- Can't overwrite a view with a table - we must drop --#}
    {{ log("Dropping relation " ~ target_relation ~ " because it is a view and this model is a table.") }}
    {% do adapter.drop_relation(existing_relation) %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% elif full_refresh_mode %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% else %}
    {% do run_query(create_table_as(True, tmp_relation, sql)) %}
    {% do adapter.expand_target_column_types(
           from_relation=tmp_relation,
           to_relation=target_relation) %}
    {% set dest_columns = config.get('update_columns') %}
    {% set build_sql = dbt_snowflake_get_incremental_sql(strategy, tmp_relation, target_relation, unique_key, dest_columns) %}
  {% endif %}

  {%- call statement('main') -%}
    {{ build_sql }}
  {%- endcall -%}

  {{ run_hooks(post_hooks, inside_transaction=True) }}

  -- `COMMIT` happens here
  {{ adapter.commit() }}

  {{ run_hooks(post_hooks, inside_transaction=False) }}

  {% set target_relation = target_relation.incorporate(type='table') %}
  {% do persist_docs(target_relation, model) %}
  {{ return({'relations': [target_relation]}) }}
{%- endmaterialization %}

In this code, while the SQL query gets executed correctly, I get the following error:

"Compilation Error in macro materialization_incremental_snowflake (macros/incremental.sql)
  'persist_docs' is undefined"

which points at the third-to-last line. I want the default macro to be executed when I have not redefined a particular macro, as in this case I don't need to.

How can I tell dbt to execute the default persist_docs macro in this model? If you prefer, I can create this as a separate issue, but it felt tied to the original question.

@ee07dazn

ee07dazn commented Jun 10, 2020

I redefined the persist_docs macro in my project and changed the materialization to point to my macro, but then I got:
" 'dict object' has no attribute 'default__persist_docs' "

UPDATE: I upgraded to v0.17.0 and there is no issue any longer. Thanks anyway for the help.

@drewbanin
Contributor

ha! Ok @ee07dazn - glad to hear that you got it working in 0.17.0! I think we should still proceed with getting a feature like this merged into dbt. If you're interested in opening a PR we'd be happy to help, otherwise, someone else might be able to pick it up in the future :)

@ee07dazn

@drewbanin : sorry, missed this comment. I will open a PR soon for this and tag you on it.

@cbheri

cbheri commented Aug 11, 2020

@drewbanin any update on implementing the update_columns config proposed above?

@prratek
Contributor

prratek commented Jan 13, 2021

@ee07dazn if you haven't had a chance to work on this yet, mind if I take a stab at it? @drewbanin other than the change to adapter.get_columns_in_relation, are there changes I'd have to make so that dbt accepts the new config parameter?

@drewbanin
Contributor

hey folks! No update from my side unfortunately, but please do feel free to open up a PR or follow up below if we want to jam on potential solutions together.

Just re-reading this thread and saw:

I think that this feature should only be supported on databases that support merge statements, as dbt's delete+insert approach doesn't really lend itself well to this approach.

I do think that's still the right approach, meaning that this feature should only be accessible on BigQuery/Snowflake and not Postgres/Redshift, for instance. @prratek I do think that replacing a call to get_columns_in_relation with config.get('update_columns') should work well! It's been a while since I've touched this part of the codebase though, so there might be some other quirks that come up in testing.

@prratek if you'd like, feel free to drop in a draft PR with some example changes and we can take things from there too!

@prratek
Contributor

prratek commented Feb 13, 2021

Okay here's a draft PR. We'd probably want to default to adapter.get_columns_in_relation if the update_columns config parameter isn't specified, right?
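That fallback could be as small as a one-liner in the materialization (a sketch, assuming the downstream merge macro can accept either a list of strings or a list of column objects):

```sql
{% set dest_columns = config.get('update_columns')
       or adapter.get_columns_in_relation(existing_relation) %}
```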

@jtcohen6
Contributor

Thanks for the draft PR @prratek! I can leave comments over there.

@jtcohen6 jtcohen6 added this to the Margaret Mead milestone Apr 13, 2021