
Specify Columns to update on incremental Models #1862

Closed
richmintz opened this issue Oct 24, 2019 · 19 comments · Fixed by #3100
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Comments

@richmintz

Describe the feature

Allow model developers to choose the set of columns that should be updated on incremental loads.

Who will this benefit?

This will benefit all users performing incremental updates of wide tables where only a few columns are mutable.

@richmintz richmintz added enhancement New feature or request triage labels Oct 24, 2019
@drewbanin drewbanin removed the triage label Oct 28, 2019
@drewbanin
Contributor

hey @richmintz! Can you say more about this? I think it's a good idea - I'm curious if you're interested in this feature for performance reasons, or something else.

@richmintz
Author

Hi @drewbanin!
My primary goal is performance: minimizing writes and index rebuilds. However, I also think it will generate more concise SQL for large tables when a merge is required.

@drewbanin
Contributor

Got it! Sure, this makes a lot of sense in merge statements. Is this on Snowflake/BigQuery? I think a similar paradigm would apply to Redshift/Postgres, but we'd modify the update statement instead of the merge statement.

I can imagine a config like update_columns which could be supplied like:

{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}}

If update_columns is not provided, or if it is null, dbt will default to updating all of the columns in the model.

You buy all of that?
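For illustration only, with that config the merge on Snowflake might compile to something like the following. The table and column names here are hypothetical, assuming a unique_key of order_id:

```sql
merge into analytics.orders as DBT_INTERNAL_DEST
using analytics.orders__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.order_id = DBT_INTERNAL_DEST.order_id

-- only the columns listed in update_columns appear here,
-- instead of every column in the model
when matched then update set
    order_total = DBT_INTERNAL_SOURCE.order_total,
    count_orders = DBT_INTERNAL_SOURCE.count_orders

-- inserts would still include all columns
when not matched then insert (order_id, order_total, count_orders)
values (order_id, order_total, count_orders)
```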

@bfil

bfil commented Jan 17, 2020

As part of this proposed change, it would also be nice to be able to exclude the when matched then update set part of the merge altogether. In some of my models I'm only interested in adding new rows, since the source data is never updated (event-based data, for example, or other append-only tables), and skipping the update makes the model execution faster (at least in BigQuery).

It could be a separate config no_update = True or just a convention that doesn't include the SQL block when update_columns is empty.

Please note I tested this and it works for BigQuery; other databases might need different syntax to support a no-op for updates.

Any thoughts?
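To make the idea concrete, here is a sketch of the insert-only merge this could produce on BigQuery (relation and column names are hypothetical):

```sql
merge into analytics.events as DBT_INTERNAL_DEST
using analytics.events__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.event_id = DBT_INTERNAL_DEST.event_id
-- no "when matched then update set ..." clause at all:
-- existing rows are left untouched, only new rows are added
when not matched then insert (event_id, event_type, created_at)
values (event_id, event_type, created_at)
```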

@drewbanin
Contributor

hey @bfil - if you do not supply a unique_key for your incremental model's config, then dbt should not inject a when matched then update set.... clause to the merge statement.

Check out the implementation here:
https://github.com/fishtown-analytics/dbt/blob/7a07017b96ac332c872725343833a94b49129c68/core/dbt/include/global_project/macros/materializations/common/merge.sql#L23-L48
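For readers who don't want to follow the link, the relevant logic in that macro looks roughly like this (a paraphrase, not the exact source — the macro name and whitespace are illustrative):

```sql
{% macro common_get_merge_sql(target, source, unique_key, dest_columns) -%}
    {%- set dest_cols_csv = dest_columns | map(attribute='quoted') | join(', ') -%}

    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE

    {% if unique_key %}
        on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}

        {#-- the update clause is only emitted when a unique_key is set --#}
        when matched then update set
        {% for column in dest_columns -%}
            {{ column.quoted }} = DBT_INTERNAL_SOURCE.{{ column.quoted }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% else %}
        on FALSE
    {% endif %}

    when not matched then insert ({{ dest_cols_csv }})
    values ({{ dest_cols_csv }})
{%- endmacro %}
```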

@bfil

bfil commented Jan 21, 2020

@drewbanin I know, but I still want to compare and insert only the new data based on that unique key, rather than merging on an always-FALSE condition. Does that make sense?

@drewbanin
Contributor

ah! Sure, that makes a lot of sense. What do you think about pushing this logic down from the materialization layer into the modeling layer? Could you do something like:

{{ config(materialized='incremental') }}

select id, col1, col2 from {{ ref('some_table') }}
{% if is_incremental() %}
where id not in (select id from {{ this }})
{% endif %}

This is a kind of funny adaptation of the typical incremental modeling approach, but it should serve to only select the new records from your source table, inserting them directly into the destination table.
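In other words, on an incremental run the model above would compile to something along these lines (relation names hypothetical):

```sql
-- dbt wraps the model SQL in an insert on incremental runs
insert into analytics.my_model (id, col1, col2)
select id, col1, col2 from analytics.some_table
where id not in (select id from analytics.my_model)
```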

@ee07dazn

ee07dazn commented Jun 8, 2020

@drewbanin: any update on where we are with this feature? I have a wide table [Snowflake] but only need to update a few columns in the incremental model. Any ETA on the feature?

@drewbanin
Contributor

hey @ee07dazn - no movement on this issue on our end to report. I think the spec I mentioned above (pasted below) is still a good idea:

{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}}

I think that this feature should only be supported on databases that support merge statements, as dbt's delete+insert approach doesn't really lend itself to this pattern. If anyone is interested in working on this one, I'd be happy to try to point you in the right direction.

The key change here will be to replace the call to adapter.get_columns_in_relation with config.get('update_columns'). We'll want to implement this for both the Snowflake and BigQuery incremental materializations when the merge strategy is used.
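A minimal sketch of that change inside the Snowflake incremental materialization, assuming we fall back to all columns when update_columns isn't configured (variable names are illustrative):

```sql
{%- set update_columns = config.get('update_columns') -%}
{% if update_columns %}
    {#-- user-specified subset drives the "when matched" clause --#}
    {% set dest_columns = update_columns %}
{% else %}
    {#-- default behavior: update every column in the target --#}
    {% set dest_columns = adapter.get_columns_in_relation(target_relation) %}
{% endif %}
```

One quirk to watch: update_columns would be a list of strings, while get_columns_in_relation returns column objects, so the merge macro would need to handle both shapes.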

@drewbanin drewbanin added the good_first_issue Straightforward + self-contained changes, good for new contributors! label Jun 8, 2020
@ee07dazn

ee07dazn commented Jun 8, 2020

Thanks @drewbanin for the information and the pointer. Apart from what you suggested, I had to make a change to the default__get_merge_sql macro to make that approach work. It works now, but I'm just not happy making a change to something as low-level as default__get_merge_sql. I think I can update the snowflake_get_merge_sql to do this job instead. Thinking out loud, but thanks for the help.

@ee07dazn

ee07dazn commented Jun 9, 2020

Hi @drewbanin: I have another question, and please bear with me if it's a silly one, as I am new to the tool and still exploring it through a POC.

So I need to update the incremental materialization to make sure I can use update_columns in the config, but the rest of the materialization does not need to change.

{% materialization incremental, adapter='snowflake' -%}

  {%- set unique_key = config.get('unique_key') -%}
  {%- set full_refresh_mode = config.get('should_full_refresh') -%}

  {% set target_relation = this %}
  {% set existing_relation = load_relation(this) %}
  {% set tmp_relation = make_temp_relation(this) %}

  {#-- Validate early so we don't run SQL if the strategy is invalid --#}
  {% set strategy = dbt_snowflake_validate_get_incremental_strategy(config) -%}

  -- setup
  {{ run_hooks(pre_hooks, inside_transaction=False) }}

  -- `BEGIN` happens here:
  {{ run_hooks(pre_hooks, inside_transaction=True) }}

  {% if existing_relation is none %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% elif existing_relation.is_view %}
    {#-- Can't overwrite a view with a table - we must drop --#}
    {{ log("Dropping relation " ~ target_relation ~ " because it is a view and this model is a table.") }}
    {% do adapter.drop_relation(existing_relation) %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% elif full_refresh_mode %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% else %}
    {% do run_query(create_table_as(True, tmp_relation, sql)) %}
    {% do adapter.expand_target_column_types(
           from_relation=tmp_relation,
           to_relation=target_relation) %}
    {% set dest_columns = config.get('update_columns') %}
    {% set build_sql = dbt_snowflake_get_incremental_sql(strategy, tmp_relation, target_relation, unique_key, dest_columns) %}
  {% endif %}

  {%- call statement('main') -%}
    {{ build_sql }}
  {%- endcall -%}

  {{ run_hooks(post_hooks, inside_transaction=True) }}

  -- `COMMIT` happens here
  {{ adapter.commit() }}

  {{ run_hooks(post_hooks, inside_transaction=False) }}

  {% set target_relation = target_relation.incorporate(type='table') %}
  {% do persist_docs(target_relation, model) %}
  {{ return({'relations': [target_relation]}) }}
{%- endmaterialization %}

In this code, while the SQL query gets executed correctly, I get the following error:

"Compilation Error in macro materialization_incremental_snowflake (macros/incremental.sql)
  'persist_docs' is undefined"

which points at the third-to-last line. I want the default macro to be executed when I have not redefined a particular macro, as in this case I don't need to.

How can I tell dbt to execute the default persist_docs macro in this model? If you prefer, I can create this as a separate issue, but it felt tied to the original question.

@ee07dazn

ee07dazn commented Jun 10, 2020

I redefined the persist_docs macro in my project and changed the materialization to point to my macro, but then I got:
" 'dict object' has no attribute 'default__persist_docs' "

UPDATE: I upgraded to v0.17.0 and there is no issue any longer. Thanks anyway for the help.

@drewbanin
Contributor

ha! Ok @ee07dazn - glad to hear that you got it working in 0.17.0! I think we should still proceed with getting a feature like this merged into dbt. If you're interested in opening a PR we'd be happy to help, otherwise, someone else might be able to pick it up in the future :)

@ee07dazn

@drewbanin : sorry, missed this comment. I will open a PR soon for this and tag you on it.

@cbheri

cbheri commented Aug 11, 2020

@drewbanin any update on implementing the update_columns config proposed above?

@prratek
Contributor

prratek commented Jan 13, 2021

@ee07dazn if you haven't had a chance to work on this yet, mind if I take a stab at it? @drewbanin other than the change to adapter.get_columns_in_relation, are there changes I'd have to make so that dbt accepts the new config parameter?

@drewbanin
Contributor

hey folks! No update from my side unfortunately, but please do feel free to open up a PR or follow up below if we want to jam on potential solutions together.

Just re-reading this thread and saw:

I think that this feature should only be supported on databases that support merge statements, as dbt's delete+insert approach doesn't really lend itself well to this approach.

I do think that's still the right approach, meaning that this feature should only be accessible on BigQuery/Snowflake and not Postgres/Redshift, for instance. @prratek I do think that replacing a call to get_columns_in_relation with config.get('update_columns') should work well! It's been a while since I've touched this part of the codebase though, so there might be some other quirks that come up in testing.

@prratek if you'd like, feel free to drop in a draft PR with some example changes and we can take things from there too!

@prratek
Contributor

prratek commented Feb 13, 2021

Okay here's a draft PR. We'd probably want to default to adapter.get_columns_in_relation if the update_columns config parameter isn't specified, right?
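That fallback could be as small as a one-liner in the materialization (a sketch, assuming the downstream merge macro can accept either a list of strings or a list of column objects):

```sql
{% set dest_columns = config.get('update_columns')
       or adapter.get_columns_in_relation(existing_relation) %}
```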

@jtcohen6
Contributor

Thanks for the draft PR @prratek! I can leave comments over there.

@jtcohen6 jtcohen6 added this to the Margaret Mead milestone Apr 13, 2021