Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of delete+insert incremental strategy #151

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ataft
Copy link

@ataft ataft commented Apr 10, 2024

resolves #150
resolves #364

Problem

The delete query for the 'delete+insert' incremental_strategy with 2+ unique_key columns is VERY inefficient. In many cases, it will hang and never return for deleting small amounts of data (<100K rows).

Solution

Improve the query by switching to a much more efficient delete strategy:

delete from table1
where (col1, col2) in (
    select distinct col1, col2 from table1_tmp
)

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development, and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX

resolves dbt-labs#150

Problem
The delete query for the 'delete+insert' incremental_strategy with 2+ unique_key columns is VERY inefficient. In many cases, it will hang and never return for deleting small amounts of data (<100K rows).

Solution
Improve the query by switching to a much more efficient delete strategy:

```
delete from table1
where (col1, col2) in (
    select distinct col1, col2 from table1_tmp
)
```
@Fleid
Copy link
Contributor

Fleid commented Apr 18, 2024

Hey @ataft, thank you for opening this here. Would you be comfortable writing tests for this PR?

@ataft
Copy link
Author

ataft commented Apr 22, 2024

@Fleid The existing tests should cover this. However, the issue with the original logic is that it technically works, but only for small amounts of data. Therefore, the tests do not catch the issue. To truly test, you need a database and ~100K rows. I'm not sure what dbt's strategy is for this.

@dbeatty10 dbeatty10 added the feature:performance Issues related to improving dbt's performance label May 15, 2024
@dbeatty10 dbeatty10 changed the title Fix incremental delete+insert SQL Improve performance of delete+insert incremental strategy May 15, 2024
using {{ source }}
where (
{% for key in unique_key %}
{{ source }}.{{ key }} = {{ target }}.{{ key }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterallenwebb and I took a look at this PR today.

This line of code is updated by #110, so we think that PR should be reviewed/merged prior to reviewing this PR further.

@colin-rogers-dbt colin-rogers-dbt added the feature:incremental Issues related to incremental materializations label Aug 9, 2024
@colin-rogers-dbt colin-rogers-dbt requested a review from a team as a code owner August 9, 2024 00:16
@cla-bot cla-bot bot added the cla:yes The PR author has signed the CLA label Aug 9, 2024
@colin-rogers-dbt colin-rogers-dbt added the community A PR, or an issue with a PR, from a community member label Aug 9, 2024
@dbeatty10
Copy link
Contributor

Here's the commands I'm using to do testing on this PR:

gh pr checkout 151
git push origin ataft/main

This created this branch in the dbt Labs org: ataft/main.

Then we can use that branch within individual GHA workflows for each adapter by following the process described here: #372 (comment).

Here is the result:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes The PR author has signed the CLA community A PR, or an issue with a PR, from a community member feature:incremental Issues related to incremental materializations feature:performance Issues related to improving dbt's performance
Projects
None yet
5 participants