Incremental Sync - Deduped WITHOUT History #3487

Closed
royt-via opened this issue May 19, 2021 · 9 comments
@royt-via

royt-via commented May 19, 2021

Tell us about the problem you're trying to solve

I'd like to sync data incrementally (records updated in place rather than appended), but I have no need for the history table.

Describe the solution you’d like

Having a sync option - Incremental Sync - Deduped (no history)

Describe the alternative you’ve considered or used

Using Incremental Sync - Deduped History

royt-via added the type/enhancement (New feature or request) label May 19, 2021

@ChristopheDuong
Contributor

Thanks for the issue!

Could you elaborate on the problem you're trying to solve?

What is the problem with having the history table generated in the destination? Are you concerned about the storage space it uses, or is there some other reason why it is unwanted?

@royt-via
Author

Hi @ChristopheDuong, I'm trying to avoid both unused/unnecessary tables and the extra insert/copy commands on the destination DB.

@ajzo90
Contributor

ajzo90 commented May 28, 2021

Hi! I would also like this feature.
In addition to what @royt-via said, I think the cleanup details in the documentation are a bit vague and open to interpretation as they stand. For example, am I allowed to clean up the history table immediately, effectively eliminating the history mode?
To resolve that, I would prefer to have 1) a strict mode where cleanup is not allowed, and 2) deduped without history.

> Note that in Incremental Deduped History, the size of the data in your warehouse increases monotonically since an updated record in the source is appended to the destination history table rather than updated in-place as it is done with the final table. If you only care about having the latest snapshot of your data, you may want to periodically run cleanup jobs which retain only the latest instance of each record in the history tables.
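
For illustration, the periodic cleanup job mentioned in the documentation quote could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the warehouse. The table name `users_history` and the columns `id` and `updated_at` are made up for the example and are not Airbyte's actual schema. Rows that tie on the cursor value are all kept, so a real job would need a tie-breaker.

```python
import sqlite3


def purge_superseded(conn, table, key_col, cursor_col):
    """Delete every row that has a newer row with the same key.

    Returns the number of rows removed.
    """
    # Identifiers cannot be bound as SQL parameters, so this sketch assumes
    # they come from trusted configuration rather than user input.
    sql = (
        f"DELETE FROM {table} "
        f"WHERE {cursor_col} < ("
        f"SELECT MAX(newer.{cursor_col}) FROM {table} AS newer "
        f"WHERE newer.{key_col} = {table}.{key_col})"
    )
    with conn:  # commit on success, roll back on error
        return conn.execute(sql).rowcount


def demo():
    # Hypothetical history table: two keys have superseded versions.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users_history (id INTEGER, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO users_history VALUES (?, ?, ?)",
        [
            (1, "Ann", "2021-05-01"),
            (1, "Anne", "2021-05-03"),
            (2, "Bob", "2021-05-02"),
            (2, "Bobby", "2021-05-04"),
            (3, "Cy", "2021-05-01"),
        ],
    )
    removed = purge_superseded(conn, "users_history", "id", "updated_at")
    remaining = conn.execute(
        "SELECT id, name FROM users_history ORDER BY id"
    ).fetchall()
    return removed, remaining
```

Running such a job right after every sync would effectively turn the history table into a latest-snapshot table, which is the ambiguity raised above.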

@ChristopheDuong
Contributor

ChristopheDuong commented May 28, 2021

Quick implementation:

One approach is to clean up extra rows after normalization:

  • The main work needed to implement this is to also purge the _airbyte_raw_* tables, which are in history mode by default, so that they too keep only the latest rows.
  • Disabling the generation of the _scd tables is trivial, but it requires the ability to change the generated dbt project settings...

Users will soon be able to implement this themselves thanks to #2959.

Long term implementation:

Another approach, though not as easy to execute, is to always drop the _airbyte_raw_* tables completely after the normalization process has run. This would require an "Incremental batch normalization" (see #2566 for more details).

@ajzo90
Contributor

ajzo90 commented May 29, 2021

I think of this proposal as an additional property in the protocol that each destination can opt in to support.
It is possible for a destination to support this feature without the normalization (as I understand the protocol, and since the normalization is optional)

@ajzo90
Contributor

ajzo90 commented May 29, 2021

A related use case is handling late-arriving facts (e.g. https://discourse.getdbt.com/t/on-the-limits-of-incrementality/303).
When a source transfers events with a window, say the last 3 days, it is neither a full nor an incremental sync.

In that case it would make sense for the destination to dedup without history.

On the source side, using windows is allowed, since it follows the "at-least-once" delivery principle.
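
To make the windowed case concrete, here is a minimal sketch of a destination that dedups without history. It keeps one row per primary key and only replaces it when the incoming cursor value is at least as new, so re-sending an overlapping window leaves the table unchanged. The record shape `(key, cursor, payload)` and the function name `upsert_window` are assumptions made for the example, not part of the Airbyte protocol.

```python
from typing import Dict, Iterable, Tuple

# (primary key, cursor value, payload): a hypothetical record shape.
Record = Tuple[str, int, str]


def upsert_window(table: Dict[str, Record], window: Iterable[Record]) -> None:
    """Apply one delivered window to the table, keeping the newest row per key."""
    for record in window:
        key, cursor, _payload = record
        current = table.get(key)
        # Replace only if the key is new or the incoming row is not older.
        if current is None or cursor >= current[1]:
            table[key] = record


def demo() -> Dict[str, Record]:
    table: Dict[str, Record] = {}
    day1 = [("a", 1, "a-v1"), ("b", 1, "b-v1")]
    # The next window overlaps day1: "a" is re-sent unchanged,
    # "b" arrives updated, and "c" is new.
    day2 = [("a", 1, "a-v1"), ("b", 2, "b-v2"), ("c", 2, "c-v1")]
    upsert_window(table, day1)
    upsert_window(table, day2)
    upsert_window(table, day2)  # a redelivery changes nothing
    return table
```

Because the upsert is idempotent, at-least-once delivery from the source is enough and no history rows accumulate.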

@ChristopheDuong
Contributor

ChristopheDuong commented May 31, 2021

> I think of this proposal as an additional property in the protocol that each destination can opt in to support.
> It is possible for a destination to support this feature without the normalization (as I understand the protocol, and since the normalization is optional)

Yes, you are right, there is also the option of a destination handling the sync mode completely on its own without relying on normalization at all.

The destination's spec.json has a supported_destination_sync_modes enum, and a new enum value could be added there to denote what you describe.
So even destinations that don't support normalization (or dbt) could still implement sync modes such as dedupe with or without history.
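
As a rough sketch of that idea, a destination could advertise the new mode in its spec and a caller could check for it before offering the option. The three modes listed in the spec fragment are an assumption about what a destination might declare, and `append_dedup_no_history` is a hypothetical value invented for this example. It is not an existing enum member.

```python
from typing import Any, Dict

# Assumed fragment of a destination's spec.json; the listed modes are an
# assumption about what a destination might declare.
SPEC: Dict[str, Any] = {
    "supported_destination_sync_modes": ["overwrite", "append", "append_dedup"],
}

# Hypothetical enum value, invented here for illustration only.
DEDUP_NO_HISTORY = "append_dedup_no_history"


def declare_mode(spec: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Return a copy of the spec that also advertises `mode` (no duplicates)."""
    modes = list(spec.get("supported_destination_sync_modes", []))
    if mode not in modes:
        modes.append(mode)
    return {**spec, "supported_destination_sync_modes": modes}


def supports(spec: Dict[str, Any], mode: str) -> bool:
    """Check whether the destination opted in to the given sync mode."""
    return mode in spec.get("supported_destination_sync_modes", [])
```

Destinations that never declare the value are unaffected, which matches the opt-in behaviour proposed earlier in the thread.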

@ajzo90
Contributor

ajzo90 commented May 31, 2021

That would be awesome. It adds some complexity to the protocol and UI.

@davinchia
Contributor

This is now possible with Destinations V2.
