
Feature: duplicate-proof replication using record hashes in STATE #161

Open
MeltyBot opened this issue Jul 6, 2021 · 6 comments

Comments

@MeltyBot
Contributor

MeltyBot commented Jul 6, 2021

Possible Spec

Taps could accept a new deduplication_mode config option (or similar), with a few possible values (sketched in the example after this list):

  • No deduplication: deduplication_mode=DEFAULT - (Default.) Same as today, records do not pass through a dedupe handler.
  • Efficient deduplication: deduplication_mode=INCREMENTAL - All incremental streams will be deduped. This is the most efficient balance, because only 'ties' need to be tracked.
  • Full deduplication: deduplication_mode=FULL_TABLE - All streams, including incremental and full table streams will be deduped. This is the most expensive option, because all record keys must be hashed. This will break many taps if the volume is large, but it could work extremely well in cases where the number of records is <10K, such as spreadsheet applications or message delivery applications.
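
As a minimal sketch, the setting could be declared on an SDK-based tap roughly as follows. The deduplication_mode property and its allowed values are hypothetical (taken from the proposal above), not part of the current SDK:

```python
# Hypothetical sketch only: `deduplication_mode` is not an SDK setting today.
from singer_sdk import Tap
from singer_sdk import typing as th  # JSON Schema typing helpers


class TapExample(Tap):
    """Example SDK-based tap exposing the proposed setting."""

    name = "tap-example"

    config_jsonschema = th.PropertiesList(
        th.Property(
            "deduplication_mode",
            th.StringType,
            default="DEFAULT",
            allowed_values=["DEFAULT", "INCREMENTAL", "FULL_TABLE"],
            description=(
                "DEFAULT: no dedupe handling (current behavior). "
                "INCREMENTAL: dedupe incremental streams only. "
                "FULL_TABLE: dedupe all streams (most expensive)."
            ),
        ),
    ).to_dict()

    def discover_streams(self):
        return []  # stream discovery omitted for brevity
```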

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/162

Originally created by @aaronsteers on 2021-07-06 18:51:05


Background / Discussion

As part of the Singer spec, and also as a best practice to avoid records being skipped, taps are implemented with a greater-than-or-equal-to comparison against the replication key value.

The "or equal to" part of the comparison is confusing to new users but the reason for it is important: to ensure record "ties" are never omitted from the final target.

Existing mitigations

The current mitigations for this problem are to (1) use primary key upserts on the target side, which naturally resolve the duplication, or (2) use a downstream tool such as dbt to remove duplicates.

Proposed improvement

  1. Continue, as today, to use "greater than or equal to" logic.
  2. Add to the STATE dictionary a new record_hashes_seen property (or similar): a list of hashes of records whose replication key equals the max replication key value (see the sketch after this list).
    • As we parse each record during a stream, if its value equals the current max replication key (which will be true for every record if the stream is sorted), we store a hash of the record in STATE alongside the max replication key value.
  3. On the next sync, if the hash of a record exactly matches a hash in record_hashes_seen, we can omit that record and not send it downstream.
  4. Any other ties by replication_key_value will be emitted downstream to the target, assuming their hash has not yet been seen/sent.
  5. When writing out STATE messages, only the latest "ties" need to be included in record_hashes_seen. This is not a cumulative list of all records, only the latest ties by replication key.
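
A minimal sketch of these mechanics under the assumptions above; record_hash, should_emit, update_state, and the STATE keys shown are illustrative names, not existing SDK APIs:

```python
# Hypothetical sketch of the proposed dedupe handling; none of these names
# exist in the SDK today.
import hashlib
import json


def record_hash(record: dict) -> str:
    """Stable hash of a record's full content."""
    return hashlib.sha256(
        json.dumps(record, sort_keys=True, default=str).encode()
    ).hexdigest()


def should_emit(record: dict, replication_key: str, state: dict) -> bool:
    """Return True if the record should be sent downstream."""
    value = record[replication_key]
    max_value = state.get("replication_key_value")
    hashes_seen = state.get("record_hashes_seen", [])

    # Records tied with the previously saved max are skipped if already seen.
    if value == max_value and record_hash(record) in hashes_seen:
        return False
    return True


def update_state(record: dict, replication_key: str, state: dict) -> None:
    """Track only the hashes of 'ties' at the current max replication key value."""
    value = record[replication_key]
    if value == state.get("replication_key_value"):
        # Another tie at the current max: remember its hash.
        state.setdefault("record_hashes_seen", []).append(record_hash(record))
    elif state.get("replication_key_value") is None or value > state["replication_key_value"]:
        # New max: reset the tie list so STATE stays small.
        state["replication_key_value"] = value
        state["record_hashes_seen"] = [record_hash(record)]
```

Keeping only the hashes of ties at the max value (rather than all hashes) is what keeps the STATE payload small, per item 5 above.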

Since the SDK entirely handles the State implementation for SDK-based taps, we have an opportunity to build this as a more robust solution across all taps using the SDK.

Best reasons not to build

The reasons not to build this are (1) performance, (2) complexity, and (3) scalability.

The complexity argument is mitigated by the fact that these aspects are managed entirely within the SDK, so developers and users can in general ignore these internal STATE treatments.

Regarding performance, hashing a record should be very fast; the rest is a matter of tuning the caching and comparison logic. This should be tunable to reach satisfactory performance, but if not, we could also mitigate by making the behavior an optional setting, such as dedupe_incremental_streams (bool), or similar.

Regarding scalability, this solution should scale fine assuming a small number of "ties" by replication key value. If the number of ties reaches the thousands or more, however, this could have adverse effects on the stability of STATE messages. An option to disable the behavior via settings (as described in the previous paragraph) could prove useful for this as well. That said, presumably the larger the number of ties (within reason), the higher the value of this deduplication capability.

Regarding adherence to the Singer Spec

To my knowledge, this implementation would still adhere to the spec, since (1) the STATE behavior is entirely up to the tap to control, (2) we still apply >= logic for replication keys, and (3) we still guarantee that every record will be sent at least once.

@labelsync-manager labelsync-manager bot added the kind/Feature New feature or request label Jun 23, 2022
@aaronsteers aaronsteers changed the title Spec proposal for duplicate-proof incremental replication Feature: duplicate-proof incremental replication Nov 11, 2022
@aaronsteers aaronsteers changed the title Feature: duplicate-proof incremental replication Feature: duplicate-proof replication Nov 11, 2022
@aaronsteers aaronsteers changed the title Feature: duplicate-proof replication Feature: duplicate-proof replication using record hashes Nov 16, 2022
@aaronsteers aaronsteers changed the title Feature: duplicate-proof replication using record hashes Feature: duplicate-proof replication using record hashes in STATE Nov 16, 2022
@aaronsteers aaronsteers moved this to To Discuss in Office Hours Nov 16, 2022
@aaronsteers aaronsteers moved this from To Discuss to Up Next in Office Hours Nov 16, 2022
@aaronsteers
Contributor

@pnadolny13 - FYI, I updated the description above to include a proposed spec, added the Accepting Pull Requests label, and added this to our Office Hours 'Up Next' list.

Related to:

@stale

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@pnadolny13 pnadolny13 removed the stale label Jul 19, 2023
@pnadolny13
Contributor

Still relevant

@stale

stale bot commented Jul 18, 2024

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2024
@stale stale bot closed this as completed Aug 8, 2024
@edgarrmondragon
Collaborator

edgarrmondragon commented Aug 8, 2024

I see the 👎 in the bot's post so I'll reopen 😅 . Feel free to comment with how this would benefit your workflow or even better, some implementation ideas :)
