Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solve CDC ordering issues #21

Merged
merged 3 commits into from
Jul 10, 2023
Merged

Solve CDC ordering issues #21

merged 3 commits into from
Jul 10, 2023

Conversation

evantahler
Copy link
Contributor

@evantahler evantahler commented Jul 10, 2023

This PR does better than #20 in that it solves out-of-order CDC inserts and deletes, even in the case when the PK should come back after a delete.

The main insight is that to keep this performant, we need to do all logical comparisons based on the order of things, by cursor, within the final table. So, we temporarily add back each remaining CDC-deleted row which still has an entry in the raw table so we can consider wether it is newer or older than new records. Then, like always, we'll remove any deleted_at=true records at the end of the transaction. We still need to watch out for airbytehq/airbyte#27923, which means we cannot do any joins or cross-compares between the raw and final tables.

This all remains fine, because we do all of the above within a transaction the user will never see "for real"

@evantahler evantahler requested a review from alex-gron July 10, 2023 20:21
@evantahler evantahler marked this pull request as ready for review July 10, 2023 20:21
@evantahler evantahler requested a review from edgao July 10, 2023 20:23
@evantahler evantahler changed the title solve CDC ordering issues Solve CDC ordering issues Jul 10, 2023
Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if we'll regret doing full table scans... but this seems correct!

)
WHERE
JSON_VALUE(`_airbyte_data`, '$._ab_cdc_deleted_at') IS NOT NULL
OR JSON_TYPE(JSON_QUERY(`_airbyte_data`, '$._ab_cdc_deleted_at')) = 'null'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be flipped?

Suggested change
OR JSON_TYPE(JSON_QUERY(`_airbyte_data`, '$._ab_cdc_deleted_at')) = 'null'
OR JSON_TYPE(JSON_QUERY(`_airbyte_data`, '$._ab_cdc_deleted_at')) != 'null'

... maybe we can delete the entire JSON_VALUE thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out, in airbytehq/airbyte#28029 I removed all the 'null'(string) checks, because yep ^ that's what we wanted all along. - 7fad41a

Copy link

@alex-gron alex-gron Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are both needed today?

Oops I think I left this comment before you responded, Evan! You can ignore since the code was removed.

@evantahler evantahler merged commit 7602c36 into main Jul 10, 2023
@evantahler evantahler deleted the evan/cdc-delete-update branch July 10, 2023 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants