Solve CDC ordering issues #21
I'm curious if we'll regret doing full table scans... but this seems correct!
one-table-bigquery.sql
Outdated
)
WHERE
  JSON_VALUE(`_airbyte_data`, '$._ab_cdc_deleted_at') IS NOT NULL
  OR JSON_TYPE(JSON_QUERY(`_airbyte_data`, '$._ab_cdc_deleted_at')) = 'null'
I think this needs to be flipped?
- OR JSON_TYPE(JSON_QUERY(`_airbyte_data`, '$._ab_cdc_deleted_at')) = 'null'
+ OR JSON_TYPE(JSON_QUERY(`_airbyte_data`, '$._ab_cdc_deleted_at')) != 'null'
... maybe we can delete the entire JSON_VALUE thing
It turns out that in airbytehq/airbyte#28029 I removed all the 'null' (string) checks, because yep ^ that's what we wanted all along. - 7fad41a
Why are both needed today?
Oops I think I left this comment before you responded, Evan! You can ignore since the code was removed.
This PR does better than #20 in that it solves out-of-order CDC inserts and deletes, even in the case when the PK should come back after a delete.
The main insight is that, to keep this performant, we need to do all logical comparisons based on the order of things, by cursor, within the final table. So, we temporarily add back each remaining CDC-deleted row which still has an entry in the raw table, so we can consider whether it is newer or older than the new records. Then, like always, we'll remove any `deleted_at=true` records at the end of the transaction. We still need to watch out for airbytehq/airbyte#27923, which means we cannot do any joins or cross-compares between the raw and final tables. This all remains fine, because we do all of the above within a transaction whose intermediate state the user will never see "for real".
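
To make the ordering argument concrete, here is a minimal in-memory sketch of the idea in Python. It is not the SQL from this PR: `Row` and `merge_batch` are hypothetical names invented for illustration, `cursor` is assumed to be a simple comparable integer, and the "transaction" is modelled as a working list that is only returned once the tombstones have been dropped again.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical, simplified record; not the real Airbyte schema."""
    pk: int
    cursor: int                # assumed ordering column (e.g. an LSN or updated_at)
    deleted_at: Optional[str]  # stands in for _ab_cdc_deleted_at
    payload: str


def merge_batch(final_rows: List[Row],
                raw_tombstones: List[Row],
                new_rows: List[Row]) -> List[Row]:
    """Model one sync: re-add tombstones, dedupe by cursor, drop deletes."""
    # 1. Inside the "transaction": temporarily put the CDC-deleted rows that
    #    still exist in the raw table back next to the final rows and the
    #    new records, so every comparison happens in one place.
    working = list(final_rows) + list(raw_tombstones) + list(new_rows)

    # 2. Keep only the row with the highest cursor for each primary key.
    newest = {}
    for row in working:
        current = newest.get(row.pk)
        if current is None or row.cursor > current.cursor:
            newest[row.pk] = row

    # 3. Remove the deleted rows again before "committing", so the caller
    #    never observes the temporarily restored tombstones.
    survivors = [r for r in newest.values() if r.deleted_at is None]
    return sorted(survivors, key=lambda r: r.pk)
```

With a tombstone at cursor 2 for `pk=1`, a late-arriving insert at cursor 1 stays deleted, while an insert at cursor 3 brings the PK back. Those are the two cases the description above calls out.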