
Change from T&D every 10k records to an increasing time based interval #28130

Merged: 5 commits into edgao/1s1t_redeploy on Jul 11, 2023

Conversation

@jbfbell (Contributor) commented Jul 11, 2023:

Instead of typing and deduping every 10k records, use an increasing interval so users can see new records quickly at the start of a sync without slowing the entire sync down too much.

Closes #27920
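For context, here is a minimal sketch of the increasing-interval idea, assuming hypothetical interval values, class, and field names (the PR's real code lives in TypeAndDedupeOperationValve):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: each stream starts at the shortest interval and steps up to the next
// (longer) interval after every type-and-dedupe run, so early records surface
// quickly while long-running syncs pay the T&D cost less and less often.
class IncreasingIntervalSketch {

  // Hypothetical values; the PR hard-codes its own list in typeAndDedupeIncreasingIntervals.
  private static final List<Long> INTERVALS_MS =
      List.of(0L, 1000L * 60, 1000L * 60 * 5, 1000L * 60 * 15);

  private final ConcurrentHashMap<String, Long> lastRunMs = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, Integer> intervalIndex = new ConcurrentHashMap<>();

  boolean readyToTypeAndDedupe(final String stream) {
    final long last = lastRunMs.computeIfAbsent(stream, s -> System.currentTimeMillis());
    final int idx = intervalIndex.computeIfAbsent(stream, s -> 0);
    return System.currentTimeMillis() - last >= INTERVALS_MS.get(idx);
  }

  void updateTimeAndIncreaseInterval(final String stream) {
    // Step to the next interval, capped at the longest one.
    intervalIndex.merge(stream, 1, (i, inc) -> Math.min(i + inc, INTERVALS_MS.size() - 1));
    lastRunMs.put(stream, System.currentTimeMillis());
  }
}
```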

@github-actions (Contributor) commented:

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan and you've followed all steps in the Breaking Changes Checklist
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • The connector tests are passing in CI
  • You've updated the connector's metadata.yaml file (new!)
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete but the CI check is failing:

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@jbfbell changed the title from "joseph.bell/27920/10k records limit" to "Change from T&D every 10k records to an increasing time based interval" on Jul 11, 2023
// However, as their destination tables grow in size, typing and de-duping data becomes an expensive operation
// To strike a balance between showing data quickly and not slowing down the entire sync, we use an increasing
// interval based approach. This is not fancy, just hard coded intervals.
private static final List<Long> typeAndDedupeIncreasingIntervals = List.of(
jbfbell (Contributor, Author) commented:

These are somewhat arbitrary, open to changing them or making them configurable based on sync type?

@evantahler (Contributor) left a comment:

Commenting only on the logic, not the code - this is a great idea!


private static final long TEN_MINUTES_MILLIS = 1000 * 60 * 10;

private static final long FIFTEEN_MINUTES_MILLIS = 1000 * 60 * 15;
Contributor commented:

Maybe add a comment that 15min is the max time between checkpoints defined by the protocol.

jbfbell (Contributor, Author) replied:

done
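The committed comment presumably reads along these lines (a sketch of the change, not the exact diff; the 15-minute checkpoint cap comes from the reviewer's note above):

```java
// The Airbyte protocol defines 15 minutes as the maximum time allowed between
// checkpoints, so it also serves as the cap on the T&D interval.
private static final long TEN_MINUTES_MILLIS = 1000 * 60 * 10;
private static final long FIFTEEN_MINUTES_MILLIS = 1000 * 60 * 15;
```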

// This is just modular arithmetic written in a complicated way. We want to run T+D every
// RECORDS_PER_TYPING_AND_DEDUPING_BATCH records.
// TODO this counter should be per stream, not global.
if (recordsSinceLastTDRun.getAndUpdate(l -> (l + 1) % RECORDS_PER_TYPING_AND_DEDUPING_BATCH) == RECORDS_PER_TYPING_AND_DEDUPING_BATCH - 1) {
@evantahler (Contributor) commented Jul 11, 2023:

I don't know if getting the current time in Java is slow or not... but there's probably some overhead to checking this every record. We might want to keep the modulo above, and only do this check every N records to save some compute time. Maybe every 100?

jbfbell (Contributor, Author) replied:

Did some testing; grabbing the system time for each record does add a bit of overhead. Updated the TypeAndDedupeOperationValve with a record-count check so the timing is only checked every 100 records.
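A minimal sketch of that gate, assuming a hypothetical counter field and threshold constant (it reuses the modular-arithmetic trick from the snippet above):

```java
import java.util.concurrent.atomic.AtomicLong;

class TimingGateSketch {
  // Hypothetical: only consult the wall clock once per 100 records.
  private static final int RECORDS_PER_TIME_CHECK = 100;
  private final AtomicLong recordsSinceLastCheck = new AtomicLong();

  // Returns true on every 100th call; only then do we actually read
  // System.currentTimeMillis() and evaluate the interval.
  boolean shouldCheckTiming() {
    return recordsSinceLastCheck.getAndUpdate(l -> (l + 1) % RECORDS_PER_TIME_CHECK)
        == RECORDS_PER_TIME_CHECK - 1;
  }
}
```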

@octavia-squidington-iii (Collaborator) commented:

destination-bigquery-denormalized test report (commit 47c4ad4caa) - ✅

⏲️ Total pipeline duration: 17mn22s

Steps:
Validate airbyte-integrations/connectors/destination-bigquery-denormalized/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-bigquery-denormalized docker image for platform linux/x86_64
./gradlew :airbyte-integrations:connectors:destination-bigquery-denormalized:integrationTest

🔗 View the logs here

Please note that tests are only run on PRs that are ready for review. Set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=destination-bigquery-denormalized test

@octavia-squidington-iii (Collaborator) commented:

destination-bigquery test report (commit 47c4ad4caa) - ❌

⏲️ Total pipeline duration: 27mn59s

Steps:
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
Connector version increment check
QA checks
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Build airbyte/normalization:dev
./gradlew :airbyte-integrations:connectors:destination-bigquery:integrationTest

🔗 View the logs here

Please note that tests are only run on PRs that are ready for review. Set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=destination-bigquery test

@octavia-squidington-iii (Collaborator) commented:

destination-bigquery-denormalized test report (commit 43ed6feb3a) - ✅

⏲️ Total pipeline duration: 17mn30s

Steps:
Validate airbyte-integrations/connectors/destination-bigquery-denormalized/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-bigquery-denormalized docker image for platform linux/x86_64
./gradlew :airbyte-integrations:connectors:destination-bigquery-denormalized:integrationTest

🔗 View the logs here

Please note that tests are only run on PRs that are ready for review. Set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=destination-bigquery-denormalized test

@octavia-squidington-iii (Collaborator) commented:

destination-bigquery test report (commit 43ed6feb3a) - ✅

⏲️ Total pipeline duration: 01mn28s

Steps:
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
Connector version increment check
QA checks
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Build airbyte/normalization:dev
./gradlew :airbyte-integrations:connectors:destination-bigquery:integrationTest

🔗 View the logs here

Please note that tests are only run on PRs that are ready for review. Set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=destination-bigquery test

@jbfbell jbfbell merged commit 2dc7d77 into edgao/1s1t_redeploy Jul 11, 2023
@jbfbell jbfbell deleted the joseph.bell/27920/10k-records-limit branch July 11, 2023 18:25
@edgao (Contributor) left a comment:

Had a couple of nitpicky comments, but given that this is intended as a temporary thing, I don't feel super strongly. We're only planning to keep this until the async work exists for standard inserts, right? (At which point we'll run typing+deduping every time there's new raw data.)

(sorry for dumping a bunch of comments right after you merge - none of them are blocking, feel free to ignore)

* A slightly more complicated way to keep track of when to perform type and dedupe operations per
* stream
*/
public class TypeAndDedupeOperationValve extends ConcurrentHashMap<AirbyteStreamNameNamespacePair, Long> {
Contributor commented:

can we just have a private ConcurrentHashMap instead of extending the class?

jbfbell (Contributor, Author) replied:

Yeah, that would probably have been a smarter approach. When I started off I thought we might want it to be more like a Map, but it has evolved enough that it should probably just be its own thing.
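Sketched out, the reviewer's suggestion would look something like this (hypothetical shape; String stands in for the real AirbyteStreamNameNamespacePair key type to keep the sketch self-contained):

```java
import java.util.concurrent.ConcurrentHashMap;

// Composition instead of inheritance: the valve owns a private map rather than
// exposing ConcurrentHashMap's full API to callers.
public class TypeAndDedupeOperationValve {

  private final ConcurrentHashMap<String, Long> lastRunByStream = new ConcurrentHashMap<>();

  // Hypothetical delegating accessor; callers never touch the map directly.
  public long lastRun(final String stream) {
    return lastRunByStream.getOrDefault(stream, 0L);
  }
}
```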

* @return a boolean indicating whether we have crossed the interval threshold for typing and
* deduping.
*/
public boolean readyToTypeAndDedupe(final AirbyteStreamNameNamespacePair key) {
Contributor commented:

nit: rename to incrementRecordCount (or some other verb name) since we're mutating state here

jbfbell (Contributor, Author) replied:

good call

private final Supplier<Long> nowness;
private ConcurrentHashMap<AirbyteStreamNameNamespacePair, Long> recordCounts;

public TypeAndDedupeOperationValve() {
Contributor commented:

Does it work to have this accept a ParsedCatalog? I.e., prefill all the maps instead of checking containsKey everywhere. Then I think you can delete addStream entirely.

jbfbell (Contributor, Author) replied:

good idea!
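A rough sketch of that prefilling constructor, with hypothetical stand-in types since the real ParsedCatalog and stream-id shapes aren't shown here:

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-ins for the real airbyte types, just so the sketch compiles.
record StreamId(String namespace, String name) {}
record ParsedCatalogSketch(List<StreamId> streams) {}

class ValvePrefillSketch {
  private final Supplier<Long> nowness = System::currentTimeMillis;
  private final ConcurrentHashMap<StreamId, Long> recordCounts = new ConcurrentHashMap<>();

  // Prefill one entry per stream in the catalog so readyToTypeAndDedupe never
  // needs a containsKey check, and addStream can be deleted entirely.
  ValvePrefillSketch(final ParsedCatalogSketch catalog) {
    catalog.streams().forEach(stream -> recordCounts.put(stream, 0L));
  }
}
```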

}

/**
* This constructor is here because mocking System.currentTimeMillis() is a pain :(
Contributor commented:

We could also receive currentTimeMillis as a method param in readyToTypeAndDedupe. Then your tests wouldn't need to have a stateful lambda. E.g. callers would do readyToTypeAndDedupe(key, () -> System.currentTimeMillis()), and tests could just hardcode readyToTypeAndDedupe(key, () -> 42).

kind of a pain to make this change though, your current implementation also works fine 🤷
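In code, the suggested alternative might look like this (a sketch with the interval logic stubbed out; field names are illustrative):

```java
import java.util.function.Supplier;

class ClockParamSketch {
  private long lastRunMs = 0L;
  private long currentIntervalMs = 0L;

  // The clock is passed per call: production code hands in the real clock,
  // tests hand in a hardcoded one, so no stateful lambda is needed.
  boolean readyToTypeAndDedupe(final String stream, final Supplier<Long> now) {
    return now.get() - lastRunMs >= currentIntervalMs;
  }
}

// Callers: valve.readyToTypeAndDedupe(key, System::currentTimeMillis);
// Tests:   valve.readyToTypeAndDedupe(key, () -> 42L);
```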

}

/**
* Meant to be called after
Contributor commented:

why can't this happen inside readyToTypeAndDedupe?

jbfbell (Contributor, Author) replied:

I liked the separation of concerns here. Readiness to type and dedupe and increasing the interval aren't necessarily coupled even if I've stated they're meant to be 😄.

octavia-approvington added a commit that referenced this pull request Jul 14, 2023
* Revert "Revert "Destination Bigquery: Scaffolding for destinations v2 (#27268)""

This reverts commit 348c577.

* version bumps+changelog

* Speed up BQ by having 2 queries, and not an OR (#27981)

* 🐛 Destination Bigquery: fix bug in standard inserts for syncs >10K records (#27856)

* only run t+d code if it's enabled

* dockerfile+changelog

* remove changelog entry

* Destinations V2: handle optional fields for `object` and `array` types (#27898)

* catch null schema

* fix null properties

* clean up

* consolidate + add more tests

* try catch

* empty json test

* Automated Commit - Formatting Changes

* remove todo

* destination bigquery: misc updates to 1s1t code (#28057)

* switch to checkedconsumer

* add unit test for buildColumnId

* use flag

* restructure prefix check

* fix build

* more type-parsing fixes (#28100)

* more type-parsing fixes

* handle duplicates

* Automated Commit - Format and Process Resources Changes

* add tests for asColumns

* Automated Commit - Format and Process Resources Changes

* log warnings instead of throwing exception

* better log message

* error level

---------

Co-authored-by: edgao <[email protected]>

* Automated Commit - Formatting Changes

* Improve protocol type parsing (#28126)

* Automated Commit - Formatting Changes

* Change from T&D every 10k records to an increasing time based interval (#28130)

* fifteen minute t&d

* add typing and deduping operation valve for increased intervals of typing and deduping

* Automated Commit - Format and Process Resources Changes

* resolve bizarre merge conflict

* Automated Commit - Format and Process Resources Changes

---------

Co-authored-by: jbfbell <[email protected]>

* Simplify and speed up CDC delete support [DestinationsV2] (#28029)

* Simplify and speed up CDC delete support [DestinationsV2]

* better QUOTE

* spotbugs?

* recompile dbt image for local arch and use that when building images

* things compile, but tests fail

* tests working-ish

* comment

* fix logic to re-insert deleted records for cursor comparison.

tests pass!

* remove comment

* Skip CDC re-include logic if there are no CDC columns

* stop hardcoding pk (#28092)

* wip

* remove TODOs

---------

Co-authored-by: Edward Gao <[email protected]>

* update method name

* Automated Commit - Formatting Changes

* depend on pinned normalization version

* implement 1s1t DATs for destination-bigquery (#27852)

* intiial implementation

* Automated Commit - Formatting Changes

* add second sync to test

* do concurrent things

* Automated Commit - Formatting Changes

* clarify comment

* minor tweaks

* more stuff

* Automated Commit - Formatting Changes

* minor cleanup

* lots of fixes

* handle sql vs json null better
* verify extra columns
* only check deleted_at if in DEDUP mode and the column exists
* add full refresh append test case

* Automated Commit - Formatting Changes

* add tests for the remaining sync modes

* Automated Commit - Formatting Changes

* readability stuff

* Automated Commit - Formatting Changes

* add test for gcs mode

* remove static fields

* Automated Commit - Formatting Changes

* add more test cases, tweak test scaffold

* cleanup

* Automated Commit - Formatting Changes

* extract recorddiffer

* and use it in the sql generator test

* fix

* comment

* naming+comment

* one more comment

* better assert

* remove unnecessary thing

* one last thing

* Automated Commit - Formatting Changes

* enable concurrent execution on all java integration tests

* add test for default namespace

* Automated Commit - Formatting Changes

* implement a 2-stream test

* Automated Commit - Formatting Changes

* extract methods

* invert jsonNodesNotEquivalent

* Automated Commit - Formatting Changes

* fix conditional

* pull out diffSingleRecord

* Automated Commit - Formatting Changes

* handle nulls correctly

* remove raw-specific handling; break up methods

* Automated Commit - Formatting Changes

---------

Co-authored-by: edgao <[email protected]>
Co-authored-by: octavia-approvington <[email protected]>

* Destinations V2: move create raw tables earlier (#28255)

* move create raw tables

* better log message

* stop building normalization (#28256)

* fix ability to run tests

* disable incremental t+d for now

* Automated Commit - Formatting Changes

---------

Co-authored-by: Evan Tahler <[email protected]>
Co-authored-by: Cynthia Yin <[email protected]>
Co-authored-by: cynthiaxyin <[email protected]>
Co-authored-by: edgao <[email protected]>
Co-authored-by: Joe Bell <[email protected]>
Co-authored-by: jbfbell <[email protected]>
Co-authored-by: octavia-approvington <[email protected]>
efimmatytsin pushed a commit to scentbird/airbyte that referenced this pull request Jul 27, 2023