Ensure a sequential write to DM by always using a single output #898

thorkildcognite · 2024-02-08T01:59:07Z

We want to go slow to go fast with data modeling, and never write with more than one partition at a time. This change ensures this by repartitioning down to 1 partition before emitting the result.

Since we almost always have 200 partitions going into the write, we would always clamp this down to the ideal, which was set to 1. But, if we actually had less than 50 partitions going into this, it would not be repartitioned and we would actually try to write to DMS with 50 parallel writes (potentially).

I first didn't find any way to trigger this, but I think I can trigger it by reading directly from time series, sequences etc., which have 50 or smaller as default partitions.

This is very hard to verify in production, but there should be no clear downside to ensuring this.

We want to go slow to go fast with data modeling, and never write with more than one partition at a time. This ensures this by repartitioning down to 1 partition before emitting the result. Since we almost always had 200 partitions going into the write, we would always clamp this down to the ideal, which was set to 1. But, if we actually had less than 50 partitions going into this, it would not be repartitioned and we would actually try to write to DMS with 50 parallel writes. I first didn't find any way to trigger this, but I think I can trigger it by reading directly from time series, sequences etc., which have 50 or smaller as default partitions. This is very hard to verify in production, but there should be no clear downside to ensuring this.

codecov · 2024-02-08T02:08:00Z

Codecov Report

Merging #898 (992c7c3) into master (b77b607) will increase coverage by 0.00%.
The diff coverage is 75.00%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #898   +/-   ##
=======================================
  Coverage   80.76%   80.77%           
=======================================
  Files          43       43           
  Lines        2839     2840    +1     
  Branches      115      121    +6     
=======================================
+ Hits         2293     2294    +1     
  Misses        546      546

Files	Coverage Δ
...rc/main/scala/cognite/spark/v1/DefaultSource.scala	`93.08% <75.00%> (+0.03%)`	⬆️

thorkildcognite requested a review from a team as a code owner February 8, 2024 01:59

thorkildcognite marked this pull request as draft February 8, 2024 09:29

thorkildcognite added the DO-NOT-MERGE label Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure a sequential write to DM by always using a single output #898

Ensure a sequential write to DM by always using a single output #898

thorkildcognite commented Feb 8, 2024

codecov bot commented Feb 8, 2024 •

edited

Loading

Ensure a sequential write to DM by always using a single output #898

Are you sure you want to change the base?

Ensure a sequential write to DM by always using a single output #898

Conversation

thorkildcognite commented Feb 8, 2024

codecov bot commented Feb 8, 2024 • edited Loading

Codecov Report

codecov bot commented Feb 8, 2024 •

edited

Loading