Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure a sequential write to DM by always using a single output #898

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

thorkildcognite
Copy link
Contributor

We want to go slow to go fast with data modeling, and never write with more than one partition at a time. This change ensures this by repartitioning down to 1 partition before emitting the result.

Since we almost always have 200 partitions going into the write, we would always clamp this down to the ideal, which was set to 1. But, if we actually had less than 50 partitions going into this, it would not be repartitioned and we would actually try to write to DMS with 50 parallel writes (potentially).

I first didn't find any way to trigger this, but I think I can trigger it by reading directly from time series, sequences etc., which have 50 or smaller as default partitions.

This is very hard to verify in production, but there should be no clear downside to ensuring this.

We want to go slow to go fast with data modeling, and never write with
more than one partition at a time. This ensures this by repartitioning
down to 1 partition before emitting the result.

Since we almost always had 200 partitions going into the write, we
would always clamp this down to the ideal, which was set to 1. But, if
we actually had less than 50 partitions going into this, it would not
be repartitioned and we would actually try to write to DMS with 50
parallel writes.

I first didn't find any way to trigger this, but I think I can trigger
it by reading directly from time series, sequences etc., which have 50
or smaller as default partitions.

This is very hard to verify in production, but there should be no
clear downside to ensuring this.
@thorkildcognite thorkildcognite requested a review from a team as a code owner February 8, 2024 01:59
Copy link

codecov bot commented Feb 8, 2024

Codecov Report

Merging #898 (992c7c3) into master (b77b607) will increase coverage by 0.00%.
The diff coverage is 75.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #898   +/-   ##
=======================================
  Coverage   80.76%   80.77%           
=======================================
  Files          43       43           
  Lines        2839     2840    +1     
  Branches      115      121    +6     
=======================================
+ Hits         2293     2294    +1     
  Misses        546      546           
Files Coverage Δ
...rc/main/scala/cognite/spark/v1/DefaultSource.scala 93.08% <75.00%> (+0.03%) ⬆️

@thorkildcognite thorkildcognite marked this pull request as draft February 8, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant