perf: batch json decode checkpoint actions when writing to parquet #2983

Merged: 1 commit, Nov 13, 2024


alexwilcoxson-rel
Contributor

Description

This change pushes more serialized JSON actions into the decoder before flushing. For a log with tens of thousands of actions, the current implementation took ~18 seconds; this change dropped it to 3 seconds.

Related Issue(s)

n/a

Documentation

https://docs.rs/arrow-json/53.2.0/arrow_json/reader/struct.Decoder.html#method.decode

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Nov 10, 2024
(cherry picked from commit 12abf00)
Signed-off-by: Alex Wilcoxson <[email protected]>
@alexwilcoxson-rel alexwilcoxson-rel force-pushed the checkpoint-batch-upstream branch from d732035 to 4add802 Compare November 10, 2024 15:43

codecov bot commented Nov 10, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.27%. Comparing base (7a3b3ec) to head (4add802).
Report is 3 commits behind head on main.

Files with missing lines                    Patch %    Lines
crates/core/src/protocol/checkpoints.rs     66.66%     0 Missing and 3 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2983   +/-   ##
=======================================
  Coverage   72.26%   72.27%           
=======================================
  Files         128      128           
  Lines       40329    40334    +5     
  Branches    40329    40334    +5     
=======================================
+ Hits        29143    29150    +7     
+ Misses       9334     9331    -3     
- Partials     1852     1853    +1     


@hntd187
Collaborator

hntd187 commented Nov 10, 2024

Does it make sense to tie it to record batch size? Could we instead pull that into its own configuration, or at least a constant?

@alexwilcoxson-rel
Contributor Author

> Does it make sense to tie it to record batch size? Could we instead pull that into its own configuration, or at least a constant?

I think if the chunk size is less than the checkpoint batch size, then each decode and flush will yield a batch smaller than the checkpoint batch size constant. For example:

    const CHECKPOINT_RECORD_BATCH_SIZE: usize = 5000;

    let mut decoder = ReaderBuilder::new(arrow_schema)
        .with_batch_size(CHECKPOINT_RECORD_BATCH_SIZE)
        .build_decoder()?;

    // Count of actions
    let mut total_actions = 0;

    for chunk in &jsons.chunks(2500) {
        let mut buf = Vec::new();
        // write 2500 serialized json objects to buffer
        for j in chunk {
            serde_json::to_writer(&mut buf, &j?)?;
            total_actions += 1;
        }
        // internally buffers 2500 objects and returns because
        // buf is exhausted even though 2500 < 5000
        let _ = decoder.decode(&buf)?;
        // flush yields 2500 row batch and writes it
        while let Some(batch) = decoder.flush()? {
            writer.write(&batch)?;
        }
    }
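
The behaviour described in the comments above can be checked with a toy model. This is a minimal standard-library sketch, not the arrow-json API: `ToyDecoder` and `written_batch_sizes` are hypothetical names, and the model assumes the semantics stated in the linked `Decoder::decode` docs (decode returns once `batch_size` rows are buffered or the input is exhausted; flush returns whatever is buffered, or nothing when empty).

```rust
/// Hypothetical stand-in for a batching decoder; NOT the arrow-json API.
struct ToyDecoder {
    batch_size: usize,
    buffered: Vec<u64>,
}

impl ToyDecoder {
    fn new(batch_size: usize) -> Self {
        Self { batch_size, buffered: Vec::new() }
    }

    /// Buffers rows until `batch_size` rows are held or `rows` is exhausted.
    /// Returns how many rows were consumed.
    fn decode(&mut self, rows: &[u64]) -> usize {
        let room = self.batch_size - self.buffered.len();
        let take = room.min(rows.len());
        self.buffered.extend_from_slice(&rows[..take]);
        take
    }

    /// Returns the buffered rows as one batch, or `None` if nothing is buffered.
    fn flush(&mut self) -> Option<Vec<u64>> {
        if self.buffered.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buffered))
        }
    }
}

/// Feeds `actions` in chunks of `chunk_size`, flushing after every chunk,
/// and returns the size of each batch that would be written.
/// `chunk_size` must be non-zero.
fn written_batch_sizes(actions: &[u64], chunk_size: usize, batch_size: usize) -> Vec<usize> {
    let mut decoder = ToyDecoder::new(batch_size);
    let mut sizes = Vec::new();
    for chunk in actions.chunks(chunk_size) {
        let _ = decoder.decode(chunk);
        while let Some(batch) = decoder.flush() {
            sizes.push(batch.len());
        }
    }
    sizes
}
```

In this model, 6000 actions fed in chunks of 2500 with a batch size of 5000 produce batches of 2500, 2500, and 1000 rows: the flush after each chunk emits whatever was buffered, so no batch reaches the 5000-row limit.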

@hntd187
Collaborator

hntd187 commented Nov 11, 2024

I understand. I guess what I'm asking is: is there any reason a user might want to configure this independently of the batch size, as above? If we don't think that's ever going to happen, then we can forgo configuration.
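
If the chunk size were ever split out into its own constant, one detail would likely need care. Per the linked `Decoder::decode` docs, `decode` returns once `batch_size` objects have been parsed, so a chunk larger than the batch size would only be partially consumed by a single call. Below is a minimal standard-library sketch of a drain loop that handles this. `CHECKPOINT_DECODE_CHUNK_SIZE`, `ToyDecoder`, and `write_in_chunks` are hypothetical names for illustration, not part of this PR, and the toy decoder counts rows rather than parsing JSON bytes.

```rust
/// Value taken from the example earlier in this thread.
const CHECKPOINT_RECORD_BATCH_SIZE: usize = 5000;
/// Hypothetical separate knob; NOT part of this PR.
const CHECKPOINT_DECODE_CHUNK_SIZE: usize = 8000;

/// Hypothetical row-counting stand-in for a batching decoder.
struct ToyDecoder {
    batch_size: usize,
    buffered: usize,
}

impl ToyDecoder {
    /// Accepts up to `rows` rows, stopping when the batch is full.
    /// Returns the number of rows consumed.
    fn decode(&mut self, rows: usize) -> usize {
        let take = rows.min(self.batch_size - self.buffered);
        self.buffered += take;
        take
    }

    /// Returns the buffered row count as one batch, or `None` when empty.
    fn flush(&mut self) -> Option<usize> {
        if self.buffered == 0 {
            None
        } else {
            Some(std::mem::take(&mut self.buffered))
        }
    }
}

/// Feeds `total_rows` in chunks of `chunk_size` and returns the size of
/// every batch written, for any non-zero chunk and batch size.
fn write_in_chunks(total_rows: usize, chunk_size: usize, batch_size: usize) -> Vec<usize> {
    assert!(chunk_size > 0 && batch_size > 0, "sizes must be non-zero");
    let mut decoder = ToyDecoder { batch_size, buffered: 0 };
    let mut written = Vec::new();
    let mut remaining = total_rows;
    while remaining > 0 {
        let chunk = remaining.min(chunk_size);
        let mut offset = 0;
        // Keep decoding until the whole chunk is consumed, flushing whenever
        // the decoder stops early because its batch is full.
        while offset < chunk {
            offset += decoder.decode(chunk - offset);
            while let Some(batch) = decoder.flush() {
                written.push(batch);
            }
        }
        remaining -= chunk;
    }
    written
}
```

In this model, 20,000 rows with an 8000-row chunk and a 5000-row batch size are written as batches of 5000, 3000, 5000, 3000, and 4000 rows. With a single `decode` call per chunk and its return value ignored, the 3000 leftover rows of each full chunk would never reach the toy decoder, which is one argument for keeping the chunk size at or below the batch size unless a loop like this is added.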

@rtyler rtyler added this pull request to the merge queue Nov 13, 2024
Merged via the queue into delta-io:main with commit 95395cb Nov 13, 2024
22 checks passed