-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Put BufWriter into TrackedWrite #3361
Conversation
let path = env::temp_dir().join("arrow_writer.temp"); | ||
let file = File::create(path).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to write batch to file.
@@ -45,16 +45,17 @@ use crate::schema::types::{ | |||
|
|||
/// A wrapper around a [`Write`] that keeps track of the number | |||
/// of bytes that have been written | |||
pub struct TrackedWrite<W> { | |||
inner: W, | |||
pub struct TrackedWrite<W: Write> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could also update the doc comment to mention the addition of a BufWriter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When writing to memory, does the additional BufWriter have an appreciable performance cost? That would be my only concern, otherwise this seems like a substantial win 👍
I also double-checked that large writes, such as when writing an entire page, will skip the buffer - https://doc.rust-lang.org/src/std/io/buffered/bufwriter.rs.html#364
Yea, on the benchmark of writing to in-memory writer, the additional BufWriter is slower. However I think file writer is preferred as this is for Parquet writer so I assume the practice usage is file-based instead of in-memory which is basically for test purpose. |
IOx writes to memory prior to flushing to object storage... How bad is the regression? |
For no-bloom filter ones, the regression is slightly less than 6.9%, but for bool types, it is 50~143% improvement. For bloom filter ones, 54% (string non-null with bloom filter) regression is worst one.
|
Updated: By implementing some functions of Still regression ~25%: string bloom filter, string dictionary with bloom filter, string non-null with bloom filter.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can live with a slight regression for writing to memory, I am still somewhat surprised by it as the page data should be large enough to skip the buffer, but perhaps the benchmarks are just unfortunately sized
Benchmark runs are scheduled for baseline = 2cf4abb and contender = e2abb4b. e2abb4b is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3366.
Rationale for this change
To improve writing performance.
What changes are included in this PR?
Are there any user-facing changes?