-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use bytes
in parquet rather than custom Buffer implementation (#1474)
#1683
Conversation
@@ -207,7 +207,7 @@ impl LevelDecoder { | |||
let num_bytes = | |||
ceil((num_buffered_values * bit_width as usize) as i64, 8); | |||
let data_size = cmp::min(num_bytes as usize, data.len()); | |||
decoder.reset(data.range(data.start(), data_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually a bug as data.range already takes into account the start offset
pub fn with_range(mut self, start: usize, len: usize) -> Self { | ||
self.set_range(start, len); | ||
self | ||
} | ||
|
||
/// Updates this buffer with new `start` position and length `len`. | ||
/// | ||
/// Range should be within current start position and length. | ||
#[inline] | ||
pub fn set_range(&mut self, start: usize, len: usize) { | ||
assert!(self.start <= start && start + len <= self.start + self.len); | ||
self.start = start; | ||
self.len = len; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These weren't actually being used anywhere, and they have somewhat strange semantics as they don't take into account the pre-existing offset.
// ---------------------------------------------------------------------- | ||
// Immutable Buffer (BufferPtr) classes | ||
|
||
/// An representation of a slice on a reference-counting and read-only byte array. | ||
/// Sub-slices can be further created from this. The byte array will be released | ||
/// when all slices are dropped. | ||
/// | ||
/// TODO: Remove and replace with [`bytes::Bytes`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intend to do this as a follow up PR, but wanted to keep the changes small-ish
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me from a code and API level. I like this change
I am a little concerned about the loss of MemoryTracker
API (as I have no idea who is using that API)
DataFusion does not seem to https://github.com/apache/arrow-datafusion/search?q=MemoryTracker
cc @nevi-me @sunchao @jhorstmann @bjchambers - it would also be great to ping anyone else who people may remember being interested in this area of parquet
It seems to have a non trivial number of dependents these days: https://crates.io/crates/parquet/reverse_dependencies
@@ -460,20 +460,20 @@ fn parse_v1_level( | |||
num_buffered_values: u32, | |||
encoding: Encoding, | |||
buf: ByteBufferPtr, | |||
) -> Result<ByteBufferPtr> { | |||
) -> Result<(usize, ByteBufferPtr)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend documenting in comments what this usize
is -- namely the number of bytes that was read?
}; | ||
|
||
// ---------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think highlighting the loss of MemTracker
in this change is probably important for anyone who is using it currently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MemTracker
is a half-baked thing and I'm fine with removing it. Eventually, though, I think it'd still be nice to have a way to track memory usage in the parquet path. Perhaps we can have some memory manager that can be passed in from outside when reading or writing Parquet, which is responsible for allocating & recycling byte buffers, but I'm not sure how it can integrate nicely with the Bytes
changes.
My reasoning for not worrying about removing Memtracker is I made the API experimental a while back, and there haven't been any complaints. I can keep it if people feel strongly, I'm just trying to reduce the amount of custom code we have 😅 |
I am all for deleting it, especially with some evidence that it isn't used 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I think we can make more changes based on this in future if we need to add a memory manager.
bytes
in parquet rather than custom Buffer implementation (#1474)
Which issue does this PR close?
Part of #1474.
Rationale for this change
See ticket, in particular I want to use this as part of #1605
What changes are included in this PR?
This replaces
Buffer
withVec
, it also makeByteBufferPtr
useBytes
internally. A follow up PR will replaceByteBufferPtr
withBytes
but I wanted to avoid making this PR too largeAre there any user-facing changes?
Technically this only makes changes to experimental APIs, but it probably constitutes a breaking change