Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use bytes in parquet rather than custom Buffer implementation (#1474) #1683

Merged
merged 1 commit into from
May 11, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented May 9, 2022

Which issue does this PR close?

Part of #1474.

Rationale for this change

See ticket, in particular I want to use this as part of #1605

What changes are included in this PR?

This replaces Buffer with Vec, it also make ByteBufferPtr use Bytes internally. A follow up PR will replace ByteBufferPtr with Bytes but I wanted to avoid making this PR too large

Are there any user-facing changes?

Technically this only makes changes to experimental APIs, but it probably constitutes a breaking change

@tustvold tustvold added the api-change Changes to the arrow API label May 9, 2022
@github-actions github-actions bot added the parquet Changes to the parquet crate label May 9, 2022
@@ -207,7 +207,7 @@ impl LevelDecoder {
let num_bytes =
ceil((num_buffered_values * bit_width as usize) as i64, 8);
let data_size = cmp::min(num_bytes as usize, data.len());
decoder.reset(data.range(data.start(), data_size));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually a bug as data.range already takes into account the start offset

Comment on lines -304 to -317
pub fn with_range(mut self, start: usize, len: usize) -> Self {
self.set_range(start, len);
self
}

/// Updates this buffer with new `start` position and length `len`.
///
/// Range should be within current start position and length.
#[inline]
pub fn set_range(&mut self, start: usize, len: usize) {
assert!(self.start <= start && start + len <= self.start + self.len);
self.start = start;
self.len = len;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These weren't actually being used anywhere, and they have somewhat strange semantics as they don't take into account the pre-existing offset.

// ----------------------------------------------------------------------
// Immutable Buffer (BufferPtr) classes

/// An representation of a slice on a reference-counting and read-only byte array.
/// Sub-slices can be further created from this. The byte array will be released
/// when all slices are dropped.
///
/// TODO: Remove and replace with [`bytes::Bytes`]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intend to do this as a follow up PR, but wanted to keep the changes small-ish

@alamb alamb changed the title Use bytes in parquet (#1474) Use bytes in parquet rather than custom Buffer implementation (#1474) May 10, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me from a code and API level. I like this change

I am a little concerned about the loss of MemoryTracker API (as I have no idea who is using that API)

DataFusion does not seem to https://github.com/apache/arrow-datafusion/search?q=MemoryTracker

cc @nevi-me @sunchao @jhorstmann @bjchambers - it would also be great to ping anyone else who people may remember being interested in this area of parquet

It seems to have a non trivial number of dependents these days: https://crates.io/crates/parquet/reverse_dependencies

@@ -460,20 +460,20 @@ fn parse_v1_level(
num_buffered_values: u32,
encoding: Encoding,
buf: ByteBufferPtr,
) -> Result<ByteBufferPtr> {
) -> Result<(usize, ByteBufferPtr)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend documenting in comments what this usize is -- namely the number of bytes that was read?

};

// ----------------------------------------------------------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think highlighting the loss of MemTracker in this change is probably important for anyone who is using it currently

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MemTracker is a half-baked thing and I'm fine with removing it. Eventually, though, I think it'd still be nice to have a way to track memory usage in the parquet path. Perhaps we can have some memory manager that can be passed in from outside when reading or writing Parquet, which is responsible for allocating & recycling byte buffers, but I'm not sure how it can integrate nicely with the Bytes changes.

@tustvold
Copy link
Contributor Author

My reasoning for not worrying about removing Memtracker is I made the API experimental a while back, and there haven't been any complaints.

I can keep it if people feel strongly, I'm just trying to reduce the amount of custom code we have 😅

@alamb
Copy link
Contributor

alamb commented May 11, 2022

My reasoning for not worrying about removing Memtracker is I made the API experimental a while back, and there haven't been any complaints.

I am all for deleting it, especially with some evidence that it isn't used 👍

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I think we can make more changes based on this in future if we need to add a memory manager.

@tustvold tustvold merged commit b9a41f3 into apache:master May 11, 2022
@alamb alamb changed the title Use bytes in parquet rather than custom Buffer implementation (#1474) Use bytes in parquet rather than custom Buffer implementation (#1474) May 12, 2022
@tustvold tustvold mentioned this pull request May 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants