Use `bytes` in parquet rather than custom Buffer implementation (#1474) #1683

tustvold · 2022-05-09T17:25:46Z

Which issue does this PR close?

Part of #1474.

Rationale for this change

See the ticket. In particular, I want to use this as part of #1605.

What changes are included in this PR?

This replaces Buffer with Vec and makes ByteBufferPtr use Bytes internally. A follow-up PR will replace ByteBufferPtr with Bytes, but I wanted to avoid making this PR too large.

Are there any user-facing changes?

Technically, this only changes experimental APIs, but it probably still constitutes a breaking change.

tustvold · 2022-05-09T17:26:17Z

parquet/src/encodings/levels.rs

@@ -207,7 +207,7 @@ impl LevelDecoder {
                let num_bytes =
                    ceil((num_buffered_values * bit_width as usize) as i64, 8);
                let data_size = cmp::min(num_bytes as usize, data.len());
-                decoder.reset(data.range(data.start(), data_size));


This was actually a bug, as `data.range` already takes the start offset into account.
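To illustrate the double-counting, here is a minimal sketch using a hypothetical `SlicePtr` type (a stand-in written for this example, not the actual `ByteBufferPtr`). It assumes that `range` interprets `start` relative to the current slice, while `start()` reports the absolute offset into the backing allocation:

```rust
use std::sync::Arc;

/// Hypothetical reference-counted byte slice, written only for this example.
#[derive(Clone)]
struct SlicePtr {
    data: Arc<Vec<u8>>,
    start: usize,
    len: usize,
}

impl SlicePtr {
    fn new(v: Vec<u8>) -> Self {
        let len = v.len();
        Self { data: Arc::new(v), start: 0, len }
    }

    /// Absolute offset of this slice within the backing allocation.
    fn start(&self) -> usize {
        self.start
    }

    /// Sub-slice of this slice. `start` is relative to the current slice,
    /// so the existing offset is added here, not by the caller.
    fn range(&self, start: usize, len: usize) -> Self {
        assert!(start + len <= self.len);
        Self {
            data: Arc::clone(&self.data),
            start: self.start + start,
            len,
        }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.start..self.start + self.len]
    }
}

/// Returns the bytes seen with a relative start (intended) and with the
/// absolute offset passed as the start (the buggy pattern).
fn relative_vs_absolute() -> (Vec<u8>, Vec<u8>) {
    let buf = SlicePtr::new((0u8..10).collect());
    let data = buf.range(2, 6); // covers bytes 2..8 of the allocation

    let good = data.range(0, 3); // begins at the slice's own first byte
    let bad = data.range(data.start(), 3); // offset is applied twice

    (good.as_slice().to_vec(), bad.as_slice().to_vec())
}
```

In this sketch, passing `start()` as the range start skips extra bytes whenever the slice has a non-zero offset, and behaves correctly only when the offset happens to be 0.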

tustvold · 2022-05-09T17:26:51Z

parquet/src/util/memory.rs

-    pub fn with_range(mut self, start: usize, len: usize) -> Self {
-        self.set_range(start, len);
-        self
-    }
-
-    /// Updates this buffer with new `start` position and length `len`.
-    ///
-    /// Range should be within current start position and length.
-    #[inline]
-    pub fn set_range(&mut self, start: usize, len: usize) {
-        assert!(self.start <= start && start + len <= self.start + self.len);
-        self.start = start;
-        self.len = len;
-    }


These weren't actually being used anywhere, and they have somewhat strange semantics as they don't take into account the pre-existing offset.

tustvold · 2022-05-09T17:27:36Z

parquet/src/util/memory.rs

 // ----------------------------------------------------------------------
 // Immutable Buffer (BufferPtr) classes

 /// An representation of a slice on a reference-counting and read-only byte array.
 /// Sub-slices can be further created from this. The byte array will be released
 /// when all slices are dropped.
+///
+/// TODO: Remove and replace with [`bytes::Bytes`]


I intend to do this as a follow-up PR, but wanted to keep the changes small-ish.

alamb

Looks good to me from a code and API level. I like this change

I am a little concerned about the loss of MemoryTracker API (as I have no idea who is using that API)

DataFusion does not seem to use it: https://github.com/apache/arrow-datafusion/search?q=MemoryTracker

cc @nevi-me @sunchao @jhorstmann @bjchambers - it would also be great to ping anyone else you remember being interested in this area of parquet

The parquet crate seems to have a non-trivial number of dependents these days: https://crates.io/crates/parquet/reverse_dependencies

alamb · 2022-05-10T20:04:03Z

parquet/src/column/reader.rs

@@ -460,20 +460,20 @@ fn parse_v1_level(
    num_buffered_values: u32,
    encoding: Encoding,
    buf: ByteBufferPtr,
-) -> Result<ByteBufferPtr> {
+) -> Result<(usize, ByteBufferPtr)> {


I recommend documenting in comments what this usize is -- namely, the number of bytes that were read?
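As a rough illustration of what such documentation could look like, here is a sketch of a hypothetical helper, `split_rle_levels`, that is not the code in this PR. It works on a plain `&[u8]` rather than `ByteBufferPtr` and assumes the RLE case of a v1 data page, where the level data is preceded by a 4-byte little-endian length:

```rust
/// Splits RLE-encoded v1 level data off the front of `buf`.
///
/// Returns `(bytes_read, level_data)`:
/// - `bytes_read` is the total number of bytes consumed from `buf`,
///   i.e. the 4-byte length prefix plus the level payload, so the caller
///   knows where the next section starts.
/// - `level_data` is the payload only, without the length prefix.
///
/// Returns `None` if `buf` is too short for the prefix or the payload.
fn split_rle_levels(buf: &[u8]) -> Option<(usize, &[u8])> {
    // Size of the little-endian length prefix (assumption stated above).
    const PREFIX: usize = 4;

    let prefix = buf.get(..PREFIX)?;
    let data_len =
        u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;

    let end = PREFIX.checked_add(data_len)?;
    let data = buf.get(PREFIX..end)?;

    Some((end, data))
}
```

Spelling out both tuple elements in the doc comment makes it clear that the count includes the prefix, which is the kind of detail a bare `usize` does not convey.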

alamb · 2022-05-10T20:05:46Z

parquet/src/util/memory.rs

 };

-// ----------------------------------------------------------------------


I think highlighting the loss of MemTracker in this change is probably important for anyone who is using it currently

sunchao

The MemTracker is a half-baked thing and I'm fine with removing it. Eventually, though, I think it'd still be nice to have a way to track memory usage in the parquet path. Perhaps we can have some memory manager that can be passed in from outside when reading or writing Parquet, which is responsible for allocating & recycling byte buffers, but I'm not sure how it can integrate nicely with the Bytes changes.

tustvold · 2022-05-10T21:01:37Z

My reasoning for not worrying about removing MemTracker is that I made the API experimental a while back, and there haven't been any complaints.

I can keep it if people feel strongly; I'm just trying to reduce the amount of custom code we have 😅

alamb · 2022-05-11T10:08:11Z

> My reasoning for not worrying about removing MemTracker is that I made the API experimental a while back, and there haven't been any complaints.

I am all for deleting it, especially with some evidence that it isn't used 👍

sunchao

+1. I think we can make more changes based on this in the future if we need to add a memory manager.

Use bytes in parquet (apache#1474)

a7cf5a0

tustvold added the api-change Changes to the arrow API label May 9, 2022

github-actions bot added the parquet Changes to the parquet crate label May 9, 2022

alamb changed the title ~~Use bytes in parquet (#1474)~~ Use bytes in parquet rather than custom Buffer implementation (#1474) May 10, 2022

alamb approved these changes May 10, 2022

sunchao reviewed May 10, 2022

sunchao approved these changes May 11, 2022

tustvold merged commit b9a41f3 into apache:master May 11, 2022

alamb changed the title ~~Use bytes in parquet rather than custom Buffer implementation (#1474)~~ Use bytes in parquet rather than custom Buffer implementation (#1474) May 12, 2022

tustvold mentioned this pull request May 22, 2022

Fix parquet benchmarks #1723

Merged
