Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Documentation of Parquet ChunkReader #4118

Closed
zilder opened this issue Apr 24, 2023 · 2 comments · Fixed by #4156
Closed

Improve Documentation of Parquet ChunkReader #4118

zilder opened this issue Apr 24, 2023 · 2 comments · Fixed by #4156
Labels
documentation Improvements or additions to documentation parquet Changes to the parquet crate

Comments

@zilder
Copy link

zilder commented Apr 24, 2023

Describe the bug
Not sure that it's a bug, but it seems that arrow-rs version 37 performs more read operations from parquet files compared to version 19 (which we have been using so far). Some of the byte ranges seem to be overlapping (see the output below). For the context we use a custom implementation of ChunkReader with ParquetRecordBatchReader (and with SerializedFileReader in v19) to access S3 storage. Here's a reduced implementation:

pub struct S3Request {
    client: Client,
    bucket: String,
    key: String,
    len: u64,
    rt: Runtime,
}

impl ChunkReader for S3Request {
    type T = ByteBuf;

    fn get_read(
        &self,
        start: u64,
        length: usize,
    ) -> Result<Self::T, parquet::errors::ParquetError> {
        let end = start + length as u64 - 1;
        println!("S3Request::get_read(): {}, {}", start, end);

        let data = self
            .rt
            .block_on(async {
                let resp = match self
                    .client
                    .get_object()
                    .bucket(&self.bucket)
                    .key(&self.key)
                    .range(format!("bytes={}-{}", start, end))
                    .send()
                    .await
                {
                    Ok(r) => r,
                    Err(e) => {
                        panic!("{}", e);
                    },
                };

                resp.body.collect().await
            })
            .unwrap();

        Ok(ByteBuf(data))
    }
}

(I added println!("S3Request::get_read(): {}, {}", start, end); to track each read operations)

In the output we get 8 read operations (v37):

S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 18, 121
S3Request::get_read(): 43, 121
S3Request::get_read(): 214, 331
S3Request::get_read(): 228, 331
S3Request::get_read(): 253, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+

While with the same implementation we only get 4 read operations using SerializedFileReader and ParquetFileArrowReader (in v19):

S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 214, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+

Was that an intended change?

@zilder zilder added the bug label Apr 24, 2023
@tustvold
Copy link
Contributor

tustvold commented Apr 24, 2023

This is a consequence of #2464 which causes ChunkReader::get_read to be called per page, instead of per row group. This change was made to enable page-level predicate push down. We should definitely improve the documentation around ChunkReader, and its implicit assumptions regarding buffering at the application and/or OS level. I will add this to my list.

The reason for the overlapping byte ranges, is that if the OffsetIndex isn't read, the reader doesn't know where the pages are located or even how many there are, only the end position of the column chunk. It therefore has to assume a given page may run to the end of the range. If you enable reading the PageIndex it shouldn't perform overlapping reads (although it will now need to perform IO to read the page index).

Taking a step back I wonder if you've considered using the async_reader. Not only does this provide a native async interface, but the AsyncFileReader interface naturally lends itself to IO pre-fetching for an entire row group at a time.

There is also out of the box integration with object_store which may be of interest

@tustvold tustvold changed the title ParquetRecordBatchReader reads overlapping byte ranges Improve Documentation of Parquet ChunkReader Apr 24, 2023
@tustvold tustvold added documentation Improvements or additions to documentation and removed bug labels Apr 24, 2023
@zilder
Copy link
Author

zilder commented Apr 25, 2023

Hi @tustvold, thanks for the explanation! Yes, we considered async reader and object_store, and now this is compelling enough reason to prioritize working on it : ) The API looks simple enough, should be easy to integrate. We'll have to extend ParquetObjectReader a little to collect some statistics (# of requests, amount of data) for internal use, but otherwise looks good.

PS: happy to see how fast this project is developing, awesome job, guys!

tustvold added a commit to tustvold/arrow-rs that referenced this issue Apr 27, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Apr 28, 2023
tustvold added a commit that referenced this issue May 2, 2023
* Remove length from ChunkReader (#4118)

* Remove ChunkReader::T Send bound

* Remove FileSource

* Tweak docs
@tustvold tustvold added the parquet Changes to the parquet crate label May 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants