Improve Documentation of Parquet ChunkReader #4118
Comments
This is a consequence of #2464, which causes `ParquetRecordBatchReader` to read overlapping byte ranges. The reason for the overlapping byte ranges is that if the […].

Taking a step back, I wonder if you've considered using the async reader. Not only does it provide a native async interface, but the `AsyncFileReader` interface naturally lends itself to IO prefetching for an entire row group at a time. There is also out-of-the-box integration with `object_store`, which may be of interest.
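The prefetching idea mentioned above can be sketched without the parquet crate itself: an `AsyncFileReader`-style implementation can fetch a whole row group's byte range from the backing store once, then serve the reader's many (possibly overlapping) sub-range requests from that buffer. Everything below (`RowGroupCache`, the in-memory "file") is a hypothetical illustration of that idea, not the crate's actual code:

```rust
// Hypothetical sketch: one backing fetch for a whole row group serves
// many sub-range reads, including overlapping ones, without extra IO.
struct RowGroupCache {
    start: u64,     // file offset where the cached bytes begin
    data: Vec<u8>,  // prefetched bytes
    fetches: usize, // number of requests issued to the backing store
}

impl RowGroupCache {
    // Issue a single backing fetch covering [start, start + len).
    fn prefetch(file: &[u8], start: u64, len: usize) -> Self {
        let s = start as usize;
        RowGroupCache { start, data: file[s..s + len].to_vec(), fetches: 1 }
    }

    // Serve any sub-range of the prefetched region from memory.
    fn read(&self, offset: u64, len: usize) -> &[u8] {
        let rel = (offset - self.start) as usize;
        &self.data[rel..rel + len]
    }
}

fn main() {
    let file: Vec<u8> = (0..=255).collect();
    // One fetch for the whole "row group"...
    let cache = RowGroupCache::prefetch(&file, 16, 64);
    // ...absorbs overlapping column-chunk reads like those in the report.
    assert_eq!(cache.read(16, 4), &[16, 17, 18, 19]);
    assert_eq!(cache.read(18, 4), &[18, 19, 20, 21]); // overlaps previous read
    println!("backing fetches: {}", cache.fetches);
}
```

With this shape, overlapping requests from the decoder cost nothing extra: the store sees exactly one request per row group.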
Hi @tustvold, thanks for the explanation! Yes, we considered the async reader and object_store, and this is now a compelling enough reason to prioritize working on it :) The API looks simple enough; it should be easy to integrate. We'll have to extend […]. PS: happy to see how fast this project is developing, awesome job, guys!
Describe the bug

Not sure that it's a bug, but it seems that `arrow-rs` version 37 performs more read operations on Parquet files than version 19 (which we have been using so far). Some of the byte ranges seem to be overlapping (see the output below). For context, we use a custom implementation of `ChunkReader` with `ParquetRecordBatchReader` (and with `SerializedFileReader` in v19) to access S3 storage. Here's a reduced implementation: […]

(I added `println!("S3Request::get_read(): {}, {}", start, end);` to track each read operation.)

In the output we get 8 read operations (v37): […]

While with the same implementation we only get 4 read operations using `SerializedFileReader` and `ParquetFileArrowReader` (in v19): […]

Was that an intended change?