-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sort filenames when reading parquet to ensure consistent schema #6629
Conversation
a5e94a7
to
51ca29f
Compare
Thanks @thomas-k-cameron -- given my reading of #5779 it seems like perhaps a more direct solution might be to sort the list of files so they are always processed in the same order Perhaps somehow figure out how to sort the let schemas: Vec<_> = futures::stream::iter(objects)
.map(|object| fetch_schema(store.as_ref(), object, self.metadata_size_hint))
.boxed() // Workaround https://github.com/rust-lang/rust/issues/64552
.buffered(SCHEMA_INFERENCE_CONCURRENCY)
.try_collect()
.await?; |
Thank you for your advice. I couldn't figure out what I was getting it wrong. |
51ca29f
to
5e0f653
Compare
1cf246b
to
2d0f9bb
Compare
Let me give you an update. I tried to sort by location, but it hasn't been working. It seems like this is coming from how parquet files for tests are created with random names. I have few other ideas. I will see if it works out. |
938773b
to
ba6ab9e
Compare
da1c922
to
ac00cfa
Compare
cdba47b
to
03690ab
Compare
@alamb I updated the PR comment to reflect my changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @thomas-k-cameron -- I think this looks great. The only potential downside of this change I can think of is that this will change behavior for any object store that lists files sorted deterministicially in a different order than sort()
However, I think most object stores sort their locations lexographically so this PR is likely to simply make local filesystems consistent.
Thanks again
Thank you very much for your feedback and approval. Change that I just pushed simply applies a |
Thanks @thomas-k-cameron -- once the check pass I'll plan to merge this PR. |
🚀 |
…he#6629) * update * FIXed * add parquet * update * update * update * try2 * update * update * Add comments * cargo fmt --------- Co-authored-by: Andrew Lamb <[email protected]>
…he#6629) * update * FIXed * add parquet * update * update * update * try2 * update * update * Add comments * cargo fmt --------- Co-authored-by: Andrew Lamb <[email protected]>
…he#6629) * update * FIXed * add parquet * update * update * update * try2 * update * update * Add comments * cargo fmt --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #5779.
Rationale for this change
Sorting is passed to ensure that files returned is in the same order when the OS is different.
What changes are included in this PR?
This PR adds a lines of code to
FileFormat::infer_schema
toParquetFormat
; It sorts parquet file by it's location.On
test_util::store_parquet
is changed to sort temp file by it's path and and write batches in the sorted order.This change was introduced as it made the some tests, such as
datasource::file_format::parquet::tests::read_merged_batches
to fail.It appears that these tests assumed that
infer_schema
processed the files in the same order thatstore_parquet
's returning value files.Are these changes tested?
Yes. This PR introduces new test case
async fn issue_5779()
.Are there any user-facing changes?
The order of
Fields
may change.