[R] Platform-dependent hashes of parquet files? #40202
Comments
Could you try using parquet-tools or parquet cli to inspect the different files and see if there are any differences? (If you can, posting the output here for each file would be helpful.) I suspect there are differences due to compression, or differences between default layouts, that would cause different hashes for files like these. |
Got identical results for the three, other than a difference in the space saved value.
Mac
Linux
Windows
|
Thanks for the help here, @emmamendelsohn. Could you zip up all three Parquet files and attach them here? |
I managed to reproduce getting different checksums for files written using macOS and Linux and am attaching them here in case anyone wants to take a look: mtcars-parquet.zip. Both were written with When I run parquet-tools inspect on each file with --detail, I get two differences in output. The first is some unlabeled number that's either 262658 or 262914 (diff of 256 which is a bit conspicuous) depending on the file and the second difference is in the KeyValue metadata for the |
Here are the three files for the compressed example ( |
I am not surprised by differences in the compressed output: it depends on the exact version of the compression library (Snappy), which in turn depends on the platform and the Arrow version. |
Ok, the uncompressed difference is in the R-specific metadata that's stored with Arrow tables. Either @nealrichardson @jonkeane or @paleolimbot would probably be able to explain what it's about, and why it may vary from platform to platform. |
And, yeah, the format of the "r" metadata is very similar to the example shown in http://richfitz.github.io/redux/reference/object_to_string.html
Under PyArrow:
>>> import pyarrow.parquet as pq
>>> a = pq.read_table("/home/antoine/arrow/data/mtcars-linux-uncompressed.parquet")
>>> b = pq.read_table("/home/antoine/arrow/data/mtcars-macos-uncompressed.parquet")
>>> a.schema.metadata
{b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> b.schema.metadata
{b'r': b'A\n3\n262914\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> a.schema.metadata == b.schema.metadata
False |
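The two "r" values above differ only in their third newline-separated field (262658 vs 262914). In R's serialization header that field is the version of R that wrote the object, packed as major * 65536 + minor * 256 + patch, which is why the gap is exactly 256: one minor version. A minimal sketch of locating and decoding the field follows; the helper names are hypothetical, and the byte strings are only the leading fields of the values shown above.

```python
def decode_r_version(packed: int) -> tuple[int, int, int]:
    """Unpack R's integer version encoding: major * 65536 + minor * 256 + patch."""
    major, rest = divmod(packed, 65536)
    minor, patch = divmod(rest, 256)
    return major, minor, patch


def differing_fields(a: bytes, b: bytes) -> list[tuple[int, bytes, bytes]]:
    """Return (index, field_a, field_b) for each newline-separated field that differs.

    Fields are compared pairwise; if one value has more fields, the extras are ignored.
    """
    fields_a = a.split(b"\n")
    fields_b = b.split(b"\n")
    return [(i, x, y) for i, (x, y) in enumerate(zip(fields_a, fields_b)) if x != y]


# Leading fields of the two "r" metadata values shown above (truncated for brevity).
linux_header = b"A\n3\n262658\n197888\n5\nUTF-8\n"
macos_header = b"A\n3\n262914\n197888\n5\nUTF-8\n"

for index, left, right in differing_fields(linux_header, macos_header):
    # Prints: 2 (4, 2, 2) (4, 3, 2)
    print(index, decode_r_version(int(left)), decode_r_version(int(right)))
```

Under this reading, the Linux file was written by R 4.2.2 and the macOS file by R 4.3.2. The fourth field, 197888, decodes to 3.5.0, the minimum R version able to read the format.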
By the way, |
All files from my example with R 4.3.2. |
@emmamendelsohn Ah, I was talking about the uncompressed example from @amoeba . As I said above, differences in compressed files should not be a surprise. Do you still see differences if you generate uncompressed files? |
I see. Yes for uncompressed we found Linux and Windows had the same hash, while macOS was different, all on 4.3.2. Let me know if you'd like me to share those files. |
Thank you! Yes, you can share the Linux and macOS files for example. (I suspect the final reason will be similar: slightly different R metadata serialized, for which I'll let R-Arrow experts answer :-)) |
Thanks for looking at this @pitrou, the R version and metadata causing the issue makes sense. I'll look into what we're doing in that regard next. |
Actually, I was mistaken, all three systems have different hashes when uncompressed. This matches @amoeba's example above. |
Thanks @emmamendelsohn . After taking a quick look:
Is there a particular reason you were wondering about these files being different? |
This is an interesting flatbuffers commit message as we do have a similar piece of code. And binary inspection of the serialized Flatbuffers metadata seems to match this interpretation. |
@pitrou the different hashes became an issue for our team using a collaborative R Anyway, we're rethinking some aspects of this approach, and so this may not be relevant in the future. Appreciate you looking into it nonetheless! |
Yes, I think you should probably reconsider, because it is not realistic to expect a sophisticated compression-based format like Parquet to always generate the same bitwise data using slightly different producers. |
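One alternative to hashing file bytes is to fingerprint the logical content instead. The sketch below is only an illustration under stated assumptions: it assumes the table has already been read into plain Python values (for example by a Parquet reader), and `content_fingerprint` is a hypothetical helper, not part of Arrow. Sorting the keys means column order is deliberately ignored.

```python
import hashlib
import json


def content_fingerprint(columns):
    """Hash column names and values in a canonical form, ignoring file layout.

    `columns` maps column name -> sequence of JSON-serializable values.
    Compression, encoding, and embedded metadata never enter the hash.
    """
    canonical = json.dumps(
        {name: list(values) for name, values in columns.items()},
        sort_keys=True,          # column order does not affect the result
        separators=(",", ":"),   # fixed separators, no incidental whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical data standing in for a table read on two different platforms.
table_on_linux = {"mpg": [21.0, 22.8], "cyl": [6, 4]}
table_on_macos = {"cyl": [6, 4], "mpg": [21.0, 22.8]}

print(content_fingerprint(table_on_linux) == content_fingerprint(table_on_macos))
```

Two files that hold the same values then compare equal even when their bytes differ because of the compression library or the writer's metadata.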
Makes sense! |
Would @nealrichardson @jonkeane or @paleolimbot be able to explain the R-specific metadata that is generated, and maybe point to the code in the package where this occurs? From a quick inspection it looks like a summary of the data frame schema in R's ASCII serialization format. |
@noamross it looks like we do that here Lines 19 to 33 in 9ca7d78
(calling into |
@noamross IIRC the purpose of this is so that object attributes, including R class names, are preserved, so that you can round-trip the data to parquet or arrow files and get the same R types back. If you had a bare data.frame and only vanilla R vector types, I would expect the metadata to be empty. |
…0392)
### Rationale for this change
This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.
### What changes are included in this PR?
So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment.
### Are these changes tested?
Not yet, this is a draft PR
### Are there any user-facing changes?
No
* GitHub Issue: #40361
Lead-authored-by: Noam Ross <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.
Moving this from ROpenSci slack. Our team has Mac, Linux, and Windows users, and we have found that we get three different hashes when saving parquet files.
Mac "05be83226acb5d2a673d922ff9f69414"
Linux "8bddf47bdbede54d87ec3c4cbec280da"
Windows "bef251d299843f07348248416572edab"
When uncompressed, we get the same hashes for Linux and Windows, different for Mac.
Mac "58ec2e7a6d614db15fc2123455a83a7e"
Linux "4f3f049ffebdb395c489864e90d5e36b"
Windows "4f3f049ffebdb395c489864e90d5e36b"
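The hashes above are 32 hexadecimal digits, which is consistent with MD5, though the report does not say which tool produced them. Assuming MD5, a minimal sketch of hashing a set of files and checking whether they are byte-identical could look like this; the function names and file paths are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_hashes(paths):
    """Map each path to its MD5 and report whether all files are byte-identical."""
    hashes = {str(path): file_md5(path) for path in paths}
    return hashes, len(set(hashes.values())) <= 1
```

For example, `compare_hashes(["mac.parquet", "linux.parquet", "windows.parquet"])` (hypothetical file names) would return the three digests and `False` for the files described here.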
arrow_info() for our three systems:
Mac
Linux
Windows
Component(s)
Parquet, R