[R] Platform-dependent hashes of parquet files? #40202

Open
emmamendelsohn opened this issue Feb 22, 2024 · 23 comments

Comments

@emmamendelsohn

emmamendelsohn commented Feb 22, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Moving this from ROpenSci slack. Our team has Mac, Linux, and Windows users, and we have found that we get three different hashes when saving parquet files.

arrow::write_parquet(mtcars, "mtcars.parquet")
digest::digest("mtcars.parquet", file = TRUE)

Mac "05be83226acb5d2a673d922ff9f69414"
Linux "8bddf47bdbede54d87ec3c4cbec280da"
Windows "bef251d299843f07348248416572edab"

When uncompressed, we get the same hashes for Linux and Windows, different for Mac.

arrow::write_parquet(mtcars, "mtcars.parquet", compression = "uncompressed" )
digest::digest("mtcars.parquet", file = TRUE)

Mac "58ec2e7a6d614db15fc2123455a83a7e"
Linux "4f3f049ffebdb395c489864e90d5e36b"
Windows "4f3f049ffebdb395c489864e90d5e36b"
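To check these hashes outside R and see where two files start to diverge, a small script along these lines could help. It is a minimal sketch: it assumes `digest::digest(file = TRUE)` is using its default MD5 algorithm (consistent with the 32-character hashes above), and the helper names `file_md5` and `first_difference` are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's raw bytes, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One file is a strict prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

An offset near the end of the file would point at the footer metadata rather than the column data.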

arrow_info() for our three systems:

Mac
Arrow package version: 14.0.0.2

Capabilities:
               
acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
                  
Allocator mimalloc
Current    0 bytes
Max       50.62 Kb

Runtime:
                        
SIMD Level          none
Detected SIMD Level none

Build:
                                                             
C++ Library Version                                    14.0.0
C++ Compiler                                       AppleClang
C++ Compiler Version                          15.0.0.15000040
Git ID               2dcee3f82c6cf54b53a64729fd81840efa583244
Linux
Arrow package version: 14.0.0.2

Capabilities:
               
acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
                  
Allocator jemalloc
Current    0 bytes
Max        0 bytes

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                           
C++ Library Version  14.0.0
C++ Compiler            GNU
C++ Compiler Version 11.4.0
Windows
Arrow package version: 14.0.0.2

Capabilities:
               
acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
                       
arrow.use_threads FALSE

Memory:
                  
Allocator mimalloc
Current    0 bytes
Max        0 bytes

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                                             
C++ Library Version                                    14.0.0
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               2dcee3f82c6cf54b53a64729fd81840efa583244

Component(s)

Parquet, R

@jonkeane
Member

Could you try using parquet-tools or the parquet CLI to inspect the different files and see if there are any differences? If you can, posting the output here for each would be helpful.

I suspect there are differences due to compression or differences between default layouts that would cause different hashes to files like these.

@emmamendelsohn
Author

Got identical results for the three, other than a difference in the space_saved value for one column.

Mac
############ file meta data ############
created_by: parquet-cpp-arrow version 14.0.0
num_columns: 11
num_rows: 32
num_row_groups: 1
format_version: 2.6
serialized_size: 2823


############ Columns ############
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb

############ Column(mpg) ############
name: mpg
path: mpg
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 22%)

############ Column(cyl) ############
name: cyl
path: cyl
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(disp) ############
name: disp
path: disp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(hp) ############
name: hp
path: hp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(drat) ############
name: drat
path: drat
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 9%)

############ Column(wt) ############
name: wt
path: wt
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(qsec) ############
name: qsec
path: qsec
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(vs) ############
name: vs
path: vs
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(am) ############
name: am
path: am
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(gear) ############
name: gear
path: gear
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(carb) ############
name: carb
path: carb
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 8%)
Linux
############ file meta data ############
created_by: parquet-cpp-arrow version 14.0.0
num_columns: 11
num_rows: 32
num_row_groups: 1
format_version: 2.6
serialized_size: 2823


############ Columns ############
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb

############ Column(mpg) ############
name: mpg
path: mpg
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 22%)

############ Column(cyl) ############
name: cyl
path: cyl
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(disp) ############
name: disp
path: disp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(hp) ############
name: hp
path: hp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(drat) ############
name: drat
path: drat
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 9%)

############ Column(wt) ############
name: wt
path: wt
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(qsec) ############
name: qsec
path: qsec
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 13%)

############ Column(vs) ############
name: vs
path: vs
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(am) ############
name: am
path: am
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(gear) ############
name: gear
path: gear
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(carb) ############
name: carb
path: carb
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 8%)
Windows
############ file meta data ############
created_by: parquet-cpp-arrow version 14.0.0
num_columns: 11
num_rows: 32
num_row_groups: 1
format_version: 2.6
serialized_size: 2823


############ Columns ############
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb

############ Column(mpg) ############
name: mpg
path: mpg
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 22%)

############ Column(cyl) ############
name: cyl
path: cyl
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(disp) ############
name: disp
path: disp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(hp) ############
name: hp
path: hp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(drat) ############
name: drat
path: drat
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 9%)

############ Column(wt) ############
name: wt
path: wt
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(qsec) ############
name: qsec
path: qsec
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 13%)

############ Column(vs) ############
name: vs
path: vs
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(am) ############
name: am
path: am
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(gear) ############
name: gear
path: gear
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(carb) ############
name: carb
path: carb
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 8%)

@amoeba
Member

amoeba commented Feb 22, 2024

Thanks for the help here, @emmamendelsohn. Could you zip up all three Parquet files and attach them here?

@kou changed the title from "Platform-dependent hashes of parquet files?" to "[R] Platform-dependent hashes of parquet files?" on Feb 22, 2024
@amoeba
Member

amoeba commented Feb 22, 2024

I managed to reproduce getting different checksums for files written using macOS and Linux and am attaching them here in case anyone wants to take a look: mtcars-parquet.zip. Both were written with arrow::write_parquet(mtcars, "mtcars.parquet", compression = "uncompressed") using arrow R 14.0.0.2.

When I run parquet-tools inspect on each file with --detail, I get two differences in output. The first is some unlabeled number that's either 262658 or 262914 (diff of 256 which is a bit conspicuous) depending on the file and the second difference is in the KeyValue metadata for the ARROW:schema key. I wonder if the two differences are related.

@emmamendelsohn
Author

emmamendelsohn commented Feb 23, 2024

Here are the three files for the compressed example (arrow::write_parquet(mtcars, "mtcars.parquet")). With --detail I see there are differences in file and page offsets.

snappy-mtcars-parquet.zip

@pitrou
Member

pitrou commented Mar 4, 2024

I am not surprised by difference in compression depending on the exact version of the compression library (Snappy), which also depends on the platform and the Arrow version numbers.

@pitrou
Member

pitrou commented Mar 4, 2024

Ok, the uncompressed difference is in the R-specific metadata that's stored with Arrow tables. Either @nealrichardson @jonkeane or @paleolimbot would probably be able to explain what it's about, and why it may vary from platform to platform.

@pitrou
Member

pitrou commented Mar 4, 2024

And, yeah, the format of the "r" metadata is very similar to the example shown in http://richfitz.github.io/redux/reference/object_to_string.html

Under PyArrow:

>>> a = pq.read_table("/home/antoine/arrow/data/mtcars-linux-uncompressed.parquet")
>>> b = pq.read_table("/home/antoine/arrow/data/mtcars-macos-uncompressed.parquet")
>>> a.schema.metadata
{b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> b.schema.metadata
{b'r': b'A\n3\n262914\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> a.schema.metadata == b.schema.metadata
False
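To pin down exactly where the two payloads disagree, one could split them on newlines and compare field by field. This is a minimal sketch using only the first few fields of the two values shown above; `differing_fields` is a hypothetical helper name, and it only compares fields up to the length of the shorter payload.

```python
def differing_fields(meta_a: bytes, meta_b: bytes):
    """Return (index, value_a, value_b) for each newline-delimited field that differs."""
    fields_a = meta_a.split(b"\n")
    fields_b = meta_b.split(b"\n")
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip(fields_a, fields_b))
        if x != y
    ]


# Leading fields copied from the two schema.metadata values above.
linux = b"A\n3\n262658\n197888\n5\nUTF-8\n531\n"
macos = b"A\n3\n262914\n197888\n5\nUTF-8\n531\n"

print(differing_fields(linux, macos))  # [(2, b'262658', b'262914')]
```

On these prefixes only the third field (index 2) differs, which is the number discussed in the thread.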

@pitrou
Member

pitrou commented Mar 4, 2024

By the way, 262658 is 0x40202 while 262914 is 0x40302, so this might very well be dependent on the R version you generated those files with (4.2.2 vs. 4.3.2?). Probably easy to verify.
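The hex reading above can be checked with a few lines of Python. This sketch assumes the integer packs the version as major * 65536 + minor * 256 + patch, which is what the 0x40202 / 0x40302 pattern suggests; `decode_r_version` is a hypothetical helper name, and interpreting the neighbouring 197888 field as a minimum reader version is likewise an assumption.

```python
def decode_r_version(packed: int):
    """Unpack major * 65536 + minor * 256 + patch into a (major, minor, patch) tuple."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


print(decode_r_version(262658))  # (4, 2, 2)
print(decode_r_version(262914))  # (4, 3, 2)
print(decode_r_version(197888))  # (3, 5, 0)
```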

@emmamendelsohn
Author

All files from my example with R 4.3.2.

@pitrou
Member

pitrou commented Mar 4, 2024

@emmamendelsohn Ah, I was talking about the uncompressed example from @amoeba . As I said above, differences in compressed files should not be a surprise. Do you still see differences if you generate uncompressed files?

@emmamendelsohn
Author

I see. Yes for uncompressed we found Linux and Windows had the same hash, while macOS was different, all on 4.3.2. Let me know if you'd like me to share those files.

@pitrou
Member

pitrou commented Mar 4, 2024

Thank you! Yes, you can share the Linux and macOS files for example.

(I suspect the final reason will be similar: slightly different R metadata serialized, for which I'll let R-Arrow experts answer :-))

@amoeba
Member

amoeba commented Mar 4, 2024

Thanks for looking at this @pitrou, the R version and metadata causing the issue makes sense. I'll look into what we're doing in that regard next.

@amoeba self-assigned this on Mar 4, 2024
@emmamendelsohn
Author

Actually, I was mistaken, all three systems have different hashes when uncompressed. This matches @amoeba's example above.
uncompressed-mtcars-parquet.zip

@pitrou
Member

pitrou commented Mar 5, 2024

Thanks @emmamendelsohn . After taking a quick look:

  1. all three files differ only in the Parquet metadata, not the actual data
  2. once deserialized, the Arrow schema is the same, except for R metadata (depending on R version perhaps: it might have been 4.3.3 on Linux vs. 4.3.2 on Windows and Mac?)
  3. hence, most of the difference seems to be in the way the Arrow schema is serialized by flatbuffers. This is certainly harmless as long as the data is the same once deserialized.

Is there a particular reason you were wondering about these files being different?
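The first point in the list above (same data, different Parquet metadata) can be checked without a Parquet library by hashing the footer separately from everything before it. The sketch below relies on the unencrypted Parquet layout: the file starts and ends with the `PAR1` magic, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The function names are hypothetical and error handling is minimal.

```python
import hashlib
import struct

MAGIC = b"PAR1"


def split_parquet(data: bytes):
    """Split raw Parquet bytes into (body, footer), excluding magic and length fields."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file")
    # Footer length is stored as a 4-byte little-endian integer before the trailing magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return data[4:footer_start], data[footer_start:-8]


def part_digests(data: bytes):
    """Return (body_md5, footer_md5) so the two regions can be compared separately."""
    body, footer = split_parquet(data)
    return hashlib.md5(body).hexdigest(), hashlib.md5(footer).hexdigest()
```

If two files give the same body digest but different footer digests, the difference is confined to the metadata, as described above.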

@pitrou
Member

pitrou commented Mar 5, 2024

This is an interesting flatbuffers commit message as we do have a similar piece of code. And binary inspection of the serialized Flatbuffers metadata seems to match this interpretation.

@emmamendelsohn
Author

@pitrou the different hashes became an issue for our team using a collaborative R targets workflow. In short, we use a shared S3 bucket for object storage so that each user can easily access the same versioned objects. This is especially useful for things like model objects that take a long time to produce. However, for large raw data files, we've found that the cost of transferring to/from AWS is too high, so each user saves the files locally as parquets. The targets version tracking system needs to register that these local files have the expected hash to be able to run downstream endpoints. When the file hashes differ across systems, targets detects a change and invalidates subsequent endpoints.

Anyway, we're rethinking some aspects of this approach, and so this may not be relevant in the future. Appreciate you looking into it nonetheless!

@pitrou
Member

pitrou commented Mar 5, 2024

Yes, I think you should probably reconsider, because it is not realistic to expect a sophisticated compression-based format like Parquet to always generate the same bitwise data using slightly different producers.
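One possible direction, sketched here purely as an illustration and not as a feature of targets or arrow, is to hash the decoded column values instead of the file bytes, so that footer metadata and compression details no longer affect the result. The sketch assumes the columns have already been read back into plain lists of floats (as in mtcars, where every column is DOUBLE); `content_hash` is a hypothetical name, and missing values and non-numeric types are not handled.

```python
import hashlib
import struct


def content_hash(columns):
    """Hash column names and float values in a fixed binary encoding.

    `columns` maps column name -> list of floats; insertion order is preserved,
    so reordering columns changes the hash.
    """
    h = hashlib.sha256()
    for name, values in columns.items():
        encoded = name.encode("utf-8")
        # Length prefixes keep adjacent names and value runs from running together.
        h.update(struct.pack("<I", len(encoded)))
        h.update(encoded)
        h.update(struct.pack("<Q", len(values)))
        for v in values:
            h.update(struct.pack("<d", v))  # IEEE 754 double, little-endian
    return h.hexdigest()


a = content_hash({"mpg": [21.0, 22.8], "cyl": [6.0, 4.0]})
b = content_hash({"mpg": [21.0, 22.8], "cyl": [6.0, 4.0]})
print(a == b)  # True
```

The trade-off is that every file has to be fully decoded before it can be hashed, which is slower than hashing raw bytes.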

@emmamendelsohn
Author

Makes sense!

@noamross
Contributor

noamross commented Mar 6, 2024

Would @nealrichardson @jonkeane or @paleolimbot be able to explain the R-specific metadata that is generated, and maybe point to the code in the package where this occurs? From a quick inspection it looks like a summary of the data frame schema in R's ASCII serialization format.

@amoeba
Member

amoeba commented Mar 6, 2024

@noamross it looks like we do that here

arrow/r/R/metadata.R

Lines 19 to 33 in 9ca7d78

.serialize_arrow_r_metadata <- function(x) {
  assert_is(x, "list")
  # drop problems attributes (most likely from readr)
  x[["attributes"]][["problems"]] <- NULL
  # remove the class if it's just data.frame
  if (identical(x$attributes$class, "data.frame")) {
    x$attributes <- x$attributes[names(x$attributes) != "class"]
    if (is_empty(x$attributes)) {
      x <- x[names(x) != "attributes"]
    }
  }
  out <- serialize(x, NULL, ascii = TRUE)

(calling into serialize as you guessed)

@nealrichardson
Member

@noamross IIRC the purpose of this is so that object attributes, including R class names, are preserved so that you can round-trip the data to parquet or arrow files and get the same R types back. If you had a bare data.frame and only vanilla R vector types, I would expect the metadata to be empty.

amoeba added a commit that referenced this issue May 16, 2024
…0392)

### Rationale for this change

This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration.  This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

### What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization.  I need to look at whether other calls to flatbuffers require similar treatment.

### Are these changes tested?

Not yet, this is a draft PR

### Are there any user-facing changes?

No 

* GitHub Issue: #40361

Lead-authored-by: Noam Ross <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
6 participants