[C++] Make Flatbuffers serialization more deterministic #40361

pitrou · 2024-03-05T12:07:43Z

Describe the enhancement requested

In #40202 (comment) it was determined that Flatbuffers serialization of an Arrow schema did not always result in the same binary encoding.

A changeset in Flatbuffers led us to a likely explanation, as there's a place in our code where serialization of strings depends on argument evaluation order:

arrow/cpp/src/arrow/ipc/metadata_internal.cc

Lines 478 to 481 in 3ba6d28

    
    static KeyValueOffset AppendKeyValue(FBB& fbb, const std::string& key,
                                         const std::string& value) {
      return flatbuf::CreateKeyValue(fbb, fbb.CreateString(key), fbb.CreateString(value));
    }

Binary inspection of the data files provided in that issue seems to confirm that hypothesis.

This is obviously a very minor issue, but should also be easy to fix.
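To illustrate the hypothesis, here is a minimal self-contained sketch. It does not use Flatbuffers: `ToyBuilder`, `AppendKeyValueUnsequenced` and `AppendKeyValueSequenced` are hypothetical names invented for this example, and the builder only mimics the idea of appending strings to a shared buffer and returning their offsets. The point is that two side-effecting calls passed as arguments to one function call may be evaluated in either order, while assigning each result to a local variable first fixes the order.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a Flatbuffers-style builder (not the real API):
// each call appends to a shared buffer and returns the offset it wrote at.
struct ToyBuilder {
  std::string buffer;

  std::size_t CreateString(const std::string& s) {
    const std::size_t offset = buffer.size();
    buffer += s;
    buffer += '\0';
    return offset;
  }
};

using KeyValueOffsets = std::pair<std::size_t, std::size_t>;

// Order-dependent: both CreateString calls are arguments of the same function
// call, so the compiler may evaluate them in either order. The buffer layout
// (key first or value first) can therefore differ between compilers.
KeyValueOffsets AppendKeyValueUnsequenced(ToyBuilder& b, const std::string& key,
                                          const std::string& value) {
  return std::make_pair(b.CreateString(key), b.CreateString(value));
}

// Deterministic: each call is a separate full expression, so the key is
// always written before the value.
KeyValueOffsets AppendKeyValueSequenced(ToyBuilder& b, const std::string& key,
                                        const std::string& value) {
  const std::size_t key_offset = b.CreateString(key);
  const std::size_t value_offset = b.CreateString(value);
  return std::make_pair(key_offset, value_offset);
}
```

With the sequenced variant, the buffer for `("key", "value")` always holds the key bytes followed by the value bytes. With the unsequenced variant, either layout is allowed by the language, which would be consistent with the differing binary encodings described above.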

Component(s)

C++

pitrou · 2024-03-05T12:33:21Z

cc @felipecrv

noamross · 2024-03-06T22:36:49Z

Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer, but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however, requires a way to trigger a different evaluation order in the current code that would be corrected by the change.

amoeba · 2024-03-06T23:13:47Z

I think a regression test would be enough, so long as you can reproduce the non-determinism in that test before making the code change.

noamross · 2024-03-06T23:18:45Z

What I mean is that I'm not sure how to reproduce the non-determinism in the original. In my understanding, the order is actually determined at compile-time and could differ across compilers or platforms.

amoeba · 2024-03-06T23:51:42Z

Right. There might be a better way than this, but you could add a test that exercises SchemaToFlatbuffer and asserts that the resulting Flatbuffer schema is byte-identical to a value you recorded on your own platform. That test should then fail on CI on one or more platforms. You could put up a draft PR and let CI exercise it to save you work. @felipecrv will probably have a better idea though.
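As a rough sketch of the comparison step in such a test, the helper below reports where two serialized buffers first diverge. It is self-contained and makes no Arrow or Flatbuffers calls: `FirstDifference` is a hypothetical name, and how the actual and recorded bytes are obtained (for example from SchemaToFlatbuffer) is assumed and left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first byte at which `actual`
// and `golden` differ, or -1 if the two buffers are byte-identical.
// If one buffer is a strict prefix of the other, the length of the shorter
// buffer is returned.
std::ptrdiff_t FirstDifference(const std::string& actual,
                               const std::string& golden) {
  if (actual == golden) {
    return -1;
  }
  const auto diff = std::mismatch(actual.begin(), actual.end(),
                                  golden.begin(), golden.end());
  return diff.first - actual.begin();
}
```

A regression test could assert that this returns -1 for the freshly serialized schema and the recorded bytes. Returning an offset instead of a plain boolean makes a CI failure easier to relate to the binary layout.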

felipecrv · 2024-03-07T23:10:17Z

> Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer, but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however, requires a way to trigger a different evaluation order in the current code that would be corrected by the change.

I wouldn't worry too much about adding a test. The non-determinism here doesn't even lead to bugs, but it does make the output non-deterministic, which can make testing and file comparisons harder. In a way, by fixing this you are already contributing to testability itself.

pitrou · 2024-03-07T23:35:44Z

I disagree and think we should try to add a test for this, if only to validate that we are actually fixing something.

…0392)

### Rationale for this change

This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

### What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment.

### Are these changes tested?

Not yet, this is a draft PR

### Are there any user-facing changes?

No

* GitHub Issue: #40361

Lead-authored-by: Noam Ross <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>

amoeba · 2024-05-16T23:18:18Z

Issue resolved by pull request 40392
#40392

amoeba · 2024-05-16T23:19:19Z

This has been merged. Thanks to those who reviewed and extra thanks to @noamross for contributing the initial PR.


pitrou added the Type: enhancement label Mar 5, 2024

github-actions bot added the Component: C++ label Mar 5, 2024

pitrou added good-second-issue and removed Component: C++ labels Mar 5, 2024

noamross mentioned this issue Mar 7, 2024

GH-40361: [C++] Make flatbuffers serialization more deterministic #40392

Merged

github-actions bot assigned noamross Mar 7, 2024

amoeba mentioned this issue Apr 4, 2024

[C++] Add test for flatbuffers serialization #41018

Closed

amoeba added the Component: C++ label May 16, 2024

amoeba added this to the 17.0.0 milestone May 16, 2024

amoeba closed this as completed May 16, 2024
