Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Make Flatbuffers serialization more deterministic #40361

Closed
pitrou opened this issue Mar 5, 2024 · 9 comments
Closed

[C++] Make Flatbuffers serialization more deterministic #40361

pitrou opened this issue Mar 5, 2024 · 9 comments

Comments

@pitrou
Copy link
Member

pitrou commented Mar 5, 2024

Describe the enhancement requested

In #40202 (comment) it was determined that Flatbuffers serialization of a Arrow schema did not always result in the same binary encoding.

A changeset in Flatbuffers led us to a likely explanation, as there's a place in our code where serialization of strings depends on argument evaluation order:

static KeyValueOffset AppendKeyValue(FBB& fbb, const std::string& key,
const std::string& value) {
return flatbuf::CreateKeyValue(fbb, fbb.CreateString(key), fbb.CreateString(value));
}

Binary inspection of the data files provided in that issue seems to confirm that hypothesis.

This is obviously a very minor issue, but should also be easy to fix.

Component(s)

C++

@pitrou
Copy link
Member Author

pitrou commented Mar 5, 2024

cc @felipecrv

@noamross
Copy link
Contributor

noamross commented Mar 6, 2024

Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however requires a way to trigger different evaluation order in the current code that would be corrected in the change.

@amoeba
Copy link
Member

amoeba commented Mar 6, 2024

I think a regression test would be enough, so long as you can reproduce the non-determinism in that test before making the code change.

@noamross
Copy link
Contributor

noamross commented Mar 6, 2024

What I mean is that I'm not sure how to reproduce the non-determinism in the original. In my understanding, the order is actually determined at compile-time and could differ across compilers or platforms.

@amoeba
Copy link
Member

amoeba commented Mar 6, 2024

Right. There might be a better way than this but adding a test that exercises SchemaToFlatbuffer and asserts the resulting Flatbuffer schema is byte-identical to some value you get, that test should fail on CI one one or more platforms. You could put up a draft PR and let CI exercise it to save you work. @felipecrv will probably have a better idea though.

@felipecrv
Copy link
Contributor

Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however requires a way to trigger different evaluation order in the current code that would be corrected in the change.

I wouldn't worry to much about adding a test. The non-determinism here doesn't even lead to bugs. But it creates non-determinism in output that can make testing and file comparisons harder. In a way, by fixing this you are already contributing to testability itself.

@pitrou
Copy link
Member Author

pitrou commented Mar 7, 2024

I disagree and think we should try to add a test for this, if only to validate that we are actually fixing something.

amoeba added a commit that referenced this issue May 16, 2024
…0392)

### Rationale for this change

This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration.  This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

### What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization.  I need to look at whether other calls to flatbuffers require similar treatment.

### Are these changes tested?

Not yet, this is a draft PR

### Are there any user-facing changes?

No 

* GitHub Issue: #40361

Lead-authored-by: Noam Ross <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
@amoeba amoeba added this to the 17.0.0 milestone May 16, 2024
@amoeba
Copy link
Member

amoeba commented May 16, 2024

Issue resolved by pull request 40392
#40392

@amoeba amoeba closed this as completed May 16, 2024
@amoeba
Copy link
Member

amoeba commented May 16, 2024

This has been merged. Thanks to those who reviewed and extra thanks to @noamross for contributing the initial PR.

vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…ic (apache#40392)

### Rationale for this change

This is the start of a PR to address apache#40361, and in turn apache#40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration.  This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

### What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization.  I need to look at whether other calls to flatbuffers require similar treatment.

### Are these changes tested?

Not yet, this is a draft PR

### Are there any user-facing changes?

No 

* GitHub Issue: apache#40361

Lead-authored-by: Noam Ross <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants