-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Make Flatbuffers serialization more deterministic #40361
Comments
cc @felipecrv |
Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however requires a way to trigger different evaluation order in the current code that would be corrected in the change. |
I think a regression test would be enough, so long as you can reproduce the non-determinism in that test before making the code change. |
What I mean is that I'm not sure how to reproduce the non-determinism in the original. In my understanding, the order is actually determined at compile-time and could differ across compilers or platforms. |
Right. There might be a better way than this but adding a test that exercises |
I wouldn't worry to much about adding a test. The non-determinism here doesn't even lead to bugs. But it creates non-determinism in output that can make testing and file comparisons harder. In a way, by fixing this you are already contributing to testability itself. |
I disagree and think we should try to add a test for this, if only to validate that we are actually fixing something. |
…0392) ### Rationale for this change This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences. ### What changes are included in this PR? So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment. ### Are these changes tested? Not yet, this is a draft PR ### Are there any user-facing changes? No * GitHub Issue: #40361 Lead-authored-by: Noam Ross <[email protected]> Co-authored-by: Bryce Mecum <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Bryce Mecum <[email protected]>
Issue resolved by pull request 40392 |
This has been merged. Thanks to those who reviewed and extra thanks to @noamross for contributing the initial PR. |
…ic (apache#40392) ### Rationale for this change This is the start of a PR to address apache#40361, and in turn apache#40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences. ### What changes are included in this PR? So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment. ### Are these changes tested? Not yet, this is a draft PR ### Are there any user-facing changes? No * GitHub Issue: apache#40361 Lead-authored-by: Noam Ross <[email protected]> Co-authored-by: Bryce Mecum <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Bryce Mecum <[email protected]>
Describe the enhancement requested
In #40202 (comment) it was determined that Flatbuffers serialization of a Arrow schema did not always result in the same binary encoding.
A changeset in Flatbuffers led us to a likely explanation, as there's a place in our code where serialization of strings depends on argument evaluation order:
arrow/cpp/src/arrow/ipc/metadata_internal.cc
Lines 478 to 481 in 3ba6d28
Binary inspection of the data files provided in that issue seems to confirm that hypothesis.
This is obviously a very minor issue, but should also be easy to fix.
Component(s)
C++
The text was updated successfully, but these errors were encountered: