Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40361: [C++] Make flatbuffers serialization more deterministic #40392

Merged
merged 18 commits into from
May 16, 2024

Conversation

noamross
Copy link
Contributor

@noamross noamross commented Mar 7, 2024

Rationale for this change

This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment.

Are these changes tested?

Not yet, this is a draft PR

Are there any user-facing changes?

No

Copy link

github-actions bot commented Mar 7, 2024

⚠️ GitHub issue #40361 has been automatically assigned in GitHub to PR creator.

Copy link

github-actions bot commented Mar 7, 2024

⚠️ GitHub issue #40361 has no components, please add labels for components.

Copy link

github-actions bot commented Mar 7, 2024

⚠️ GitHub issue #40361 has no components, please add labels for components.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 7, 2024
@amoeba
Copy link
Member

amoeba commented Mar 7, 2024

Thanks @noamross!

@kou can you please approve all workflows to run?

@kou kou changed the title GH-40361 [C++] Make flatbuffers serialization more deterministic GH-40361: [C++] Make flatbuffers serialization more deterministic Mar 7, 2024
@kou
Copy link
Member

kou commented Mar 7, 2024

Approved!

Copy link

github-actions bot commented Mar 7, 2024

⚠️ GitHub issue #40361 has no components, please add labels for components.

@pitrou
Copy link
Member

pitrou commented Mar 8, 2024

As discussed on the GH issue, it would be useful to add a test using output from your machine as a reference.
(also, perhaps open a separate PR with just the test to validate that it would fail on CI)

@amoeba
Copy link
Member

amoeba commented Apr 2, 2024

Hi @noamross, would you be interested in adding the test described above to this PR? If not, I could take a shot at it.

@noamross
Copy link
Contributor Author

noamross commented Apr 3, 2024

@amoeba Sorry to open this and abandon it, it just became a very busy month and my C++ is pretty limited. I'd be happy if you would!

@amoeba
Copy link
Member

amoeba commented Apr 3, 2024

No problem at all @noamross, it's awesome you filed a PR for the issue and it's very much appreciated. I'll work on adding and testing the test.

@noamross
Copy link
Contributor Author

noamross commented Apr 3, 2024

Thanks! I expect the files attached to #40202 (comment) could serve for a test of whether a generated file is identical to a reference from a different system, but I note that they would change with the R version being used.

@amoeba
Copy link
Member

amoeba commented Apr 4, 2024

I put up a draft PR with a test to exercise this and hopefully watch it fail on CI at #41018. It fails on my amd64 linux machine so I think it's close to what we need here. I'll watch CI on that PR and report back.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 12, 2024
@amoeba
Copy link
Member

amoeba commented Apr 12, 2024

@github-actions crossbow submit -g cpp

Copy link

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/8668333493

@amoeba
Copy link
Member

amoeba commented Apr 12, 2024

@github-actions crossbow submit -g cpp

Copy link

Revision: 9117945

Submitted crossbow builds: ursacomputing/crossbow @ actions-e9bd83c5da

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@amoeba
Copy link
Member

amoeba commented Apr 13, 2024

Things look good locally (testing on macOS-aarch64-clang and Linux-amd64-gcc) but I see an issue with my includes in a few of the above jobs. I'll fix those soon.

@amoeba
Copy link
Member

amoeba commented May 1, 2024

@github-actions crossbow submit -g cpp

Copy link

github-actions bot commented May 1, 2024

Revision: c3af79d

Submitted crossbow builds: ursacomputing/crossbow @ actions-08f54365f0

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@amoeba amoeba marked this pull request as ready for review May 1, 2024 01:31
@amoeba
Copy link
Member

amoeba commented May 1, 2024

I've pushed up a fix to address the includes issue and the crossbow jobs looked clear (failures are unrelated and the test we expect to pass does indeed pass).

I've marked this as ready for review. @pitrou do you have time to review this? cc @kou

@amoeba amoeba force-pushed the feature/deterministic-fb-serialization branch from 11958e1 to 4bd0339 Compare May 16, 2024 00:24
@amoeba
Copy link
Member

amoeba commented May 16, 2024

Thanks @pitrou for the review. I rebased from apache/main, accepted your three changes as-is, and added one extra change I caught while dealing with conflicts on rebase, 4bd0339. In 4c8eff2 @kou tweaked the CMakeList and I think this PR should make the same change as was made there.

PS: After pushing I found the ipc/CMakeLists.txt needed formatting, so that was added here too.

@amoeba
Copy link
Member

amoeba commented May 16, 2024

Parquet tests were failing all over on CI and due to needing to bump submodules so I pushed 4ebe53b so we could see CI.

@pitrou
Copy link
Member

pitrou commented May 16, 2024

@github-actions crossbow submit -g cpp

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there are still spurious parquet-testing updates in this PR?

Copy link

Revision: 4ebe53b

Submitted crossbow builds: ursacomputing/crossbow @ actions-d1985bc136

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, can you switch the edit of parquet-testing back?

@amoeba
Copy link
Member

amoeba commented May 16, 2024

Thanks all. The submodule change to help CI pass has been reverted in this PR. I'm merging this now.

@amoeba amoeba merged commit 74f7578 into apache:main May 16, 2024
34 of 36 checks passed
@amoeba amoeba removed the awaiting merge Awaiting merge label May 16, 2024
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 74f7578.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

@github-actions github-actions bot added the awaiting changes Awaiting changes label May 17, 2024
amoeba added a commit that referenced this pull request May 17, 2024
### Rationale for this change

#40392 introduced a mistake in the the associated CMakeLists.txt, noticed by @ kou in #40392 (comment).

### What changes are included in this PR?

A reversion of that change.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ic (apache#40392)

### Rationale for this change

This is the start of a PR to address apache#40361, and in turn apache#40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration.  This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

### What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization.  I need to look at whether other calls to flatbuffers require similar treatment.

### Are these changes tested?

Not yet, this is a draft PR

### Are there any user-facing changes?

No 

* GitHub Issue: apache#40361

Lead-authored-by: Noam Ross <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
### Rationale for this change

apache#40392 introduced a mistake in the the associated CMakeLists.txt, noticed by @ kou in apache#40392 (comment).

### What changes are included in this PR?

A reversion of that change.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants