GH-44513: [C++] Fix overflow issues for large build side in swiss join #45108
Conversation
Hi @pitrou, would you help take a look? Thanks.
```cpp
// This is to prevent index overflow issues in GH-44513.
// NB: Use zero-extend conversion for unsigned hash.
__m256i hash_lo = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(hash));
__m256i hash_hi = _mm256_cvtepu32_epi64(_mm256_extracti128_si256(hash, 1));
__m256i local_slot =
    _mm256_set1_epi64x(reinterpret_cast<const uint64_t*>(local_slots)[i]);
local_slot = _mm256_shuffle_epi8(
```
Hmm... so this first expands with `_mm256_shuffle_epi8` from 8-bit to 32-bit lanes, and then `_mm256_cvtepi32_epi64` below expands it from 32-bit to 64-bit lanes? Would it be quicker to shuffle directly from 8-bit to 64-bit (twice, I suppose)?

(Interestingly, `_mm256_shuffle_epi8` is faster than `_mm256_cvtepi32_epi64` according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_shuffle_epi8&ig_expand=1798,6006,1628,6006)
I was thinking that we could save one multiply of `local_offset * byte_size`. But yeah, once we have shuffled to 64-bit lanes, we can use `_mm256_mul_epi32` (5 cycles) to replace `_mm256_mullo_epi32` (10 cycles). Then we have 2 `_mm256_shuffle_epi8`s (1 cycle each) + 2 `_mm256_mul_epi32`s = 12 cycles in total, vs. 1 `_mm256_shuffle_epi8` + 1 `_mm256_mullo_epi32` + 2 `_mm256_cvtepi32_epi64`s (3 cycles each) = 17 cycles in total, which is still a win.
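For reference, a minimal sketch of the two-shuffle variant described above, assuming the same setup as the snippet under review (`local_slots` holding 8-bit slot ids, broadcast eight bytes at a time); the helper name and shuffle masks are illustrative, not the actual patch:

```cpp
#include <immintrin.h>
#include <cstdint>

// Zero-extend 8 packed slot bytes directly into two vectors of four 64-bit
// lanes each, using one _mm256_shuffle_epi8 per half (0x80 control bytes,
// written as -128 below, produce zeros). Hypothetical helper for illustration.
inline void expand_slots_to_epi64(const uint8_t* local_slots, int i,
                                  __m256i* slot_lo, __m256i* slot_hi) {
  // Broadcast the 8 bytes so each 128-bit lane holds bytes b0..b7.
  __m256i v =
      _mm256_set1_epi64x(reinterpret_cast<const uint64_t*>(local_slots)[i]);
  // b0..b3 -> the four 64-bit lanes of *slot_lo.
  *slot_lo = _mm256_shuffle_epi8(
      v, _mm256_setr_epi8(0, -128, -128, -128, -128, -128, -128, -128,
                          1, -128, -128, -128, -128, -128, -128, -128,
                          2, -128, -128, -128, -128, -128, -128, -128,
                          3, -128, -128, -128, -128, -128, -128, -128));
  // b4..b7 -> the four 64-bit lanes of *slot_hi.
  *slot_hi = _mm256_shuffle_epi8(
      v, _mm256_setr_epi8(4, -128, -128, -128, -128, -128, -128, -128,
                          5, -128, -128, -128, -128, -128, -128, -128,
                          6, -128, -128, -128, -128, -128, -128, -128,
                          7, -128, -128, -128, -128, -128, -128, -128));
}
```

Each 64-bit lane then holds one zero-extended slot id, ready to be multiplied against the byte size without an intermediate 32-bit stage.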
I've updated. Thank you for this.
```cpp
__m256i local_slot_hi =
    _mm256_cvtepi32_epi64(_mm256_extracti128_si256(local_slot, 1));
__m256i pos_lo =
    _mm256_srlv_epi64(hash_lo, _mm256_set1_epi64x(bits_hash_ - log_blocks_));
```
By the way, why not `_mm256_srli_epi64(hash_lo, bits_hash_ - log_blocks_)`?
Just copied from the original code, plus I wasn't aware of `_mm256_srli_epi64` then - still learning :)

Updated here, along with a couple of other unnecessary vector shifts. Thank you!
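For reference, a small sketch of the equivalence being discussed, with concrete constants standing in for the members from the snippet above: both forms compute the same per-lane logical right shift when every lane uses the same count, but `_mm256_srli_epi64` takes the count directly while `_mm256_srlv_epi64` needs a count vector to be materialized first.

```cpp
#include <immintrin.h>

constexpr int bits_hash = 32;   // stands in for bits_hash_
constexpr int log_blocks = 26;  // example value of log_blocks_

inline void shift_both_ways(__m256i hash_lo, __m256i* a, __m256i* b) {
  // Variable per-lane shift: the count vector must be built first.
  *a = _mm256_srlv_epi64(hash_lo, _mm256_set1_epi64x(bits_hash - log_blocks));
  // Uniform shift: same result, no count vector needed.
  *b = _mm256_srli_epi64(hash_lo, bits_hash - log_blocks);
}
```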
```cpp
pos_lo = _mm256_mul_epi32(pos_lo, _mm256_set1_epi32(byte_multiplier));
pos_hi = _mm256_mul_epi32(pos_hi, _mm256_set1_epi32(byte_multiplier));
```
For the record, why are we multiplying in the signed domain rather than unsigned?
Yeah, we should use an unsigned multiply. But actually they are the same in this specific case (i.e., both operands are less than `0x80000000` - note that `log_blocks_` is strictly less than `32`). Even if the result is larger than `uint32_max`, `_mm256_mul_epi32` won't do sign-extension.

Anyway, I'll update. Thank you.
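To illustrate the point, a standalone sketch using the `N = 26`, 40-byte-block numbers from this PR: both intrinsics multiply only the low 32 bits of each 64-bit lane into a 64-bit product, so they agree whenever both operands are below `0x80000000`, even when the product exceeds `uint32_max`.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
  // Worst-case block id for N = 26, times a 40-byte block.
  __m256i pos = _mm256_set1_epi64x((1u << 26) - 1);  // 67108863
  __m256i mul = _mm256_set1_epi64x(40);
  __m256i s = _mm256_mul_epi32(pos, mul);  // signed 32x32 -> 64
  __m256i u = _mm256_mul_epu32(pos, mul);  // unsigned 32x32 -> 64
  alignas(32) int64_t rs[4], ru[4];
  _mm256_store_si256(reinterpret_cast<__m256i*>(rs), s);
  _mm256_store_si256(reinterpret_cast<__m256i*>(ru), u);
  // Both print 2684354520: larger than INT32_MAX, but fine in 64-bit lanes.
  printf("%lld %lld\n", (long long)rs[0], (long long)ru[0]);
  return 0;
}
```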
Done.
@ursabot please benchmark
Commit 4462ceb already has scheduled benchmark runs.
Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit 4462ceb. There were 29 benchmark results with an error.

There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details.
Hi @pitrou, can we move on with this?
LGTM except for a potential typo that should probably have failed the tests??
(does this lack test coverage?)
@github-actions crossbow submit -g cpp

Revision: 4462ceb Submitted crossbow builds: ursacomputing/crossbow @ actions-6f38216180
The above crossbow run is to check whether the typo identified in one of the review comments fails any tests.
It doesn't seem to, which sounds worrying. Could we check whether 1) the given codepath is actually not called anywhere due to a logic bug, or 2) the given codepath is currently not exercised by the test suite? Either way, it would deserve fixing IMHO.
Oh, thank you. You can of course disregard my previous comment, then.
Yeah, this is merely telling us that there is no BMI2-capable machine in our CI, which is still worrying, but less so than if our tests didn't exercise the code enough.
Merging. Thank you @pitrou for the thorough review!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 32fcd18. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 42 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change

#44513 triggers two distinct overflow issues within swiss join, both happening when the build side table contains a large enough number of rows or distinct keys. (Hash join build sides at this extent are rather rare, so we haven't seen them reported until now.)

1. The first issue is that our swiss table implementation takes the higher `N` bits of the 32-bit hash value as the index into a buffer storing "block"s (a block contains `8` key ids - in some code also referred to as "group" ids). This `N`-bit number is further multiplied by the size of a block, which is also related to `N`. The `N` in the case of #44513 is `26` and a block takes `40` bytes, so the multiply can produce a number over `1 << 31` (negative when interpreted as signed 32-bit). In our AVX2 specialization of accessing the block buffer (https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal_avx2.cc#L404), an issue like #41813 (comment) shows up. This is the actual issue that directly produced the segfault in #44513.
2. The other issue is that we take the `7` bits of the 32-bit hash value after the `N` bits as a "stamp" (to quickly fail the hash comparison). But when `N` is greater than `25`, arithmetic like https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal.cc#L397 (`bits_hash_` is `constexpr 32`, `log_blocks_` is `N`, `bits_stamp_` is `constexpr 7`; this retrieves the stamp from a hash) produces `hash >> -1`, aka `hash >> 0xFFFFFFFF`, aka `hash >> 31` (the leading `1`s are trimmed), so the stamp value is wrong and results in falsely mismatched rows. This is the reason for my false-positive run in #44513 (comment). (Both overflows are illustrated in the sketch after this list.)
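A minimal scalar sketch of both overflows, plugging in the `N = 26` and 40-byte-block numbers above (standalone, for illustration only):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t hash = 0xFFFFFFFFu;  // worst-case hash
  const int N = 26;                   // log_blocks_ in GH-44513
  const int block_bytes = 40;

  // Issue 1: the byte offset exceeds INT32_MAX, so a 32-bit signed index
  // gather would address memory with a negative offset.
  uint32_t block_id = hash >> (32 - N);                // up to 2^26 - 1
  uint64_t offset = uint64_t{block_id} * block_bytes;  // 2684354520
  printf("offset = %llu, as int32 = %d\n", (unsigned long long)offset,
         (int32_t)offset);  // as int32: -1610612776

  // Issue 2: 32 - N - 7 == -1, and a shift count of -1 is masked down to 31
  // by the hardware, so the "stamp" is taken from the wrong bits.
  printf("stamp shift = %d\n", 32 - N - 7);
  return 0;
}
```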
What changes are included in this PR?

For issue 1, use the 64-bit index gather intrinsic to avoid the offset overflow; see the sketch below.
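A sketch of what a 64-bit index gather looks like, under the assumption that the byte-scaled offsets are split into a low and a high half; the helper name is illustrative, not the exact code in this PR.

```cpp
#include <immintrin.h>

// _mm256_i32gather_epi32 treats each 32-bit index as signed, so byte offsets
// above 2^31 go negative. Gathering with four 64-bit indices per half avoids
// that. Hypothetical helper for illustration.
inline __m256i gather8_by_byte_offset(const int* base, __m256i offsets_lo,
                                      __m256i offsets_hi) {
  // offsets_lo / offsets_hi each hold four 64-bit byte offsets; scale = 1
  // means they are used as raw byte displacements from `base`.
  __m128i lo = _mm256_i64gather_epi32(base, offsets_lo, 1);
  __m128i hi = _mm256_i64gather_epi32(base, offsets_hi, 1);
  return _mm256_set_m128i(hi, lo);
}
```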
For issue 2, do not right-shift the hash if `N + 7 >= 32`. This effectively allows the bits to overlap between the block id (the `N` bits) and the stamp (the `7` bits). Though this may introduce more false-positive hash comparisons (and thus worsen performance), I think it is still more reasonable than brutally failing for `N > 25`. I introduce two members, `bits_shift_for_block_and_stamp_` and `bits_shift_for_block_`, which are derived from `log_blocks_` - in particular, set to `0` and `32 - N` respectively when `N + 7 >= 32` - to avoid branching like `if (log_blocks_ + bits_stamp_ > bits_hash_)` in tight loops.
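A scalar sketch of how the two members keep the tight loop branch-free; the member names come from this PR, but the surrounding functions are illustrative, and the `>=` comparison follows the description's `N + 7 >= 32` wording.

```cpp
#include <cstdint>

constexpr int bits_hash_ = 32;
constexpr int bits_stamp_ = 7;

struct ShiftSetup {
  int bits_shift_for_block_and_stamp_;
  int bits_shift_for_block_;
};

// Derived once whenever log_blocks_ (N) changes, outside the hot loop.
inline ShiftSetup DeriveShifts(int log_blocks) {
  if (log_blocks + bits_stamp_ >= bits_hash_) {
    // Overflow case: don't shift first; stamp bits overlap block-id bits.
    return {0, bits_hash_ - log_blocks};
  }
  return {bits_hash_ - log_blocks - bits_stamp_, bits_stamp_};
}

// In the tight loop: both cases use the same branch-free shape.
inline void ExtractBlockAndStamp(uint32_t hash, const ShiftSetup& s,
                                 uint32_t* block_id, uint32_t* stamp) {
  uint32_t block_and_stamp = hash >> s.bits_shift_for_block_and_stamp_;
  *block_id = block_and_stamp >> s.bits_shift_for_block_;
  *stamp = block_and_stamp & ((1u << bits_stamp_) - 1);
}
```

With the shifts precomputed this way, the overflow case costs at most some extra false-positive stamp matches rather than a branch per row.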
Are these changes tested?

The fix is manually tested with the original case on my local machine. (I do have a concrete C++ UT to verify the fix, but it requires too many resources and runs for too long, so it is impractical to run in any reasonable CI environment.)
Are there any user-facing changes?
None.

* GitHub Issue: #44513

Lead-authored-by: Rossi Sun <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>