…bit (#43389)
### Rationale for this change
The row table uses `uint32_t` as the row offset within the row data buffer, effectively preventing the row data from growing beyond 4GB. This is quite restrictive, and the impact is described in more detail in #43495. This PR proposes to widen the row offset from 32-bit to 64-bit to address this limitation.
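Conceptually, the change amounts to widening the element type of the row table's offsets buffer. The sketch below is only a schematic under that framing; the struct and member names are illustrative and do not correspond to the actual `arrow::compute` row-table classes.

```cpp
#include <cstdint>
#include <vector>

// Schematic only: names are illustrative, not Arrow's actual row-table internals.
struct RowTableSketch {
  // Before this change: std::vector<uint32_t> offsets_;
  std::vector<uint64_t> offsets_;   // start of row i within row_data_, now 64-bit
  std::vector<uint8_t> row_data_;   // concatenated encoded rows, may exceed 4GB

  const uint8_t* row_ptr(uint64_t row_id) const {
    return row_data_.data() + offsets_[row_id];
  }
};
```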
#### Benefits
Currently, the row table has three major limitations:
1. The overall data size cannot exceed 4GB.
2. The size of a single row cannot exceed 4GB.
3. The number of rows cannot exceed 2^32.
This enhancement will eliminate the first limitation, while the second and third are far less likely to be hit in practice (see the rough arithmetic below). Thus, this change will enable a significant range of use cases that are currently unsupported.
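As a back-of-the-envelope sketch of why the first limitation dominates (the 150-byte row width below is a made-up assumption, not a measurement): with such rows the 4GB data limit is reached after roughly 28.6 million rows, far below the 2^32 row-count limit, and a single 4GB row is rare in practice.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Assume ~150 bytes per encoded row (made-up figure for illustration).
  const uint64_t row_width = 150;
  const uint64_t four_gb = uint64_t{1} << 32;
  std::cout << "rows until the 4GB data limit: " << four_gb / row_width << "\n";  // ~28.6 million
  std::cout << "row-count limit (2^32):        " << four_gb << "\n";              // ~4.3 billion
  return 0;
}
```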
#### Overhead
Of course, this will introduce some overhead:
1. An extra 4 bytes of memory consumption for each row due to the offset size difference from 32-bit to 64-bit.
2. A wider offset type requires a few more SIMD instructions in each 8-row processing iteration (sketched below).
In my opinion, this overhead is justified by the benefits listed above.
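For scale, the extra 4 bytes per row in point 1 amount to roughly 400 MB per 100 million rows. For point 2, the following is a minimal AVX2 sketch (illustrative only, not the actual Arrow kernels) of why wider offsets cost a few extra instructions: eight 32-bit offsets fit in a single 256-bit register, while eight 64-bit offsets need two.

```cpp
#include <cstdint>
#include <immintrin.h>

// Illustrative only; this is not the actual Arrow AVX2 kernel code.
// With 32-bit offsets, the offsets of 8 rows fit into one 256-bit register.
static inline __m256i Load8Offsets32(const uint32_t* offsets) {
  return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets));
}

// With 64-bit offsets, the same 8 rows need two 256-bit registers, i.e. a few
// extra loads (and address computations) per 8-row block.
static inline void Load8Offsets64(const uint64_t* offsets, __m256i* lo, __m256i* hi) {
  *lo = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets));
  *hi = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets + 4));
}
```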
### What changes are included in this PR?
Change the row offset of the row table from 32-bit to 64-bit. Related code in row comparison/encoding and the Swiss join has been updated accordingly.
### Are these changes tested?
Tests are included.
### Are there any user-facing changes?
Users could potentially see higher memory consumption when using Acero's hash join and hash aggregation. On the other hand, certain use cases that used to fail are now able to complete.
* GitHub Issue: #43495
Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Describe the enhancement requested
We’ve seen several reports about the hash join not working for large inputs (e.g., #34474, #37655, and #36995). The reason turns out to be that the row table (the hash table for the hash join) uses `uint32_t` to represent the row offset within the row data buffer, effectively preventing the row data from exceeding 4GB.

What makes things worse is that, when this limitation is exceeded, users can barely work around it by regular methods like "splitting the input into smaller batches," which works for many other issues. Because the row table accumulates all the input data, smaller batches do not change the overall data size.
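To make the failure mode concrete, here is a hypothetical, self-contained illustration (the 256-byte row width and the row count are made up, and this is not Arrow's actual row-table code): once the accumulated row data passes 4GB, a 32-bit running offset silently wraps around, so no amount of input batching avoids the problem.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t row_width = 256;        // bytes per row (made-up figure)
  const uint64_t num_rows = 20'000'000;  // ~5.1GB of accumulated row data
  uint32_t offset32 = 0;                 // what a 32-bit row offset would hold
  uint64_t offset64 = 0;                 // the true byte position
  for (uint64_t i = 0; i < num_rows; ++i) {
    offset32 += row_width;  // silently wraps once the total passes UINT32_MAX
    offset64 += row_width;
  }
  std::cout << "true offset:   " << offset64 << "\n";  // 5120000000
  std::cout << "32-bit offset: " << offset32 << "\n";  // 825032704 (wrapped)
  return 0;
}
```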
There are also some other aspects: the `CompareColumnsToRows` fix #42188 and "GH-43202: [C++][Compute] Detect and explicit error for offset overflow in row table" (#43226) address or detect certain edge cases related to the row offset. Even for the fixed-length code path, which doesn’t deal with the offset buffer at all and thus is supposed to be less problematic, there are obvious offset overflow issues like [1] and [2] (these issues are currently unreported but observed in my local experiments; a sketch of this class of bug follows the references below).

Therefore, we should consider widening the row offset of the row table to 64-bit.
[1] arrow/cpp/src/arrow/compute/row/compare_internal.cc, line 108 (at commit 187197c)
[2] arrow/cpp/src/arrow/compute/row/compare_internal_avx2.cc, lines 243–244 (at commit 187197c)
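As a purely illustrative example of the kind of overflow [1] and [2] point at (this is not the code at those locations; the row id and fixed length are made-up numbers): on the fixed-length path a byte position can be formed as `row_id * fixed_length`, and if that product is computed in 32-bit arithmetic it wraps long before the 64-bit result would.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Made-up numbers: 30 million rows of 200 fixed-length bytes is ~6GB of data.
  const uint32_t row_id = 30'000'000;
  const uint32_t fixed_length = 200;
  const uint32_t bad = row_id * fixed_length;  // product formed in 32-bit arithmetic wraps
  const uint64_t good = static_cast<uint64_t>(row_id) * fixed_length;
  std::cout << bad << " vs " << good << "\n";  // 1705032704 vs 6000000000
  return 0;
}
```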
### Component(s)
C++