…bit (#43389)
### Rationale for this change
The row table uses `uint32_t` as the row offset within the row data buffer, effectively preventing the row data from growing beyond 4GB. This is quite restrictive, and the impact is described in more detail in #43495. This PR proposes to widen the row offset from 32-bit to 64-bit to address this limitation.
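Conceptually, the change amounts to widening the element type of the row table's offsets buffer. The sketch below is only a schematic under that framing; the struct and member names are illustrative and do not correspond to the actual `arrow::compute` row-table classes.

```cpp
#include <cstdint>
#include <vector>

// Schematic only: names are illustrative, not Arrow's actual row-table internals.
struct RowTableSketch {
  // Before this change: std::vector<uint32_t> offsets_;
  std::vector<uint64_t> offsets_;   // start of row i within row_data_, now 64-bit
  std::vector<uint8_t> row_data_;   // concatenated encoded rows, may exceed 4GB

  const uint8_t* row_ptr(uint64_t row_id) const {
    return row_data_.data() + offsets_[row_id];
  }
};
```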
#### Benefits
Currently, the row table has three major limitations:
1. The overall data size cannot exceed 4GB.
2. The size of a single row cannot exceed 4GB.
3. The number of rows cannot exceed 2^32.
This enhancement will eliminate the first limitation, while the second and third are far less likely to be hit in practice (see the rough arithmetic below). Thus, this change will enable a significant range of use cases that are currently unsupported.
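As a back-of-the-envelope sketch of why the first limitation dominates (the 150-byte row width below is a made-up assumption, not a measurement): with such rows the 4GB data limit is reached after roughly 28.6 million rows, far below the 2^32 row-count limit, and a single 4GB row is rare in practice.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Assume ~150 bytes per encoded row (made-up figure for illustration).
  const uint64_t row_width = 150;
  const uint64_t four_gb = uint64_t{1} << 32;
  std::cout << "rows until the 4GB data limit: " << four_gb / row_width << "\n";  // ~28.6 million
  std::cout << "row-count limit (2^32):        " << four_gb << "\n";              // ~4.3 billion
  return 0;
}
```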
#### Overhead
Of course, this will introduce some overhead:
1. An extra 4 bytes of memory consumption for each row due to the offset size difference from 32-bit to 64-bit.
2. A wider offset type requires a few more SIMD instructions in each 8-row processing iteration (sketched below).
In my opinion, this overhead is justified by the benefits listed above.
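For scale, the extra 4 bytes per row in point 1 amount to roughly 400 MB per 100 million rows. For point 2, the following is a minimal AVX2 sketch (illustrative only, not the actual Arrow kernels) of why wider offsets cost a few extra instructions: eight 32-bit offsets fit in a single 256-bit register, while eight 64-bit offsets need two.

```cpp
#include <cstdint>
#include <immintrin.h>

// Illustrative only; this is not the actual Arrow AVX2 kernel code.
// With 32-bit offsets, the offsets of 8 rows fit into one 256-bit register.
static inline __m256i Load8Offsets32(const uint32_t* offsets) {
  return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets));
}

// With 64-bit offsets, the same 8 rows need two 256-bit registers, i.e. a few
// extra loads (and address computations) per 8-row block.
static inline void Load8Offsets64(const uint64_t* offsets, __m256i* lo, __m256i* hi) {
  *lo = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets));
  *hi = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets + 4));
}
```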
### What changes are included in this PR?
Change the row offset of the row table from 32-bit to 64-bit. Related code in row comparison/encoding and the Swiss join has been updated accordingly.
### Are these changes tested?
Tests are included.
### Are there any user-facing changes?
Users could potentially see higher memory consumption when using Acero's hash join and hash aggregation. On the other hand, certain use cases that used to fail are now able to complete.
* GitHub Issue: #43495
Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Describe the enhancement requested
We’ve seen several reports about the hash join not working for large inputs (e.g., #34474, #37655, and #36995). The reason turns out to be that the row table (the hash table for the hash join) uses `uint32_t` to represent the row offset within the row data buffer, effectively preventing the row data from exceeding 4GB.

What makes things worse is that, when this limitation is exceeded, users can barely work around it by regular methods like "splitting the input into smaller batches," which works for many other issues. Because the row table accumulates all the input data, smaller batches do not change the overall data size.
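To make the failure mode concrete, here is a hypothetical, self-contained illustration (the 256-byte row width and the row count are made up, and this is not Arrow's actual row-table code): once the accumulated row data passes 4GB, a 32-bit running offset silently wraps around, so no amount of input batching avoids the problem.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t row_width = 256;        // bytes per row (made-up figure)
  const uint64_t num_rows = 20'000'000;  // ~5.1GB of accumulated row data
  uint32_t offset32 = 0;                 // what a 32-bit row offset would hold
  uint64_t offset64 = 0;                 // the true byte position
  for (uint64_t i = 0; i < num_rows; ++i) {
    offset32 += row_width;  // silently wraps once the total passes UINT32_MAX
    offset64 += row_width;
  }
  std::cout << "true offset:   " << offset64 << "\n";  // 5120000000
  std::cout << "32-bit offset: " << offset32 << "\n";  // 825032704 (wrapped)
  return 0;
}
```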
There are also some other aspects: the `CompareColumnsToRows` fix #42188 and "GH-43202: [C++][Compute] Detect and explicit error for offset overflow in row table" (#43226) address or detect certain edge cases related to the row offset. Even for the fixed-length code path, which doesn’t deal with the offset buffer at all and thus is supposed to be less problematic, there are obvious offset overflow issues like [1] and [2] (these issues are currently unreported but observed in my local experiments; a sketch of this class of bug follows the references below).

Therefore, we should consider widening the row offset of the row table to 64-bit.
[1] arrow/cpp/src/arrow/compute/row/compare_internal.cc, line 108 (at commit 187197c)
[2] arrow/cpp/src/arrow/compute/row/compare_internal_avx2.cc, lines 243–244 (at commit 187197c)
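As a purely illustrative example of the kind of overflow [1] and [2] point at (this is not the code at those locations; the row id and fixed length are made-up numbers): on the fixed-length path a byte position can be formed as `row_id * fixed_length`, and if that product is computed in 32-bit arithmetic it wraps long before the 64-bit result would.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Made-up numbers: 30 million rows of 200 fixed-length bytes is ~6GB of data.
  const uint32_t row_id = 30'000'000;
  const uint32_t fixed_length = 200;
  const uint32_t bad = row_id * fixed_length;  // product formed in 32-bit arithmetic wraps
  const uint64_t good = static_cast<uint64_t>(row_id) * fixed_length;
  std::cout << bad << " vs " << good << "\n";  // 1705032704 vs 6000000000
  return 0;
}
```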
### Component(s)
C++