[R] Segfault when collecting parquet dataset query results #41813
Update:
In particular, when I first tried re-running the script on Friday, I used the existing renv.lock file. Because of the R version change, it probably downloaded the packages again; I'm not sure.
Update 2: I upgraded arrow to
Update 3: I cannot reproduce if I specify `Sys.setenv(ARROW_USER_SIMD_LEVEL = "avx")` before loading Arrow.
This is interesting. What kind of computer (and ideally, which specific model of CPU) are you running on?
Oh, sorry, I thought I specified that somewhere! Here's the first core of
The VM has 32 cores total. I'm still working on removing PII from the dataset so I can share it. It'll still be a few GB; is there any place I could put it once it's done?
Hey @mrd0ll4r, you could try uploading to my Dropbox if it's under 3GB and from there I can make it available to others: https://www.dropbox.com/request/bGhCIALbSd8izywgWSHi. Otherwise Google Drive might work fine. @zanmato1984, do you have any pointers here? I know you have some experience in some of these pieces.
I've created a reproducer and uploaded it to Dropbox, thanks! When you open the project, comment/uncomment the environment-setting line to trigger the bug. Good luck, and thanks for looking into this! Let me know if you need any additional info or resources.
Edit: I uploaded the core dump and apport report as well. The latter also contains the core dump, plus some additional info, but uses whatever format
I did fix a bug, #39577, related to
Besides, quoting @mrd0ll4r:
But the fix landed prior to 16.0.0, so this could be a new issue. @amoeba, how can I access the archive in your Dropbox? I can give it a try to reproduce.
Hi @zanmato1984, I made the files public at https://www.dropbox.com/scl/fo/5ao8vij4kogb16a3i1l63/AJ_fT08eJrtY-ouuh-nKOoc?rlkey=7ug6htsro09agqyuw6afoesvt&st=nar7b740&dl=0. Let me know if that doesn't work.
I'm now trying to reproduce using C++; this is probably more efficient considering my limited experience with R. The R reproduction script and the data seem very helpful. I'll get back when I find something.
Thanks @zanmato1984. I tried to reproduce the issue last night and was able to (which is good), so I'm going to reproduce with a debug build, hopefully later today.
Cool, @amoeba! Let me know if you need any help from me on your debugging. Meanwhile, I'll keep debugging as well.
I did get more info from a debug build, so this is something to go off of:
Edit: To add
I reproduced the bug in C++ and found about the same thing. (For anyone who is interested, check out my repro branch https://github.com/zanmato1984/arrow/tree/fix-41813-repro. Note you'll still need the data in the repro archive.)
The bug is that in this line:
If a slot of `offset_right` contains a value >= `0x80000000`, i.e., an offset within the row table larger than 2GB, then it is added to `right_base` as a negative integer, causing the gather to read data from an invalid address.
Proof as follows:
Further decoding each slot of `offset_right`:
Note that the last offset is larger than `0x80000000`. I'm working on a fix.
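For readers following along, here is a minimal C++ sketch of that failure mode. It assumes AVX2, and the function and buffer names (`gather_rows`, `right_base`, `offsets`) are illustrative rather than Arrow's actual code; the only point is that `_mm256_i32gather_epi32` interprets each index lane as a signed 32-bit integer:

```cpp
#include <cstdint>
#include <immintrin.h>

// Illustrative sketch (not Arrow code): gather one 32-bit value per row from
// `right_base` using per-row byte offsets stored as uint32_t.
void gather_rows(const uint8_t* right_base, const uint32_t* offsets, int32_t* out) {
  // Load 8 offsets into a vector register.
  __m256i offset_right =
      _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets));
  // _mm256_i32gather_epi32 treats each lane of the index vector as a *signed*
  // 32-bit integer, so any offset >= 0x80000000 (2 GB) becomes a negative
  // displacement and the gather reads from an address below right_base.
  __m256i values = _mm256_i32gather_epi32(
      reinterpret_cast<const int*>(right_base), offset_right, /*scale=*/1);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), values);
}
```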
Awesome work @zanmato1984, thank you for figuring this out.
…umnsToRows` (#42188)

### Rationale for this change

The AVX2 intrinsics `_mm256_i32gather_epi32`/`_mm256_i32gather_epi64` are used in the `CompareColumnsToRows` API and treat the `vindex` as a signed integer. In our row table implementation, we use `uint32_t` to represent the offset within the row table. When an offset is larger than `0x80000000` (2GB), the aforementioned intrinsics treat it as a negative offset and gather the data from an undesired address. For more details, see #41813 (comment). Considering there are no unsigned-32-bit-offset or 64-bit-offset counterparts of those intrinsics in AVX2, this issue can be simply mitigated by translating the base address and the offset:

```
new_base = base + 0x80000000;
new_offset = offset - 0x80000000;
```

### What changes are included in this PR?

Fix and UT that reproduces the issue.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

None.

* GitHub Issue: #41813

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
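As a rough, hedged illustration of that mitigation (illustrative names again, not the actual Arrow patch): shifting the base pointer up by 2GB and the offsets down by the same amount leaves `base + offset` unchanged while keeping every index representable as a signed 32-bit value.

```cpp
#include <cstdint>
#include <immintrin.h>

// Sketch of the base/offset translation described in the PR text above.
void gather_rows_translated(const uint8_t* right_base, const uint32_t* offsets,
                            int32_t* out) {
  // new_base = base + 0x80000000
  const uint8_t* new_base = right_base + 0x80000000ULL;
  __m256i offset_right =
      _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets));
  // new_offset = offset - 0x80000000: flipping the top bit is the same thing
  // modulo 2^32, and maps [0, 2^32) onto [-2^31, 2^31), which the signed
  // gather index can represent. new_base + new_offset == right_base + offset.
  __m256i new_offset =
      _mm256_xor_si256(offset_right, _mm256_set1_epi32(INT32_MIN));
  __m256i values = _mm256_i32gather_epi32(
      reinterpret_cast<const int*>(new_base), new_offset, /*scale=*/1);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), values);
}
```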
Issue resolved by pull request 42188
(#45108)

### Rationale for this change

#44513 triggers two distinct overflow issues within the swiss join, both happening when the build-side table contains a large enough number of rows or distinct keys. (Cases with a hash-join build side of this extent are rather rare, so we haven't seen them reported until now.)

1. The first issue is that our swiss table implementation takes the higher `N` bits of the 32-bit hash value as the index into a buffer storing "block"s (a block contains `8` key ids, in some code also referred to as "group" ids). This `N`-bit number is further multiplied by the size of a block, which is also related to `N`. The `N` in the case of #44513 is `26` and a block takes `40` bytes, so the multiplication can produce a number over `1 << 31` (negative when interpreted as a signed 32-bit integer). In our AVX2 specialization for accessing the block buffer, https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal_avx2.cc#L404, an issue like #41813 (comment) shows up. This is the actual issue that directly produced the segfault in #44513.
2. The other issue is that we take the `7` bits of the 32-bit hash value after the top `N` as a "stamp" (to quickly fail the hash comparison). But when `N` is greater than `25`, arithmetic code like https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal.cc#L397 (`bits_hash_` is `constexpr 32`, `log_blocks_` is `N`, `bits_stamp_` is `constexpr 7`; this retrieves the stamp from a hash) produces `hash >> -1`, a.k.a. `hash >> 0xFFFFFFFF`, a.k.a. `hash >> 31` (the leading `1`s are trimmed), so the stamp value is wrong and results in falsely mismatched rows. This is the reason for my false-positive run in #44513 (comment).

### What changes are included in this PR?

For issue 1, use the 64-bit-index gather intrinsic to avoid the offset overflow.

For issue 2, do not right-shift the hash if `N + 7 >= 32`. This effectively allows the bits to overlap between the block id (the `N` bits) and the stamp (the `7` bits). Though this may introduce more false-positive hash comparisons (and thus worsen performance), I think it is still more reasonable than brutally failing for `N > 25`. I introduce two members, `bits_shift_for_block_and_stamp_` and `bits_shift_for_block_`, derived from `log_blocks_` and set to `0` and `32 - N`, respectively, when `N + 7 >= 32`; this avoids branching like `if (log_blocks_ + bits_stamp_ > bits_hash_)` in tight loops.

### Are these changes tested?

The fix is manually tested with the original case on my local machine. (I do have a concrete C++ UT to verify the fix, but it requires too many resources and runs for too long, so it is impractical to run in any reasonable CI environment.)

### Are there any user-facing changes?

None.

* GitHub Issue: #44513

Lead-authored-by: Rossi Sun <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
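To make issue 2 more concrete, below is a hedged C++ sketch of how the two shift amounts could be derived from `log_blocks_`. The member names follow the PR text, but the struct, the helper function, and the usage shown are my reconstruction under the stated constants (32-bit hashes, 7-bit stamps), not the actual Arrow implementation:

```cpp
#include <cstdint>

// Reconstruction sketch, not Arrow code: derive branch-free shift amounts once,
// so the hot loop never has to test whether log_blocks + 7 exceeds 32.
struct BlockAndStampShifts {
  int bits_shift_for_block_and_stamp;  // drops the hash bits below the stamp
  int bits_shift_for_block;            // further shift that drops the stamp bits
};

inline BlockAndStampShifts ComputeShifts(int log_blocks) {
  constexpr int kBitsHash = 32;  // bits_hash_ in the PR text
  constexpr int kBitsStamp = 7;  // bits_stamp_ in the PR text
  if (log_blocks + kBitsStamp > kBitsHash) {
    // Overlapping case: the stamp shares its low bits with the block id
    // instead of producing an invalid negative shift (hash >> -1).
    return {0, kBitsHash - log_blocks};
  }
  return {kBitsHash - log_blocks - kBitsStamp, kBitsStamp};
}

// Possible usage (again a reconstruction):
//   uint32_t stamp    = (hash >> s.bits_shift_for_block_and_stamp) & 0x7F;
//   uint32_t block_id = hash >> (s.bits_shift_for_block_and_stamp +
//                                s.bits_shift_for_block);
```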
Hello!
I've been using arrow with R for a while now to great success.
Recently, I've re-opened an old project (managed with renv, so I'm pretty confident all the package versions were the same).
It is possible I upgraded the OS and/or OS packages in the meantime.
Now, some of my queries on a gzip-compressed dataset of parquet files lead to a segfault:
I have a core dump from that session, but it's 46GB.
The machine has 256GB RAM and another 256GB swap, so I'm confident that's not the problem.
I'm no expert at analyzing these things, but this is what I got:
I've tried:
- arrow `15.0.1` from RSPM. The above crash is from this version.
- checking that simple queries (e.g. `dataset %>% summarize(n=n())`) work, which they do

Specifically, this query works:

and this doesn't:
The dataset looks like this:
It's 3GB on disk, gzip compressed.
Unfortunately, I cannot share the dataset publicly as it contains sensitive information.
Overall, I'm pretty lost now.
The system is running Ubuntu 22.04, kernel:
Hope that helps somehow...
Component(s)
R