cpuminer-opt segfault when compiled with Clang #440

Open
JayDDee opened this issue Dec 16, 2024 · 4 comments

Comments

@JayDDee
Owner

JayDDee commented Dec 16, 2024

With the release of v24.7, cpuminer-opt can now be compiled for Windows 11 on ARM CPUs, with caveats.

A mysterious segfault can occur in the stratum code. The code implicated by the segfault is in util.c:send_line.
The segfault was traced to corruption of the sctx pointer, with its lower 32 bits zeroed. This suggests either stack or register corruption, since the sctx pointer was passed as a function argument.
Neither Linux on ARM nor Windows on x86_64 has this problem.

Both Windows and MSys2/MingW are immature on ARM and may be the cause. The test system is a VM running on an Apple host, which may also be the issue. Finally, it may be an issue with cpuminer-opt that only affects ARM & Windows, but that seems unlikely at this time.

While debugging by adding printf to capture specific data, the segfault magically disappeared. This made it difficult to pinpoint the exact location where the lower 32 bits of the sctx pointer get zeroed. But it also provided a workaround that either prevented the corruption or moved the corruption to some benign data.
This workaround has been made available in v24.7 and can be enabled by adding an option to CFLAGS during compilation.

While debugging I also tried moving the printf around, and it only worked if placed inside the while loop before the curl call.
I also tried sleep(1) in place of printf, which also prevented the segfault. However usleep, even with 1,000,000 usecs, did not prevent the segfault. This suggests it's not a timing issue or race condition. There's also no one to race with, as all miner threads are waiting for data from stratum.
It should also be noted that the segfault occurs while holding the sock mutex, and in general printf should be avoided in mutex regions. Ironically, in this case it serves to work around a segfault. This could be a hint.
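
To make the workaround concrete, here is a minimal sketch of how a compile-time flag such as -DARM_WIN_HACK (the flag named in the build instructions) can gate a printf inside a send loop. It is not the actual util.c code: send_fn, demo_send and send_all_sketch are hypothetical names invented for illustration, and demo_send merely stands in for the real curl call.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical transport callback standing in for the real curl send.
   Returns the number of bytes written, or a value <= 0 on error. */
typedef long (*send_fn)(const char *buf, size_t len);

/* Hypothetical demo transport: pretends to write at most 4 bytes per call. */
static long demo_send(const char *buf, size_t len)
{
    (void)buf;
    return (long)(len < 4 ? len : 4);
}

/* Send len bytes of buf, looping over partial writes.
   Returns 1 on success, 0 on failure. */
static int send_all_sketch(send_fn send_chunk, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
#ifdef ARM_WIN_HACK
        /* The workaround: a printf inside the loop, before the send call.
           It hides the crash on the test system but fixes nothing. */
        printf("sent %lu of %lu bytes\n",
               (unsigned long)sent, (unsigned long)len);
#endif
        long n = send_chunk(buf + sent, len - sent);
        if (n <= 0)
            return 0;
        sent += (size_t)n;
    }
    return 1;
}
```

Compiled without the flag, the loop contains no printf at all; adding -DARM_WIN_HACK to CFLAGS compiles it in. Calling send_all_sketch(demo_send, "hello world", 11) exercises the loop three times in either build.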

For the time being I'd like to identify which component is responsible for the segfault, so please report your experiences.

Building for Windows 11 on AArch64:

Follow the procedure for building for Linux on ARM with the following changes.

Msys2 does not yet support GCC on ARM so use CLANG. Install the clang mingw64 packages & use the clang shell environment.

The build procedure requires a combination of CFLAGS from the build-msys2.sh and arm-build.sh scripts, and possibly the flag to work around the segfault.
There is no build script for Windows on ARM, so enter the commands manually.

To compile without the workaround run this configure command:

$ CFLAGS="-O3 -march=native -Wall -flax-vector-conversions -D_WIN32_WINNT=0x0601" ./configure --with-curl

If a segfault occurs, run make clean and recompile, adding "-DARM_WIN_HACK" to CFLAGS:

$ CFLAGS="-O3 -march=native -Wall -flax-vector-conversions -D_WIN32_WINNT=0x0601 -DARM_WIN_HACK" ./configure --with-curl

Please report whether it works without the hack, only with the hack, or not at all.
Also report any other issues that may arise.
Please include the CPU model, Windows 11 build & other pertinent details in reports.

Reports are preferred here, but can also be made by a post in the bitcointalk forum or by email.

Other errata with Windows on ARM:

  • No precompiled binaries are provided; it must be compiled from source.
  • CPU & SW feature reporting is not working.
  • A warning about CPU affinity is sometimes displayed; it can be ignored.
  • Clang produces many assembler warnings during compilation; they can be ignored.
  • Clang is not reported as the compiler used for the build. A fix has already been found and will be in the next release.
@JayDDee JayDDee changed the title Windows 11 and aarch64 V24.7 Windows 11 and aarch64 Dec 17, 2024
@JayDDee JayDDee changed the title V24.7 Windows 11 and aarch64 V24.7 Windows 11 and AArch64 Dec 17, 2024
@JayDDee JayDDee changed the title V24.7 Windows 11 and AArch64 cpuminer-opt segfault when compiled with Clang Dec 30, 2024
@JayDDee
Owner Author

JayDDee commented Dec 30, 2024

A very similar, if not identical, segfault was seen on MacOS x86_64, a different CPU architecture and a different OS. It was also observed that the crash can occur on return from send_line, which also suggests stack corruption. It also suggests some randomness to the symptoms, depending on what data has been corrupted.
The documented workaround for Windows on ARM64 did not work on MacOS x86_64, confirming the workaround was just dumb luck.
The main common thread between these 2 samples is the use of Clang. When compiled with GCC on MacOS x86_64 it did not crash.
This also suggests that MacOS on ARM64 working when compiled with Clang may also be dumb luck. Fortunately MacOS can use GCC on both x86_64 and ARM64, and that will become the recommended build procedure in the next release.
Unfortunately GCC is not yet available for MSys2 on ARM64, so Windows on ARM64 will continue to be unpredictable and unsupportable.
The title of this issue will be changed to add Clang.

@JayDDee
Owner Author

JayDDee commented Jan 4, 2025

Not a lot of progress.
I rebuilt the MSys2 environment and nailed down precisely which packages are needed (Wiki updated). I also found the ARM64 package of Jansson so now cpuminer-opt will include the installed Jansson instead of compiling the source included in cpuminer-opt when building for Windows on ARM64.
This made no difference in the segfault. It still crashes without the workaround and works with the workaround. This eliminates Jansson as a possible cause.
The common factors in the 2 crash configurations are the use of Clang and using a VM. The host OSs are different, the guest OSs are different and the VM SW is different (qemu vs Virtualbox) but they're still VMs.
The Clang versions are also different.
More data, that is more samples, are needed for better focus.

@JayDDee
Owner Author

JayDDee commented Jan 6, 2025

Here's a trace of the segfault, created by adding printf calls that didn't affect the crash. The messages are self-explanatory and show the lower 32 bits of the sctx pointer set to all 0. Additionally, it shows the pointer changed between the end of stratum_send_line and returning from stratum_send_line, with no code in between except the function return.
The pointer is allocated on the stack, right after char *s, however it's likely it was in a register at the time. Only the lower 32 bits were trampled, so it was a 32 bit operation performed on a 64 bit value without disturbing the upper 32 bits. I see no legitimate reason to perform this kind of operation on a pointer.
Attempts to display &sctx showed an unchanged pointer value, but a segfault still occurred.

stratum_connect return
stratum_subscribe enter
sctx 00007FF64DBDB480: 85032770 000001dd 8506f540 000001dd
stratum_send_line returning
sctx 00007FF64DBDB480: 85032770 000001dd 8506f540 000001dd
s: 000001DD85103360: 7b 22 69 64 22 3a 20 31
stratum_subscribe: returned fron stratum_send_line
sctx 00007FF600000000
Segmentation fault
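
The pointer values in the trace can be checked with a few lines of C. This is only an illustrative sketch: clear_low32 and matches_low32_clobber are hypothetical helper names, not functions from cpuminer-opt. It shows that the bad value is exactly the good value with its low 32 bits masked off.

```c
#include <stdint.h>

/* Clear the low 32 bits of a 64-bit value, keeping the high 32 bits. */
static uint64_t clear_low32(uint64_t p)
{
    return p & UINT64_C(0xFFFFFFFF00000000);
}

/* Nonzero when `after` is `before` with only its low 32 bits zeroed. */
static int matches_low32_clobber(uint64_t before, uint64_t after)
{
    return before != after && after == clear_low32(before);
}
```

With the values from the trace, clear_low32(0x00007FF64DBDB480) gives 0x00007FF600000000, which is the sctx value printed after the return from stratum_send_line.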

@JayDDee
Owner Author

JayDDee commented Jan 7, 2025

Updating Windows 11 from 23H2 to 24H2 made no difference. That was expected because the OS is not suspect.
I was able to confirm register corruption, specifically the lower 32 bits set to all zero. In the lldb log below note:

  • the fault address: 0x7ff700000118
  • the fault instruction: ldr x8, [x19,#0x118]
  • the register dump: x19 = 0x00007ff700000000

Process 2976 stopped
* thread #6, stop reason = Exception 0xc0000005 encountered at address 0x7ff79da3c7e0: Access violation reading location 0x7ff700000118
frame #0: 0x00007ff79da3c7e0 cpuminer.exe stratum_subscribe + 280
cpuminer.exe stratum_subscribe:
-> 0x7ff79da3c7e0 <+280>: ldr x8, [x19, #0x118]
0x7ff79da3c7e4 <+284>: mov w9, #0x1 ; =1
0x7ff79da3c7e8 <+288>: add x1, sp, #0x110
0x7ff79da3c7ec <+292>: add x4, sp, #0x318
(lldb) register read
General Purpose Registers:
x0 = 0x0000000000000001
x1 = 0x000001b34517ae40
x2 = 0x0000000000000001
x3 = 0x00007ffbca3e4630 .refptr.ossl_cc_newreno_method + 9992
x4 = 0x00000000000001d3
x5 = 0x0000000000000000
x6 = 0x0000000000000001
x7 = 0x0000000000000000
x8 = 0x0000000000000001
x9 = 0x0000000000000000
x10 = 0x0000000000000000
x11 = 0x0000000000000000
x12 = 0x0000000000000000
x13 = 0x0000000000000000
x14 = 0x0000000000000000
x15 = 0x0000000000000000
x16 = 0x00007ffc076a386c libwinpthread-1.dll pthread_mutex_unlock
x17 = 0x0000000000000000
x18 = 0x0000000000000000
x19 = 0x00007ff700000000
x20 = 0x0000000000000000
x21 = 0x0000000000000000
x22 = 0x0000000faacff8b0
x23 = 0x000001b345619d90
x24 = 0x00007ff79dbbb9e0 g_work_lock
x25 = 0x00007ffc24accf60 ws2_32.dll select
x26 = 0x00007ff79db600fa .refptr.optind + 1754
x27 = 0x0000000000000000
x28 = 0x0000000000000000
fp = 0x00007ff79db60070 .refptr.optind + 1616
lr = 0x00007ff79da3c8dc cpuminer.exe stratum_subscribe + 532
sp = 0x0000000faacff7a0
pc = 0x00007ff79da3c7e0 cpuminer.exe stratum_subscribe + 280
cpsr = 0x80000040
(lldb)
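
The three values called out before the log are consistent with each other, which a short C sketch can confirm. ldr_effective_address is a hypothetical helper name used only for this illustration; it models the base-plus-immediate addressing of the faulting ldr instruction.

```c
#include <stdint.h>

/* Effective address of `ldr x8, [x19, #imm]`: base register plus immediate. */
static uint64_t ldr_effective_address(uint64_t base, uint64_t imm)
{
    return base + imm;
}
```

With x19 = 0x00007ff700000000 and an immediate of 0x118, the result is 0x7ff700000118, the address reported in the access violation. The upper 32 bits of x19 (0x00007ff7) match the upper 32 bits of pc, fp and lr in the same dump.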
