-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
subsys/logging: Add compiler barriers to msg stack allocation #46967
subsys/logging: Add compiler barriers to msg stack allocation #46967
Conversation
Fixes issue we have seen using nRF connect SDK 2.0.0 whereby too many messages causes a fault. Fault no longer appears with this fix |
This freaks me out a bit. Was log message management actually intended to be a lockless algorithm? That's... really, really hard to get right (sys_winstream was two weeks of work chasing "simple" mistakes, and that's a much smaller problem). Why can't we just put a spinlock around this? Also, note that ARM64 is OOO, so a compiler barrier is likely insufficient anyway and you probably want a DSB here (did we ever merge the memory barriers patch? I can't remember if there's an API for this yet or not). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, reading carefully, this is just a variable length array, not a lockless modification of global log state.
Let me pull in @carlocaione @stephanosio @npitre , I think what we're looking at here may be an arch bug. I'm not the arm64 expert, but I'm pretty sure the ABI (just like x86_64) includes a "red zone" below the stack pointer which leaf functions are allowed to use. Is it possible that something in interrupt entry/exit is forgetting that and clobbering the area under SP incorrectly before/after the switch to the interrupt stack?
Alternatively it's just a compiler bug, in which case adding a barrier is a fine workaround, but we should probably call that out in code and predicate it on whatever platforms are broken.
Drop a -1 on this, not to prevent merge but just to make sure we either get to a root cause or commit this as a workaround with an issue to make sure we find it.
@lairdjm is this #46986 bug fixed? If yes, @keith-packard can you add |
#46986 was found on cortex-m so it's not arm64 specific. |
I'd like to see a disassembly listing of the affected code, for both
with and without the barrier. If this is indeed a case of the stack area
being stumped by a badly timed IRQ that the barrier does fix then that
will be pretty obvious from the disassembly. Normally, any allocation on
the stack should see the stack pointer being decremented before any
access (possibly using another register) to that area. If sp is
decremented after the access then you'll have the smoking gun.
Such analysis should ideally be part of the commit log as well (e.g.
commit 1db5c8b).
|
I didn't actually look at the assembly code -- what I did was check the stack pointer while the data were being written into the reserved area and discovered that it was above the temporary storage, and that with the barriers in place, it was below. This seems like a compiler bug to me. That was just after setting a memory watchpoint on the first byte of the area and catching the trap handler writing over the address of the format string. |
I did originally think it did, but it did not fix it. It's a bit of a nuisance to reproduce, and there wasn't enough shaking when I tested with this, or it was harder to reproduce as I expanded the buffer sizes. |
Yes, I think it's just in how GCC adjust $sp when there is a variable length array in the code -- it doesn't move the stack pointer, it just computes the address of the storage and uses it. That's presumably because nothing in the section of code writing the package to the stack might use the stack (call another function, etc), so it reduces the code by a couple of instructions.
Yeah, looks like the ABI defines a 16-byte red zone below $sp. In this case, the temporary storage was larger than that and so outside of that zone.
Yeah, this does look like it's just a compiler bug (however, it seems to be a cross-architecture bug?). Adding suitable comments would be a good idea.
Sounds good. I'm not likely to keep working this issue; I'm clearing a backlog of bugs found when running all of the Zephyr tests using picolibc, and this was one of the last to get "resolved". I'd appreciate it if someone else could take over and drive this to conclusion. |
The build produces a zephyr.lst file alongside the executable.
Agreed. Documenting the issue with actual compiler output is desirable |
The differences are extensive and I wasn't able to actually identify the problem, even after separately compiling the relevant file with -S -fverbose-asm. Beats me what's up here; I'd encourage those with more expert knowledge to go digging. |
I'll try to get a diff of the assembly code but in general I think we should address the red-zone problem anyway. Currently the entry/exit exception handling code is not considering the red-zone at all, probably clobbering it. So we should alternatively:
|
@keith-packard which logging sample were you using exactly? |
It was samples/subsys/logging/syst/sample.logger.syst.deferred This was reliably reproducing a failure when built using picolibc. I'm attempting to reproduce it with the current tree; I'll push a tree when I get something that breaks and then is fixed with the patch. |
My suspicions are more about Can't it be simplified as follows so to tell the compiler our actual
|
ok, I've reproduced this using my arm64-stack-smash branch -- this builds the whole system with picolibc, which also happens to disable the -ffreestanding compiler option; that may have an effect here? |
My understanding is that -ffreestanding suppresses optimizations based on known libc functions. I guess it's possible that that's unmasking an issue where e.g. some alloca()-adjacent call (alloca itself can't be implemented as a function, obviously) emitted here is being forced to go into a libgcc or libc implementation where it wouldn't earlier (or vice versa), and one has a bug? Seems a little far-fetched, but not implausible. That's the kind of thing that would be obvious from the generated assembly though; I'll pull and see if I can find something. Do we have any idea which expression including this macro is thought to be at fault? Also: yeah, I tracked down the ARM64 SysV ABI docs at https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst and indeed there is no red zone. The stuff I was remembering was about the microsoft ARM ABI, which reserves 16 bytes apparently. (Doubly hilarious, because on x86 MS has no red zone and SysV is 128 bytes!) So the interrupt code is (probably/maybe) off the hook, I think. |
Ooph, Z_LOG_MSG2_STACK_CREATE (the only macro that expands the VLA code in question) gets expanded 24 different times in that sample. This may take a while. |
Yeah, I'm equally stumped. FWIW: I don't see any deltas at all when I apply your barrier patch to the branch you posted. The generated code is the same. And reading it (it's not so bad, all the logging is in a single log_msgs() function), I'm not seeing the stack pointer being modified at all except to push the frame at entry. It's very conventional. Which I guess makes sense: this is the syst test, which means that there's no string formatting being done, and I'm guessing the VLAs are actually all being optimized out? Is there a toolchain or platform interaction maybe? I'm using SDK 0.14.2 and building samples/subsys/logging/syst for qemu_cortex_a53. |
I've pushed two branches -- arm64-stack-smash-broken and arm64-stack-smash-working -- the only difference is the addition of the two compiler_barrier() "calls". The broken branch reliably fails and the working branch reliably passes. The generated code is dramatically different, and in particular I've found at least one place where it defers moving sp until after data have been stored in those locations. This is a fragment of a diff of the .lst files with the addresses removed (or the diff is harder to understand). - is broken, + is working:
And this is the log message which gets corrupted by the interrupt. It seems like a very narrow window to be reliably hitting this on every run. But it's pretty clearly broken. I suspect the arm64 gcc backend just doesn't reliably keep sp below the used area of the stack all the time; we're having to force the issue by using a barrier. I'm also using zephyr sdk version 0.14.2 and building for qemu_cortex_a53. Here's the command I use to run the test:
Not having -ffreestanding in effect is pretty clearly a good plan here -- calls to memset followed by sprintk are replaced with a single call to strcpy. But, I'm not sure that is causing this particular issue? |
Ah, OK. I'd missed that you were running the deferred subtest config and had just been building the default test. And confirmed, I see the same failures you do. In twister. Interestingly if I build and run it from west it seems fine, I got 14 successes in a row after 4 consecutive failures (look like timeouts -- partial output in the log) in twister. That also tends to point at interrupt/timing. I'm looking at the assembly here too, but still not quite seeing it. Your example is almost a smoking gun, but that snippet doesn't actually mutate SP. Can you back up a bit to see where the x19 value comes from? If it's something like "sub x19, [sp, #16]" then we have it. But as I look at this, I'm seeing lots and lots of places where frankly the generated code is just bad and tends to repeatedly assign SP in noop contexts, even though the optimizer does like to move the assignments around. I haven't seen something provable yet (e.g. that could be submitted to the gcc folks as a bug), but I believe it's there. Can we add a comment to the patch fingering a presumptive compiler bug and wrap the barrier in something like "ARM64_VLA_PROTECT()"? I'll take off my -1, clearly something is up. |
Actually I can't remember if |
This ensures that the compiler will have moved the stack pointer below the stack area where the message will be constructed. Otherwise, the message can be smashed by an interrupt handler while it is being built. This bug was found on qemu_cortex_a53 using SDK 0.14.2 with gcc 10.3.0 when building the samples/subsys/logging/syst/sample.logger.syst.deferred test under twister using picolibc: Signed-off-by: Keith Packard <[email protected]>
Also I need to formally apologize to the arm64 folks for fingering you first instead of the toolchain. |
e1fd086
to
5e2dc52
Compare
Sorry, pushed a new version with an inline comment in the code and a better commit message. |
Upstream gcc bug submitted at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106153 |
40001838: f8010ec0 str x0, [x22, #16]! ;<-- auto-decrement x22 to be 16 bytes under the stack Unless there is a missing minus sign that didn't reach me, the above is
So nothing wrong to me. |
Errr, what I meant is:
|
FYI, there is an early build for Zephyr SDK with GCC 12.1 here: It would be interesting to see how it behaves, and if the issue has already been fixed there. p.s. you may need to use the topic-sdk15 branch for testing. |
Yeah, I was backwards. Spend too long looking for something you know is there and you'll end up seeing it even if it's not. Will keep digging, something's gotta be in there. |
Also I'm going to be flying tomorrow for a week of family vacation (and actually in Ecuador and the Galapagos and surely incommunicado for the two after that). So it may be a bit before I'll have significant time to spend on this again. But anyone interested: you can check the "main.i" preprocessed file in the gcc bug as an easy exerciser if you want to scan assembly. My current thought is writing a hacky "stack interpreter" to track pointers into stack memory vs. the SP register. |
* stack pointer soon enough | ||
*/ | ||
|
||
#define Z_LOG_ARM64_VLA_PROTECT() compiler_barrier() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since its only for arm64 what about:
#define Z_LOG_ARM64_VLA_PROTECT() IF_ENABLED(CONFIG_ARM64, (copiler_barrier()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought #46986 was on arm32? Even if not, I still wonder if this is a problem in general GCC code or is specific to arm64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to #46967 (comment) it's not related and still reproduces after adding the barrier. If we suspect that it's affecting more platforms than im ok with keeping it for all platforms but then maybe remove ARM64
from the name and only mention in the comment that it has been seen on arm64 and other platforms may be affected as well.
Using
|
Thanks to @nashif we have at least solved a part of the mistery. The fact that this can be easily reproduced with The correct output of the test should be (at a certain point):
but on the broken branch the output is:
As you can see instead of |
Yea, that is because Note that chunk_p = ptr_to_chunk(allocated); This resolves to static inline chunk_t *
ptr_to_chunk(void * ptr)
{
return (chunk_t *) ((char *) ptr - MALLOC_HEAD);
} So dereferencing it in L761 is flagged by the static analysis: new_chunk_p->size = chunk_p->size - offset; @keith-packard since this code explicitly does an "unsafe access," disabling |
Nice catch on the twister interaction. Now I'm worried this is all going to be down to some kind of undefined behavior interaction and not technically a compiler bug at all. I was so sure we had one. |
Ah, sorry -- I should have reported more completely what the symptoms of failure were. The interrupt handler smashes the format string address so that when vfprintf does get called, the format string points into something which is definitely not a string. |
Sounds like the compiler is getting even deeper into code analysis which leading it astray in this case. Building a C library is getting pretty hard these days :-). Which compiler version is generating this warning? |
@keith-packard #46967 (comment) It was in GCC 12.1, but the same warning should be displayed in GCC 11.x as well. |
I run this code through gcc 11.3 for x86 testing without issue. I'll give 12 a try. |
Not sure if this is a smoking gun but I found some of the assembly code in the non-working case weird and it is against what @npitre said before:
back to the code: 40001810: 910003f3 mov x19, sp
; ...
; a lot of code where x19 is never touched
; ...
LOG_DBG("Debug message example, %f", 3.14159265359);
;
; at this point x19 is pointing to a location in memory past the SP:
;
; (gdb) p/x $sp
; $1 = 0x40060fe0
; (gdb) p/x $x19
; $2 = 0x40060fb0
;
400019a8: f0000040 adrp x0, 4000c000 <mmu_zephyr_ranges>
400019ac: 912bdc00 add x0, x0, #0xaf7
;
; with the next instruction we are storing the value onto the stack but using
; x19 instead of SP. The storage location is at lower addresses than SP:
;
; (gdb) p/x $x19+24
; $3 = 0x40060fc8
;
400019b0: f9000e60 str x0, [x19, #24]
400019b4: d0000040 adrp x0, 4000b000 <__rodata_region_start>
400019b8: fd475c00 ldr d0, [x0, #3768]
400019bc: d28000c0 mov x0, #0x6
;
; same as before. The next instruction is saving onto the stack using x19.
; SP is still behind.
;
400019c0: f9000a60 str x0, [x19, #16]
;
; we finally move SP to match X19
;
400019c4: 9100027f mov sp, x19
...
In the working case the store instruction using the x19 are never going to touch addresses past the SP, because SP and x19 are set to the same address at the very beginning. |
That's a good find indeed. However could you quote where sp and x19 start to diverge? Now you have:
Where is sp modified to not be equal to x19 anymore? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have time to deep dive into this ATM but the fix is actually working I'm fine with merging this as is.
This ensures that the compiler will have moved the stack pointer below the
stack area where the message will be constructed. Otherwise, the message
can be smashed by an interrupt handler while it is being built.
This was found on arm64 when running the logging sample using picolibc and
building with sdk version 0.14.2
Signed-off-by: Keith Packard [email protected]