-
-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix building with LLVM 5 and 6 so that the stack allocation workaround is no longer needed. #2371
Comments
When this ticket is resolved, we can mark LLVM >= 4 support as stable rather than experimental, and we can remove official support for all but the three most recent "patch levels" of the three most recent "minor versions" of LLVM. |
This change adds experimental support for LLVM 6.0.0 and removes support for 4.0.1. The change still includes the workaround in the `Heap2Stack` optimization pass that can cause performance degradation (see #2371).
This change adds experimental support for LLVM 6.0.0 and removes support for 4.0.1. The change still includes the workaround in the `Heap2Stack` optimization pass that can cause performance degradation (see ponylang#2371).
I can't reproduce this on LLVM 5.0.0. That is, removing the workaround does not cause segfaults. This comment suggests that the issue may have been fixed in Clang (not LLVM): |
On first glance that issue doesn't seem related, but we did have to update TBAA for LLVM 4, so who knows? FYI I have been doing some work on using LLVM 7 (https://github.com/kulibali/ponyc/tree/llvm700), but it's not done yet. I could use some help with the JIT implementation. |
To be clear, this is the workaround that we're talking about removing: ponyc/src/libponyc/codegen/genopt.cc Lines 305 to 313 in d446387
When removed, even on LLVM 7.0.1 we still segfault when running standard library tests that were compiled in release mode (that is, when ponyc was run with LLVM optimizations enabled). I've spent quite a bit of time troubleshooting this one. This is indeed an issue with the Here is a minimal program that demonstrates the issue, based on stripping away unnecessary pieces of a segfaulting class val _Field
let _inner: U32
new trn create(inner: U32 = 0) => _inner = inner
fun val clone(): _Field trn^ => create(_inner)
class val Counter
let _value: USize
let _field: _Field
new val create(v: USize = 0, f: _Field = _Field) =>
_value = v
_field = f
fun val inc(v: USize = 1): Counter =>
create(_value + v, _field.clone())
fun val value(): USize =>
_value
actor Main
new create(env: Env) =>
segfault(env)
fun tag segfault(env: Env) =>
var counter = Counter
var n: USize = 0
while (n = n + 1) < 300 do
counter = counter.inc()
@printf[I32]("%zu\n".cpointer(), counter.value())
end This program should print the numbers 1 through 300 in sequence, but instead prints the number 1 each time the loop is run. Some other very similar variants of this minimal program will segfault. When compiled, the pretty much all of the functions called from within the The job of the However, when the Here is the final erroneous code LLVM IR that gets produced after the LICM pass runs: define private fastcc nonnull %None* @Main_tag_segfault_oo(%Main* nocapture readnone dereferenceable(256) %this, %Env* noalias nocapture readonly dereferenceable(64) %env) unnamed_addr !dbg !150 !pony.abi !4 {
entry:
%0 = alloca [32 x i8], align 8
%1 = alloca [32 x i8], align 8
%.sub12 = getelementptr inbounds [32 x i8], [32 x i8]* %1, i64 0, i64 0
%.sub = getelementptr inbounds [32 x i8], [32 x i8]* %0, i64 0, i64 0
%2 = tail call i8* @pony_ctx()
%3 = tail call i8* @pony_alloc_small(i8* %2, i32 0)
%4 = bitcast i8* %3 to %_Field_Desc**
store %_Field_Desc* @_Field_Desc, %_Field_Desc** %4, align 8, !tbaa !160
%5 = getelementptr inbounds i8, i8* %3, i64 8, !dbg !176
%6 = bitcast i8* %5 to i32*, !dbg !176
store i32 0, i32* %6, align 4, !dbg !176, !tbaa !177
%7 = bitcast [32 x i8]* %1 to %Counter_Desc**
store %Counter_Desc* @Counter_Desc, %Counter_Desc** %7, align 8, !tbaa !160
%8 = getelementptr inbounds [32 x i8], [32 x i8]* %1, i64 0, i64 8, !dbg !196
%9 = bitcast i8* %8 to i64*, !dbg !196
store i64 0, i64* %9, align 8, !dbg !196, !tbaa !197, !noalias !200
%10 = getelementptr inbounds [32 x i8], [32 x i8]* %1, i64 0, i64 16, !dbg !203
%11 = bitcast i8* %10 to i8**, !dbg !203
store i8* %3, i8** %11, align 8, !dbg !203, !tbaa !197, !noalias !200
%12 = bitcast [32 x i8]* %0 to %Counter_Desc**
%13 = getelementptr inbounds [32 x i8], [32 x i8]* %0, i64 0, i64 8
%14 = bitcast i8* %13 to i64*
%15 = getelementptr inbounds [32 x i8], [32 x i8]* %0, i64 0, i64 16
%16 = bitcast i8* %15 to i8**
br label %while_body
while_body: ; preds = %while_body, %entry
%counter.0.in = phi i8* [ %.sub12, %entry ], [ %.sub, %while_body ]
%n.0 = phi i64 [ 1, %entry ], [ %31, %while_body ], !dbg !208
%17 = getelementptr inbounds i8, i8* %counter.0.in, i64 8, !dbg !219
%18 = bitcast i8* %17 to i64*, !dbg !219
%19 = load i64, i64* %18, align 8, !dbg !219, !tbaa !220, !alias.scope !222
%20 = add i64 %19, 1, !dbg !219
%21 = getelementptr inbounds i8, i8* %counter.0.in, i64 16
%22 = bitcast i8* %21 to %_Field**
%23 = load %_Field*, %_Field** %22, align 8, !tbaa !220, !alias.scope !222
%24 = getelementptr inbounds %_Field, %_Field* %23, i64 0, i32 1
%25 = load i32, i32* %24, align 4, !tbaa !231, !alias.scope !233, !noalias !222
%26 = tail call i8* @pony_alloc_small(i8* %2, i32 0) #2, !noalias !236
%27 = bitcast i8* %26 to %_Field_Desc**
store %_Field_Desc* @_Field_Desc, %_Field_Desc** %27, align 8, !tbaa !160, !noalias !236
%28 = getelementptr inbounds i8, i8* %26, i64 8, !dbg !240
%29 = bitcast i8* %28 to i32*, !dbg !240
store i32 %25, i32* %29, align 4, !dbg !240, !tbaa !177, !noalias !236
store i8* %26, i8** %16, align 8, !dbg !245, !tbaa !197, !noalias !246
%30 = tail call i32 (...) @printf(i8* getelementptr inbounds ([5 x i8], [5 x i8]* @0, i64 0, i64 0), i64 %20), !dbg !273
%31 = add nuw nsw i64 %n.0, 1, !dbg !274
%exitcond = icmp eq i64 %31, 301
br i1 %exitcond, label %while_post, label %while_body
while_post: ; preds = %while_body
%.lcssa = phi i64 [ %20, %while_body ]
store %Counter_Desc* @Counter_Desc, %Counter_Desc** %12, align 1, !tbaa !160, !noalias !222
store i64 %.lcssa, i64* %14, align 1, !dbg !275, !tbaa !197, !noalias !246
ret %None* @None_Inst, !dbg !205
} Note that the I spent some time trying to studying and stepping through the LICM code to see what it was missing, but couldn't really figure it out. Finally, I decided to try dumping the LLVM IR just before the LICM pass, then run it in LLVM's However, to my surprise, the same bug doesn't happen when I run the same LLVM IR through the LICM pass in the Here it is in
Here it is in the buggy optimizer (embedded in ponyc):
Note that the latter (buggy) case has So, something is causing the So, something in the ponyc execution is causing the alias analysis to be corrupted without invalidating it, or otherwise not be regenerated when it should be (or perhaps it's being incorrectly generated for some other unknown reason). Perhaps one of our custom passes is breaking the alias analysis without marking it as dirty? Once I figure out how to fix the alias analysis to run again/properly prior to the LICM pass, this should be fixed, I think. |
We updated the Type-Based Alias Analysis implementation for LLVM 4, might want to look at those changes. |
Thanks for the nudge, I'll look into it. I'm guessing that probably has something to do with it, because every other idea I've had about how alias analysis may be getting messed up hasn't panned out to be true so far. |
Yep, sure enough, I'm now able to reproduce the buggy behaviour when I include a Type-Based Alias Analysis pass in the So yeah, that implies something is probably buggy with our TBAA metadata, leading to false negatives in the alias analysis done by LLVM (or, less likely, buggy type-based alias analysis logic in LLVM). |
Here's a minimal LLVM IR repro that acts like Pony code does and shows the problem in action (I've removed most of the generated code to leave only the ; ModuleID = 'test.jemc'
@0 = private unnamed_addr constant [5 x i8] c"%ld\0A\00"
declare i32 @printf(...) local_unnamed_addr
define i32 @main() unnamed_addr {
entry:
%counter_a = alloca [32 x i8], align 8
%counter_b = alloca [32 x i8], align 8
%counter_a.gep0 = getelementptr inbounds [32 x i8], [32 x i8]* %counter_a, i64 0, i64 0
%counter_b.gep0 = getelementptr inbounds [32 x i8], [32 x i8]* %counter_b, i64 0, i64 0
%counter_a.value = getelementptr inbounds [32 x i8], [32 x i8]* %counter_a, i64 0, i64 8
%counter_a.value.cast = bitcast i8* %counter_a.value to i64*
store i64 0, i64* %counter_a.value.cast, align 8, !tbaa !197
%counter_b.value = getelementptr inbounds [32 x i8], [32 x i8]* %counter_b, i64 0, i64 8
%counter_b.value.cast = bitcast i8* %counter_b.value to i64*
br label %loop_body
loop_body: ; preds = %loop_body, %entry
; In the first iteration, set 'n = 1'
; In subsequent iterations, retain 'n.plus1' as 'n'
%n = phi i64 [ 1, %entry ], [ %n.plus1, %loop_body ]
; In the first iteration, set 'counter = counter_a';
; In subsequent iterations, retain 'counter_b' as 'counter'
%counter = phi i8* [ %counter_a.gep0, %entry ], [ %counter_b.gep0, %loop_body ]
; Extract the current 'counter.value', add 1, and store into 'counter_b.value'
%counter.value = getelementptr inbounds i8, i8* %counter, i64 8
%counter.value.cast = bitcast i8* %counter.value to i64*
%counter.value.load = load i64, i64* %counter.value.cast, align 8, !tbaa !220
%counter.value.plus1 = add i64 %counter.value.load, 1
store i64 %counter.value.plus1, i64* %counter_b.value.cast, align 8, !tbaa !197
; Print 'counter.value.plus1'
%printf.retval = tail call i32 (...) @printf(i8* getelementptr inbounds ([5 x i8], [5 x i8]* @0, i64 0, i64 0), i64 %counter.value.plus1)
; Increment 'n' and exit the loop if 'n >= 5'
%n.plus1 = add nuw nsw i64 %n, 1
%loop.condition = icmp eq i64 %n.plus1, 5 ; stop after 4 iterations
br i1 %loop.condition, label %after_loop, label %loop_body
after_loop: ; preds = %loop_body
ret i32 0
}
!0 = !{!"Pony TBAA"}
!197 = !{!198, !198, i32 0}
!198 = !{!"Counter_ref", !199}
!199 = !{!"Counter_box", !0}
!220 = !{!221, !221, i32 0}
!221 = !{!"Counter_val", !0} You can run the example with an invocation like: ; ModuleID = 'test/Transforms/LICM/jemc.ll'
source_filename = "test/Transforms/LICM/jemc.ll"
@0 = private unnamed_addr constant [5 x i8] c"%ld\0A\00"
declare i32 @printf(...) local_unnamed_addr
define i32 @main() unnamed_addr {
entry:
%counter_a = alloca [32 x i8], align 8
%counter_b = alloca [32 x i8], align 8
%counter_a.gep0 = getelementptr inbounds [32 x i8], [32 x i8]* %counter_a, i64 0, i64 0
%counter_b.gep0 = getelementptr inbounds [32 x i8], [32 x i8]* %counter_b, i64 0, i64 0
%counter_a.value = getelementptr inbounds [32 x i8], [32 x i8]* %counter_a, i64 0, i64 8
%counter_a.value.cast = bitcast i8* %counter_a.value to i64*
store i64 0, i64* %counter_a.value.cast, align 8, !tbaa !0
%counter_b.value = getelementptr inbounds [32 x i8], [32 x i8]* %counter_b, i64 0, i64 8
%counter_b.value.cast = bitcast i8* %counter_b.value to i64*
br label %loop_body
loop_body: ; preds = %loop_body, %entry
%n = phi i64 [ 1, %entry ], [ %n.plus1, %loop_body ]
%counter = phi i8* [ %counter_a.gep0, %entry ], [ %counter_b.gep0, %loop_body ]
%counter.value = getelementptr inbounds i8, i8* %counter, i64 8
%counter.value.cast = bitcast i8* %counter.value to i64*
%counter.value.load = load i64, i64* %counter.value.cast, align 8, !tbaa !4
%counter.value.plus1 = add i64 %counter.value.load, 1
%printf.retval = tail call i32 (...) @printf(i8* getelementptr inbounds ([5 x i8], [5 x i8]* @0, i64 0, i64 0), i64 %counter.value.plus1)
%n.plus1 = add nuw nsw i64 %n, 1
%loop.condition = icmp eq i64 %n.plus1, 5
br i1 %loop.condition, label %after_loop, label %loop_body
after_loop: ; preds = %loop_body
%counter.value.plus1.lcssa = phi i64 [ %counter.value.plus1, %loop_body ]
store i64 %counter.value.plus1.lcssa, i64* %counter_b.value.cast, align 1, !tbaa !0
ret i32 0
}
!0 = !{!1, !1, i32 0}
!1 = !{!"Counter_ref", !2}
!2 = !{!"Counter_box", !3}
!3 = !{!"Pony TBAA"}
!4 = !{!5, !5, i32 0}
!5 = !{!"Counter_val", !3}
Note that when run with |
Interestingly... running So the same TBAA format we use now for LLVM >= 4 worked fine in LLVM 3. So either:
|
I'm leaning toward the "our TBAA format has always been broken" option - it seems to me that including the rcap in the TBAA is a mistake that will cause problems in the presence of recovery, since caps may be lifted and in so doing break the subtype aliasing rules that TBAA depends on, right? In this case it's not Note that this example would work fine if all three TBAA tree nodes ( Also, there is an error in the current TBAA tree structure - Marking as "needs discussion during sync" so we can make a decision about whether to move forward with my proposed fix or take some other path. |
I tried the approach outlined above (removing caps from TBAA types), and it fixes the minimal LLVM IR repro I outlined above, but when I compile the (less minimal) Pony repro it still has a related issue where the alias isn't getting detected and the Frustratingly, when I copy the full LLVM IR for that Pony repro out, and feed it into I'm not going to do this right now. I'll wait til the LLVM 7 PR is merged, then do a separate PR to stop using the legacy pass builder, then come back to this issue if it's still a problem. |
@sylvanc confirmed via Slack that he agrees that this seems to be unsound. It can be removed as part of the fix for this issue, though as I mentioned in my comment above, there is more to be done as well. |
Using the new PassBuilder is impractical for LLVM 3.9, but it will pave the way to fixing some other problems with LLVM > 4, including #2371.
To recap, the current state of this is:
|
New development in the saga - the optimized LLVM attributes for
It's flawed because LLVM can optimize across time in cases such as the LICM pass. |
Nice find |
Code generated by a
ponyc
built with LLVM 4 and up will often segfault on *nix (see #2303, #2061, #1592). The change in LLVM that broke things is believed to be 290738:#2303 implements a workaround (in the
canStackAlloc()
function ingenopt.cc
) where in ourHeapToStack
optimization pass we do not stack allocate values that are loaded in a function. But this incurs a performance hit, so we need to figure out the real problem.The text was updated successfully, but these errors were encountered: