Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llvmcall test failure with LLVM 3.6.0 on Win64 #10394

Closed
tkelman opened this issue Mar 3, 2015 · 63 comments
Closed

llvmcall test failure with LLVM 3.6.0 on Win64 #10394

tkelman opened this issue Mar 3, 2015 · 63 comments
Labels
system:windows Affects only Windows

Comments

@tkelman
Copy link
Contributor

tkelman commented Mar 3, 2015

See https://gist.github.com/tkelman/adddb1018adabbb6711b

Mostly lapack / ccall seems broken. The core failure in jl_method_table_assoc_exact is also worrying. Dict and hashing just die with no error.

On win32 things are reasonable if you run the tests one at a time (hits #10377 but otherwise ok), though trying to run all of the tests in the same process results in this:

    JULIA test/all
     * linalg1              in 109.44 seconds
     * linalg2              in 120.57 seconds
     * linalg3              in  34.47 seconds
     * linalg4              in  57.44 seconds
     * linalg/lapack        in   7.78 seconds
     * linalg/triangular    in 1033.00 seconds
     * linalg/tridiag       in  23.88 seconds
     * linalg/pinv          in  27.93 seconds
     * linalg/givens        in  19.15 seconds
     * linalg/cholesky      in  59.27 seconds
     * linalg/lu            in   0.94 seconds
     * linalg/arnoldi       in  19.38 seconds
     * core                 in 248.43 seconds
     * keywordargs          in  38.54 seconds
     * numbers              in 992.47 seconds
     * strings              in 401.38 seconds
     * dates                in 1106.73 seconds
     * dict                 in 155.75 seconds
     * hashing
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Makefile:9: recipe for target 'all' failed
make[1]: *** [all] Error 3
Makefile:496: recipe for target 'testall1' failed
make: *** [testall1] Error 2
@tkelman tkelman added the system:windows Affects only Windows label Mar 3, 2015
@vtjnash vtjnash mentioned this issue Mar 3, 2015
19 tasks
@tkelman
Copy link
Contributor Author

tkelman commented Mar 6, 2015

Oh boy, julia-debug.exe segfaults immediately in either win32 or win64. https://gist.github.com/3b9ddfa24b74b53814c9

edit: as of efb878e this no longer happens

@tkelman
Copy link
Contributor Author

tkelman commented Mar 18, 2015

With LLVM assertions enabled, we get

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
Assertion failed!

Program: D:\cygwin64\home\Tony\julia\usr\bin\julia-debug.exe
File: /home/Tony/julia/deps/llvm-3.6.0/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp, Line 104

Expression: GV.isGlobalVariable()

also, updated the original gist with a backtrace from the core segfault in jl_method_table_assoc_exact

@tkelman
Copy link
Contributor Author

tkelman commented Mar 18, 2015

I think most of the issue is something about ccall'ing into Fortran libraries, since openspecfun has trouble too. llvmcall fails an assertion:

     * llvmcall
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
Assertion failed!

Program: D:\cygwin64\home\Tony\julia\usr\bin\julia.exe
File: debuginfo.cpp, Line 340

Expression: SectionAddrCheck == SectionAddr

ccall segfaults https://gist.github.com/tkelman/697c8f73f072cfdecc9f
and socket segfaults https://gist.github.com/tkelman/084003af7f8f7c61ee4c
(these were built at 2a4fd44)

@ihnorton
Copy link
Member

While debugging #10638 I noticed the following: if I run the test lines on a local build against LLVM3.6, I get thousands of lines of On entry to DGEMV parameter number 1 had an illegal value. I believe this to be incidental (although it does confirm the reporter's observation that the segfault is directly related to the return value creation which does matrix multiplication). However, some googling suggests that this error may be related to calling convention (and bitness) issues. If so, that could explain some of the failures seen in this issue.

@tkelman
Copy link
Contributor Author

tkelman commented Mar 29, 2015

I'm fairly confident the Fortran by-reference calling convention in ccall is busted with LLVM 3.6 for some reason, since almost anything lapack or openspecfun or arpack related isn't working.

@ViralBShah
Copy link
Member

Cc: @vtjnash

@vtjnash
Copy link
Member

vtjnash commented Mar 30, 2015

It's possible/likely that llvm3.6 is using a different windows calling convention than previous versions of llvm. usually that has only affected C++11 code, but sometimes it crosses over to the C ABI.

With LLVM assertions enabled, we get ...

I've been struggling with that one since the new GC merged. I think Keno finally found the fix for it in da3d7d5

@ihnorton
Copy link
Member

I've tested this with all the calling conventions but haven't seen any change. It's possible I am setting things up wrong, but the error is in parameter 1 which should be the simplest.

@vtjnash
Copy link
Member

vtjnash commented Mar 30, 2015

if you're feeling particularly ambitious, it would be informative to single step by instructions through the dgemv64_ function until it calls xerbla, and see what value it loads for the first parameter. it seems odd that this would be failing with llvm3.6, since it doesn't seem like that much should have changed.

@ihnorton
Copy link
Member

Well, looking at the registers in an LLVM33 build, it gets the first argument in RCX (which is correctly a pointer to a char array containing N). With LLVM36 that pointer is garbage.

@vtjnash
Copy link
Member

vtjnash commented Mar 30, 2015

what does it point to? it should be a dynamically-allocated stack box, but perhaps it is declared in the wrong scope so llvm is deallocating it early?

@ihnorton
Copy link
Member

Nevermind, my function was still stdcall. The pointer is good but the data is \001. Most of the registers have the same (useless) pointer:

(gdb) info reg
rax            0x6d7cf320       1836905248
rbx            0xd5f250 14021200
rcx            0xd5f250 14021200
rdx            0xd5f250 14021200
rsi            0x1      1
rdi            0xd5f250 14021200
rbp            0xd5f290 0xd5f290
rsp            0xd5f110 0xd5f110
r8             0xd5f250 14021200
r9             0xd5f250 14021200
r10            0x82f2add0       2196942288
r11            0x2      2
r12            0xd5f250 14021200
r13            0x8311ee30       2198990384
r14            0x82f2ae00       2196942336
r15            0x830ee520       2198791456
rip            0x6d7cf33b       0x6d7cf33b <dgemv_+27>

@ihnorton
Copy link
Member

I checked the IR generated by clang (svn) for the same call, and it is basically the same, so this doesn't seem like our bug. Tonight I'll try turning off various passes, and also disabling copy propagation (#9199).

@ihnorton
Copy link
Member

Looks like this is just a deprecation issue with &. Switching to Ref{} fixes the problem in my test case.

(At some point we will need to change all the Fortran wrappers to use the new syntax, but that is orthogonal. We do need the deprecation to work for now).

@vtjnash
Copy link
Member

vtjnash commented Mar 31, 2015

those go through very different code paths, so it still doesn't point to what part of the llvm bytecode was emitted incorrectly

@ihnorton
Copy link
Member

ihnorton commented Apr 1, 2015

Here are several reduced examples (...compared to the 1000 lines of IR for the real dgemv!).

This is what we emit with 3.3.

Here are & and Ref examples with 3.6.

Here is a stand-alone version that is runnable with lli. See line 57: if that is commented out then the register is left alone and we get a pointer to char* 'N' in RCX as expected. (or alternatively if I comment out these lines and add a dummy %8 variable).

@vtjnash
Copy link
Member

vtjnash commented Apr 1, 2015

It looks like an llvm bug then? llvm comes with a bugpoint tool for isolating errors in lowering code. Alternatively, differencing the assembly generated by the good and bad versions of the llvm bytecode can help with identifying what went wrong. Since llvm36 added a custom lowing for the win64 alloca instruction (that turns it into a function call), I'm going to guess it might be not preserving this register correctly across that function

@ihnorton
Copy link
Member

ihnorton commented Apr 1, 2015

Yes, unless you or @Keno have a chance to look, and see anything obviously
wrong with the IR linked from the last paragraph. It seems to me that we
are emitting redundant allocas, so maybe the alloca lowering thinks it can
combine something that shouldn't be. I don't have a llvm-svn build right
now but I'm going to double-check that, and work through the assembly next
while it is compiling (sigh, should have gotten a bigger SSD).

On Wed, Apr 1, 2015 at 5:56 PM, Jameson Nash [email protected]
wrote:

It looks like an llvm bug then? llvm comes with a bugpoint tool for
isolating errors in lowering code. Alternatively, differencing the assembly
generated by the good and bad versions of the llvm bytecode can help with
identifying what went wrong. Since llvm36 added a custom lowing for the
win64 alloca instruction (that turns it into a function call), I'm going to
guess it might be not preserving this register correctly across that
function


Reply to this email directly or view it on GitHub
#10394 (comment).

@vtjnash
Copy link
Member

vtjnash commented Apr 1, 2015

Since you have the llvm bytecode now, you can run llc on any machine to investigate the resulting assembly

@vtjnash
Copy link
Member

vtjnash commented Apr 1, 2015

The redundant alloca might be an artifact of not having marked the arguments as isSA in the initial codegen assignments I don't think that should matter here though?

@ihnorton
Copy link
Member

ihnorton commented Apr 2, 2015

Thanks, I do have llc. I just want to check llvm-svn before filing a bug.

Is there that much difference in codegen paths between 3.5 and 3.6 (on our side)? @tkelman do you have an LLVM3.5 build -- if so, could you check the IR for this function?

@vtjnash
Copy link
Member

vtjnash commented Apr 2, 2015

the ___chkstk call in that assembly reallly shouldn't be there at all in that function's assembly. however, at least on llvm-svn it calls __chkstk_ms

@ihnorton
Copy link
Member

ihnorton commented Apr 2, 2015

Here is the code_native too (per irc).

@tkelman
Copy link
Contributor Author

tkelman commented Apr 2, 2015

I can't get a build using 3.5.1 to finish bootstrapping on current master, unknown function access violation in linalg1?

@vtjnash
Copy link
Member

vtjnash commented Apr 2, 2015

we should cherry-pick the following commit onto our llvm-3.6.0, otherwise the MCJIT will remain DOA for win64 (#9339):

commit c9fbc97e9553b257d35ba4507a478bc8fc9459fb
Author: Reid Kleckner <[email protected]>
Date:   Thu Jan 29 23:58:04 2015 +0000

    x86: Fix large model calls to __chkstk for dynamic allocas

    In the large code model, we now put __chkstk in %r11 before calling it.

    Refactor the code so that we only do this once. Simplify things by using
    __chkstk_ms instead of __chkstk on cygming. We already use that symbol
    in the prolog emission, and it simplifies our logic.

    Second half of PR18582.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227519 91177308-0d34-0410-b5e6-96231b3b80d8

@tkelman
Copy link
Contributor Author

tkelman commented Apr 2, 2015

We should be able to pull that in as a patch file, unless we're also planning on relying on a more substantially modified branch soon.

@tkelman
Copy link
Contributor Author

tkelman commented Apr 3, 2015

Not sure what was wrong yesterday, works better with 3.5.1 today.

julia> versioninfo()
Julia Version 0.4.0-dev+4142
Commit 8e32963* (2015-04-02 19:17 UTC)
DEBUG build
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.5.1

julia> n = 'N'; i = 1;

julia>  bf(n,i) = ccall((:mydgemv, "libt.dll"),
                     Void,
                     (Ptr{Cchar},Ptr{Int}), &n, &i)
bf (generic function with 1 method)

julia> bf(n,i)
N

julia> @code_llvm bf(n,i)

; Function Attrs: sspreq uwtable
define void @julia_bf_1314(i32, i64) #0 {
top:
  %sext = shl i32 %0, 24, !dbg !11
  %2 = ashr exact i32 %sext, 24, !dbg !11
  %3 = icmp eq i32 %2, %0, !dbg !11
  br i1 %3, label %pass, label %fail, !dbg !11

fail:                                             ; preds = %top
  %4 = load %jl_value_t** @jl_inexact_exception, align 8, !dbg !11, !tbaa %jtbaa
_const
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %4, i32 1), !dbg !1
1
  unreachable, !dbg !11

pass:                                             ; preds = %top
  %5 = trunc i32 %0 to i8, !dbg !11
  %6 = alloca i8, align 1, !dbg !11
  store i8 %5, i8* %6, align 1, !dbg !11
  %7 = alloca i64, align 8, !dbg !11
  store i64 %1, i64* %7, align 8, !dbg !11
  call void inttoptr (i64 1825313888 to void (i8*, i64*)*)(i8* %6, i64* %7), !db
g !11
  ret void, !dbg !11
}

@ihnorton
Copy link
Member

Looks like it's still a calling convention issue; 3.6 is putting args to pow in the wrong registers. assembly comparisons. I'm waiting for my llvm-svn update to finish building to see if there is any improvement.

@vtjnash
Copy link
Member

vtjnash commented Apr 15, 2015

wow, it's my second copyprop bug of the day!

julia/src/codegen.cpp

Lines 5607 to 5609 in d2ee85d

#if defined(LLVM35) && defined(_OS_WINDOWS_) && !defined(LLVM36)
,"-disable-copyprop" // llvm bug 21743
#endif
should be applied to all versions of llvm (tested, but not entirely confirmed, on LLVM3.3 and LLVM3.6 today on win64)

@ihnorton
Copy link
Member

😭

@vtjnash
Copy link
Member

vtjnash commented Apr 15, 2015

cross link #9199

@ihnorton
Copy link
Member

How did you test that? I get '-disable-copyprop' is not a recognized feature for this target (ignoring feature)

@vtjnash
Copy link
Member

vtjnash commented Apr 15, 2015

in various indirect ways. i may have missed some combination of julia and llvm versions?

@ihnorton
Copy link
Member

Sorry, I just saw your comments on the gist (because github). Replied there (yes, they were swapped. too many consoles).

@ihnorton
Copy link
Member

I tested with MAttrs.push_back("-disable-copyprop") outside of that ifdef block (and I know it was actually applied because of the error message). I didn't do a full rebuild though, just src/, not sure if that makes a difference due to the cached image? (rebuilding now grr. have to rebuild ~everything because haven't done so since the usr-staging change. will look in the morning)

@ihnorton
Copy link
Member

Looking better now: llvmcall and linalg2 both pass. Now I get a weird failure in the markdown test, not sure if related:

$ make markdown
    JULIA test/markdown
     * markdown            exception on 1: ERROR: LoadError: test failed: (Base.Markdown.MD(Any[Base.Markdown.Paragraph(Any["#"]),Base.Markdown.Paragraph(Any["empty"])],Dict{Any,Any}(:config=>Base.Markdown.Config([hashheader,fencedcode,blockquote],[blocktex,blockinterp,list,indentcode,github_table,horizontalrule,setextheader,paragraph],Dict('['=>[link],'\\'=>[linebreak,escapes],'*'=>[asterisk_bold,asterisk_italic],'$'=>[tex,interp],'`'=>[inline_code],'!'=>[image],'-'=>[en_dash])))) == Base.Markdown.MD(Any[Base.Markdown.Header{1}(""),Base.Markdown.Paragraph("empty")],Dict{Any,Any}()))
 in expression: @md_str $(Expr(:triple_quoted_string, "\r\n  #\r\n  empty\r\n  ")) == MD(Header{1}(""),Paragraph("empty"))
 in error at .\error.jl:19
 in default_handler at .\test.jl:27
 in do_test at .\test.jl:50
 in runtests at C:\dev\jl36-cyg\test\testdefs.jl:78
 in anonymous at .\multi.jl:626
 in run_work_thunk at .\multi.jl:587
 in remotecall_fetch at .\multi.jl:660
 in remotecall_fetch at .\multi.jl:675
 in anonymous at .\task.jl:1386
while loading markdown.jl, in expression starting on line 14
ERROR: LoadError: LoadError: test failed: (Base.Markdown.MD(Any[Base.Markdown.Paragraph(Any["#"]),Base.Markdown.Paragraph(Any["empty"])],Dict{Any,Any}(:config=>Base.Markdown.Config([hashheader,fencedcode,blockquote],[blocktex,blockinterp,list,indentcode,github_table,horizontalrule,setextheader,paragraph],Dict('['=>[link],'\\'=>[linebreak,escapes],'*'=>[asterisk_bold,asterisk_italic],'$'=>[tex,interp],'`'=>[inline_code],'!'=>[image],'-'=>[en_dash])))) == Base.Markdown.MD(Any[Base.Markdown.Header{1}(""),Base.Markdown.Paragraph("empty")],Dict{Any,Any}()))
 in expression: @md_str $(Expr(:triple_quoted_string, "\r\n  #\r\n  empty\r\n  ")) == MD(Header{1}(""),Paragraph("empty"))
 in error at .\error.jl:19
 in default_handler at .\test.jl:27
 in do_test at .\test.jl:50
 in runtests at C:\dev\jl36-cyg\test\testdefs.jl:78
 in anonymous at .\multi.jl:626
 in run_work_thunk at .\multi.jl:587
 in remotecall_fetch at .\multi.jl:660
 in remotecall_fetch at .\multi.jl:675
 in anonymous at .\task.jl:1386
while loading markdown.jl, in expression starting on line 14
while loading C:\dev\jl36-cyg\test\runtests.jl, in expression starting on line 3

Makefile:9: recipe for target 'markdown' failed
make: *** [markdown] Error 1

@ihnorton
Copy link
Member

Ok, false alarm. Seems that the error was caused by having windows line endings in test/markdown.jl. I'm not quite sure how that happened, because I never touched the file, git status was clean before (I think) and is still clean now after running dos2unix test/markdown.jl. But now the test passes.

@ihnorton
Copy link
Member

Current status: sparse and parallel failed, but they both pass when run individually.

@vtjnash
Copy link
Member

vtjnash commented Apr 15, 2015

parallel failed because sparse had previously failed (and killed the worker). so it looks like just sparse had some alignment or corruption issue.

@tkelman
Copy link
Contributor Author

tkelman commented Apr 15, 2015

the sparse failure looks quite similar to what I saw above #10394 (comment), something in umfpack

@tkelman
Copy link
Contributor Author

tkelman commented Apr 16, 2015

Note that I still see the same assertion failure in llvmcall when I have MARCH = x86-64

@tkelman
Copy link
Contributor Author

tkelman commented Apr 16, 2015

Was having trouble getting a breakpoint to work at the assertion failure, so resorted to the trusty printf debugger. With this patch

diff --git a/src/debuginfo.cpp b/src/debuginfo.cpp
index 3b7465a..c443ef4 100644
--- a/src/debuginfo.cpp
+++ b/src/debuginfo.cpp
@@ -251,8 +251,9 @@ public:
@@ -251,8 +251,9 @@ public:
                 UnwindData = (uint8_t*)Addr;
                 if (SectionAddrCheck)
                     assert(SectionAddrCheck == SectionAddr);
-                else
+                else {
                     SectionAddrCheck = SectionAddr;
+                    printf("Section %s has SectionAddr = %lld\n", sName.data(), SectionAddr); }
             }
             if (sName.equals("__catchjmp")) {
                 sym_iter.getAddress(Addr);
@@ -269,10 +270,12 @@ public:
                 catchjmp = (uint8_t*)Addr;
                 if (SectionAddrCheck)
                     assert(SectionAddrCheck == SectionAddr);
-                else
+                else {
                     SectionAddrCheck = SectionAddr;
+                    printf("Section %s has SectionAddr = %lld\n", sName.data(), SectionAddr); }
             }
         }
+        // should we reset SectionAddrCheck = 0; after this loop?
         assert(catchjmp);
         assert(UnwindData);
         catchjmp[0] = 0x48;
@@ -333,11 +336,15 @@ public:
             Section->getSize(SectionSize);
 #endif
             sym_iter.getName(sName);
+            printf("Section %s has SectionAddr = %lld\n", sName.data(), SectionAddr);
 #ifdef _CPU_X86_
             if (sName[0] == '_') sName = sName.substr(1);
 #endif
-            if (SectionAddrCheck)
-                assert(SectionAddrCheck == SectionAddr);
+            if (SectionAddrCheck) {
+                if (SectionAddrCheck != SectionAddr) {
+                    printf("SectionAddrCheck = %lld\n", SectionAddrCheck);
+                    printf("SectionAddr = %lld\n", SectionAddr); }
+                assert(SectionAddrCheck == SectionAddr); }
             else
                 SectionAddrCheck = SectionAddr;
             create_PRUNTIME_FUNCTION(

I get the following output

Section .text has SectionAddr = 641925120
Section jlcall_add1234_1203 has SectionAddr = 641925120
Section julia_add1234_1203 has SectionAddr = 641925120
Section julia_add1234_12031 has SectionAddr = 641925504
SectionAddrCheck = 641925120
SectionAddr = 641925504
A s s e r t i o n   f a i l e d !

 P r o g r a m :   D : \ c y g w i n 6 4 \ h o m e \ T o n y \ j u l i a \ u s r
 \ b i n \ j u l i a - d e b u g . e x e
 F i l e :   d e b u g i n f o . c p p ,   L i n e   3 4 7

 E x p r e s s i o n :   S e c t i o n A d d r C h e c k   = =   S e c t i o n A
 d d r

@vtjnash
Copy link
Member

vtjnash commented Apr 16, 2015

it looks like Comdat is enabled, but that shouldn't be true unless imaging_mode==1. it was expected that all sections would be named .text (which is really what this assert was attempting to verify).
edit: scratch that, i see that the printed text is just incorrect, and it is printing the symbol name not the section name. (the net result was still roughly the same however, since that last function got emitted into the LinkOnceODRLinkage section, while the rest did go into .text)

but from this, it appears that there are some bugs in the way llvmcall emits its extra function:
a) it does not generate a guaranteed unique name: it should use a namespace that does not overlap with julia functions (very low probability of issue from this, however)
b) it should delete the function after inlining to ensure it doesn't get JIT compiled

@tkelman
Copy link
Contributor Author

tkelman commented Apr 16, 2015

Nice, that does fix the llvmcall assertion failure. The umfpack problem might be worth opening a new issue for. Or repurposing this one.

@vtjnash
Copy link
Member

vtjnash commented Apr 16, 2015

let's go the new issue route. with 59 comments, this one is getting a bit long to keep track of. and i think it should now be possible to open narrow issues

@ihnorton
Copy link
Member

👏

@vtjnash
Copy link
Member

vtjnash commented May 1, 2015

bump. can you open the new issues for remaining bugs and cross link them here for reference? I think the only remaining one was the linalg/arnoldi when running all linalg tests on one process?

@tkelman
Copy link
Contributor Author

tkelman commented May 1, 2015

Sure. Also the bad_alloc in hashing on win32 still happens last I checked.

@tkelman
Copy link
Contributor Author

tkelman commented May 1, 2015

#11083
#11085

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
system:windows Affects only Windows
Projects
None yet
Development

No branches or pull requests

4 participants