Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Threading without using patched LLVM #14083

Merged
merged 6 commits into from
Nov 23, 2015
Merged

Threading without using patched LLVM #14083

merged 6 commits into from
Nov 23, 2015

Conversation

yuyichao
Copy link
Contributor

This tries to implement what @vtjnash suggested for making threading working without LLVM patch. This should also make it possible to support threading on windows since (I was told that) the LLVM tls patch doesn't support emitting TLS on windows.

A few notes and TODOs (all done):

  1. The first few commits in this PR are not directly related to TLS, they are either slightly related
    clean-up's or fix of the bug discovered when testing this PR.

    (merged as Preserve function attributes in prepare_call (plus clean up) #14084)
  2. This is currently GCC/Clang only due to the __attribute__((const)) but it is only for performance
    and not for correctness so can easily be disabled for other compilers (MSVC) or replaced with
    equivalent attributes.
    (fixed, MSVC doesn't/won't support it, use __declspec(noalias) on MSVC)
  3. This placed all TLS variables in a big struct. Mainly because it's easier to do... There's no #ifdef's
    in the struct in order to keep it clean and to make the threading and non-threading versions ABI
    compatible (to certain degree anyway...). This also effectively export all the TLS variables but we
    could easily fix this using a nested struct (and only exposing the pointer to the inner, exported
    struct) if it is a concern. (FWIW, this is C and there's way more things you can do to break stuff
    other than accessing a non-exported TLS variable....)
  4. Compilation passes, REPL works and all tests passes for both threading and non-threading on
    LLVM37. (The compilation of inference0.ji with threading on had one SegFault that I couldn't
    reproduce).
    (reported as Race condition in threads initialization. #14091)
  5. Codegen is always accessing the TLS variables through a (const) function call. I get a freeze
    before enterning the REPL if I let it use the global variable directly for non-threading version.
    (probably because I didn't declare the struct correctly for LLVM).
    (fixed). OTOH, with the
    function attributes set (copied from Clang), llvm is able to emit only on call to jl_get_ptls_states()
    per function.
  6. TLS model is still global-dynamic. Using a dynamic model by default. Override to a static model in repl. The override has to be done at initialization time and can be done for embedding as well if needed.

c.c @tkelman @Keno

edit: closes #13975

@tkelman
Copy link
Contributor

tkelman commented Nov 21, 2015

Lovely, thank you for this. This will close #13975. I wonder whether this would allow us to turn threading on by default before we move to newer LLVM? First time I tried that on win64 it segfaulted in bootstrap, though will try another few times to see if it's the same problem as #14091.

@yuyichao
Copy link
Contributor Author

This will close #13975

I suspect #13975 is the same with #14091 (since it segfault before the LLVM error)

@yuyichao yuyichao force-pushed the yyc/tls-no-patch branch 2 times, most recently from 2ecc7e5 to b2339b8 Compare November 22, 2015 05:27
@yuyichao yuyichao changed the title WIP: threading without using patched LLVM Threading without using patched LLVM Nov 22, 2015
@yuyichao
Copy link
Contributor Author

Squashed and all the TODO's in this PR are done now! Remove WIP from the title and this PR is ready for review.

This essentially implements a runtime weak symbol. It would be nice if the dynamic linker can do this for us (I tried __attribute__((weak)) but couldn't get it working).

Getting the address of the TLS variable is not as fast as (~30-60% slower) the optimal case (a single function returning a static TLS address) but is also much faster (2x-3x) than using the global dynamic model. Note that this is also only relavant for code in the sys.so, libjulia.so and julia-ui (basically any binary code we save to/load from disk), JIT code use the actually getter function address directly. The impact on the overall performance will also be much smaller since we only emit TLS variables in expensive functions (gcframe, catch exceptions).

The advantage of this approach is that it should be able to support all platforms (non-linux have a lot of issues with unresolved symbols in a shared library) and doesn't require another shim for embedding.

@@ -5640,6 +5625,25 @@ static void init_julia_llvm_env(Module *m)
"jl_float_temp"));
#endif

#ifndef JULIA_ENABLE_THREADING
jltls_states_var =
new GlobalVariable(*m, ArrayType::get(T_int8, sizeof(jl_tls_states_t)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be declared as an ArrayType of T_pint8 so that alignment is correct and to avoid the BitCastOp on most usages

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had that at first but was a little bit paranoid about the size I should supply since jl_tls_states_t has a int16_t and is not exactly multiple of pointer size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, although although sizeof must be a multiple of __alignof in C, so the over-allocation is guaranteed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I'll rearange the members a little and declare it as a pointer array to LLVM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. You are right that sizeof(jl_tls_states_t) % sizeof(void*) == 0. The bitcast was actually not necessary anyway since it's always casted to T_ppjlvalue by emit_nthptr_addr anyway....

@vtjnash
Copy link
Member

vtjnash commented Nov 22, 2015

lgtm. thanks, 🎅 🐧

yuyichao and others added 6 commits November 22, 2015 20:26
fixes
d:/code/msys64/home/Tony/julia/src/jltypes.c(1910) : error C2664: 'LONG64 _Inter
lockedCompareExchange64(volatile LONG64 *,LONG64,LONG64)' : cannot convert argum
ent 1 from 'volatile uint64_t *' to 'volatile LONG64 *'
        Types pointed to are unrelated; conversion requires reinterpret_cast, C-
style cast or function-style cast
d:/code/msys64/home/Tony/julia/src/jltypes.c(1913) : error C2664: 'LONG64 _Inter
lockedCompareExchange64(volatile LONG64 *,LONG64,LONG64)' : cannot convert argum
ent 1 from 'volatile uint64_t *' to 'volatile LONG64 *'
        Types pointed to are unrelated; conversion requires reinterpret_cast, C-
style cast or function-style cast
d:/code/msys64/home/Tony/julia/src/jltypes.c(1993) : error C2664: 'LONG64 _Inter
lockedCompareExchange64(volatile LONG64 *,LONG64,LONG64)' : cannot convert argum
ent 1 from 'volatile uint64_t *' to 'volatile LONG64 *'
        Types pointed to are unrelated; conversion requires reinterpret_cast, C-
style cast or function-style cast
d:/code/msys64/home/Tony/julia/src/jltypes.c(2000) : error C2664: 'LONG64 _Inter
lockedCompareExchange64(volatile LONG64 *,LONG64,LONG64)' : cannot convert argum
ent 1 from 'volatile uint64_t *' to 'volatile LONG64 *'
        Types pointed to are unrelated; conversion requires reinterpret_cast, C-
style cast or function-style cast

also define sleep for MSVC
Kernel32 Sleep takes milliseconds, ref https://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx

fixes
d:/code/msys64/home/Tony/julia/src/threading.c(212) : error C3861: 'sleep': iden
tifier not found
d:/code/msys64/home/Tony/julia/src/threading.c(290) : error C3861: 'sleep': iden
tifier not found
…e always `extern "C"`'d

so codegen_mutex and codegen_lock_count do not get c++ mangled
@yuyichao
Copy link
Contributor Author

So I've just run some micro-benchmarks to compare the performance of creating a useless gc frame. The functions I used to benchmark this is (they are the gcframe and no gcframe respectively in the table below).

@noinline function tls_test(a)
    1 > 2 && "a$a"
    a
end

@noinline function non_tls_test(a)
    a
end

The first function generates a useless gcframe for the branch eliminated by llvm.

I've also precompiled these in the sysimg and verified in GDB that for this PR with threading, the sysimg version has one more level of indirection through jl_get_ptls_states while the JIT version calls jl_get_ptls_states_static directly.

Numbers below (time in ns)

master this pr master threading this pr threading
gcframe (sysimg) 1.74 1.74 5.22 3.07
no gcframe (sysimg) 1.09 1.09 1.09 1.09
gcframe 1.53 1.72 5.22 2.93
no gcframe 1.09 1.09 1.09 1.09

The only possible regression is the non-threading GC frame is slower by 1-2 cycles (0.217ns). The extra level of indirection for tls access in sysimg isn't as bad as what I thought.

Overall, I'm pretty happy with the current performance. Maybe we can do better for the tls access (trick the linker or emitting static TLS access if we can recognize that (with llvm support)?) but not for this PR ;-p

Will merge tomorrow if there's no other issues.

@tkelman
Copy link
Contributor

tkelman commented Nov 23, 2015

Let's merge now?

yuyichao added a commit that referenced this pull request Nov 23, 2015
Threading without using patched LLVM
@yuyichao yuyichao merged commit e6aba04 into master Nov 23, 2015
@yuyichao yuyichao deleted the yyc/tls-no-patch branch November 23, 2015 20:40
@yuyichao
Copy link
Contributor Author

Sure.....

@ViralBShah
Copy link
Member

Does this mean we can turn on travis?

@yuyichao
Copy link
Contributor Author

The build should work, but I'm pretty sure there's still bugs (not sure if we can hit those bugs without actually running many threads though). Maybe at least have a buildbot for it? @tkelman @staticfloat

@tkelman
Copy link
Contributor

tkelman commented Nov 24, 2015

It is dev master. I say let's turn threads on by default, everywhere except MSVC.

@yuyichao
Copy link
Contributor Author

I'd prefer to leave it off until we have a stop-the-world GC (rather than a wait-for-the-world GC as @vtjnash phrased it....)

@ViralBShah
Copy link
Member

I thought there was a buildbot for threading.

@staticfloat
Copy link
Member

@yuyichao What build configuration would you want? Just tacking JULIA_ENABLE_THREADING onto the makevars?

@yuyichao
Copy link
Contributor Author

I thought there was a buildbot for threading.

I thought there wasn't since I've heard there's issues getting newer LLVM to build.

What build configuration would you want

Normal build and test with JULIA_THREADS=1 make var.

@staticfloat
Copy link
Member

What is the minimum version of LLVM we can build against? We have Cxx nightlies that actually finish building every now and then, but if there is a specific version we can build against that'd be great.

@tkelman
Copy link
Contributor

tkelman commented Nov 24, 2015

Cxx nightlies are broken on gcc 5 because they are based off a broken point in upstream's history, see JuliaInterop/Cxx.jl#73 (comment).

Threading with LLVM 3.3 should now work, which was the purpose of this PR.

@ViralBShah
Copy link
Member

Perhaps I was thinking of the cxx buildbot. Sorry for the noise.

@staticfloat
Copy link
Member

Alright, I've got a nightly builder that will attempt this stuff, and have a test build forcing its way through right now.

@yuyichao
Copy link
Contributor Author

Seems like it was broken by #14103 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Windows build with JULIA_THREADS = 1 segfaults
6 participants