Specialization despite nospecialize annotation #35131
FWIW, …
Care to explain? Or a link? I have a vague memory of this but I can't recall the particulars. Is there something else I should be using to measure the number of compiled specializations for each method?
The specializations in that table have been inferred, but not compiled. We could add some kind of …
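For readers following along: the inferred-vs-compiled distinction being drawn here can be seen directly. A minimal sketch, using only documented reflection; code_typed forces inference for a signature without generating or running native code:

```julia
# Inference and compilation are separate steps:
h(@nospecialize(x)) = x === nothing

code_typed(h, Tuple{Int})   # runs inference for h(::Int); nothing is compiled or executed
h(1)                        # an actual call: now the method is compiled and run
```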
I see. Presumably this would have quite a few use-cases. I would guess there's hardly ever a reason to infer calls in a … UPDATE: you can combine …
For packages like LoopVectorization and Cassette, which do a lot with generated functions, it seems this might decrease latency a fair bit?
I don't believe we spend a lot of time compiling and running generators, especially since they run in fixed worlds and we add …
In case it helps, I pushed a PR you can use as a real-world test case: JuliaSIMD/LoopVectorization.jl#76
I tried running the LoopVectorization tests with ENABLE_TIMINGS, but with all the compiler timers disabled so that the time would be divided between ROOT and STAGED_FUNCTION. Staged is about 5%. The full profile is a bit unusual though:

…

It's rare to see that much time in codegen.
Does the 5% STAGED_FUNCTION include things triggered by running the generator, but which are not …? Interesting, though, that it's only 15% inference.
Yes, it includes everything that runs while doing code generation. That also brings up why it's a bit difficult to use different compilation parameters for generators --- a generator might call any function, and we probably do want to infer most/all of those, so it's hard to know where to draw the line.
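To make the "a generator might call any function" point concrete, here is a minimal generated function; everything in its body runs at compile time, once per concrete signature, and may call arbitrary helper code. A hedged illustration, not LoopVectorization's actual generator:

```julia
# The body below executes during compilation; `x` is bound to the *type* of
# the argument, and any function called here contributes to generator time.
@generated function fieldcount_plus(x, n::Int)
    k = fieldcount(x)   # arbitrary compile-time work on the argument type
    return :($k + n)    # only this expression becomes the compiled method body
end

fieldcount_plus((1, 2.0), 10)   # 12: the tuple type has 2 fields
```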
I see what you mean. I'd guess that most of the code in LoopVectorization itself could happily run in the interpreter, but lots of the Base methods it calls are generically useful and would probably be better run in compiled mode.
This would be a great use case for per-module optimization levels; hopefully pretty soon we'll be able to apply -O0 to the whole package.
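Julia 1.5 did gain this knob as Base.Experimental.@optlevel. A minimal sketch of applying it to a whole package module (the module and function are hypothetical):

```julia
module MyLatencySensitivePackage   # hypothetical package module

# Request -O0 for all code defined in this module
# (Base.Experimental.@optlevel shipped in Julia 1.5).
Base.Experimental.@optlevel 0

plan(x) = x + 1   # bookkeeping code: fast to compile, rarely speed-critical

end
```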
Presumably some of that compile time is for the actual …
I think I figured out the unusually large codegen time. It's all in emit_llvmcall, presumably from SIMDPirates. It's not common for code to use a huge number of llvmcalls, so it hasn't been optimized in the compiler much. A lot of the time is just in generating unique names and parsing the LLVM IR.
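For context, the construct in question: llvmcall splices a string of LLVM IR into a method, and each call site must be parsed and uniquely named by the compiler. A minimal sketch, not SIMDPirates' actual code:

```julia
# Each llvmcall site hands the compiler raw IR to parse; arguments map to %0, %1, …
add_one(x::Int64) = Base.llvmcall(
    """
    %y = add i64 %0, 1
    ret i64 %y
    """,
    Int64, Tuple{Int64}, x)

add_one(41)   # 42
```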
In my local tests (just running the profiler) I'm seeing a bit larger fraction for inference; a bit over 7/25ths of the time is in …

```
julia> versioninfo()
Julia Version 1.5.0-DEV.458
Commit fa5e92445c* (2020-03-14 19:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_CPU_THREADS = 4
```

and master on LoopVectorization and VectorizationBase.
Also worth noting that LoopVectorization has a large precompile file to try to reduce the amount of time spent on inference. Since that precompile file was generated by running the tests, it's essentially hand-crafted to minimize the amount of inference time while running the tests. This will not, however, be applicable for real-world uses of LoopVectorization, since consumers can't require precompilation of methods of LoopVectorization itself. Master is especially relevant because of JuliaSIMD/LoopVectorization.jl#75, which added some type parameters to a type that doesn't have them in released versions of the package.
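For readers unfamiliar with such files: a hedged sketch of the shape of a generated precompile file (the entries below are hypothetical; the real one was produced from LoopVectorization's test suite):

```julia
# precompile.jl --- explicit inference requests, executed during package
# precompilation so their results land in the .ji cache.
function _precompile_()
    ccall(:jl_generating_output, Cint, ()) == 1 || return nothing
    precompile(Tuple{typeof(sum), Vector{Float64}})            # hypothetical entry
    precompile(Tuple{typeof(map), typeof(abs), Vector{Int}})   # hypothetical entry
    nothing
end
```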
LoopVectorization does a lot of work to generate the functions, including iterating over a lot of loop orders and unrolling combinations, and running the cost model on each of them. If per-module compilation were implemented, and someone called a generated function from the module, how would the generated code be optimized? While O0 or maybe O1 would be best for code generation, I still need the actual generated code to be compiled with O2 or O3. While I'm not using LLVM for autovectorization, I'm still relying on it for things like dead code elimination and instcombine, and not worrying about making indexing efficient. But I guess this would be as simple to solve as having multiple modules within LoopVectorization, so that … (see the sketch after this comment).

Quickly testing after deleting the body of the precompile function: …
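A hedged sketch of that multiple-module idea, combined with the per-module optimization levels discussed above; the module layout and names here are hypothetical, not the real package:

```julia
module LoopVectorizationSketch   # hypothetical layout

module Planner
    # Generator-side code: loop-order search, cost model. It runs while user
    # code compiles, so cheap compilation matters more than runtime speed.
    Base.Experimental.@optlevel 0
    choose_order(loops) = first(loops)   # stand-in for the real search
end

# The @generated functions live outside Planner, so the generated bodies
# they return are compiled at the default (full) optimization level, while
# the expensive planning helpers they call were compiled at -O0.
@generated function vmapped(f, x)
    return :(f(x))   # stand-in for the real emitted loop
end

end
```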
Out of curiosity, any idea how LLVM.jl's performance compares with llvmcall?
It seems time for an update here. Using a recent Julia master with …

So about 30% on lowering+methodlookup+inference and 50% LLVM. Here's a summary of what SnoopCompile tells me of where it's spending its inference time: …
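The summary itself did not survive extraction; for reference, a hedged sketch of how such inference-time data was gathered with the SnoopCompile of that era (the @snoopi macro returned (inclusive time, MethodInstance) pairs; newer versions use @snoopi_deep instead):

```julia
using SnoopCompile

# Collect per-MethodInstance inference times while the package loads,
# keeping only entries that cost at least 10 ms:
inf_timing = @snoopi tmin=0.01 begin
    using LoopVectorization
end
```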
At this point I'm reasonably satisfied that there is no obvious low-hanging fruit. I'm fine with seeing this issue closed.
Can anyone else confirm this?

```
tim@diva:~/src/julia-master$ time julia --startup-file=no -e 'println(VERSION); using LoopVectorization'
1.4.2-pre.0

real    0m6.005s
user    0m6.042s
sys     0m0.252s
tim@diva:~/src/julia-master$ time julia-master --startup-file=no -e 'println(VERSION); using LoopVectorization'
1.5.0-DEV.875

real    0m0.666s
user    0m0.739s
sys     0m0.221s
```

I first thought I'd slipped a decimal point. If that's true, it's absolutely incredible!
On 1.1: …

Master: …

This is around a 20% improvement, but much less than 10x. How reliable is timing Travis runs between versions? Julia 1.4's tests complete at least 10% faster than 1.1's.
Definitely already precompiled on 1.4. I also just ran a fresh compilation on 1.4 and …

EDIT: actually, a recompile fixed it. When I first tested the recompile I inadvertently ran the …

```
tim@diva:~/src/julia-1$ time julia --startup-file=no -e 'println(VERSION); using LoopVectorization'
1.4.2-pre.0

real    0m0.990s
user    0m1.070s
sys     0m0.212s
```

which is much more in line with my general experience. Earlier today I was messing around with PackageCompiler. Nominally I had …
Hm, perhaps this snippet from the manual section "Be aware of when Julia avoids specializing" is slightly confusing then? "Note that @code_typed and friends will always show you specialized code, even if Julia would not normally specialize that method call. You need to check the [method internals](@ref ast-lowered-method) if you want to see whether specializations are generated when argument types are changed, i.e., if (@which f(...)).specializations contains specializations for the argument in question." In that the distinction between inferred and compiled isn't explicit? Or am I not following? :)
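Following the quoted manual passage, a minimal sketch of the check it describes (the specializations field is internal and its representation varies across Julia versions):

```julia
g(@nospecialize(x)) = x

g(1); g("a")        # two call sites with different argument types
m = @which g(1)     # the single Method behind both calls
m.specializations   # internal: inspect which signatures got table entries
```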
Using nspecializations from here, I find the following behavior strange: …

This is a MWE of a bigger problem: in LoopVectorization, I've measured at least 33 specializations of abstractparameters, despite the fact that both arguments are marked as not being specializable. In that case there are specializations on both parameters. Here's 30 of them: …
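The MWE itself was lost to extraction, but its shape can be reconstructed. A hedged sketch: the nspecializations helper below is hypothetical (the one linked in the issue may differ), and Base.specializations only exists on recent Julia (1.10+):

```julia
# Hypothetical counter for a method's specialization-table entries.
nspecializations(m::Method) = count(Returns(true), Base.specializations(m))

foo(@nospecialize(a), @nospecialize(b)) = (a, b)

foo(1, 2); foo(1.0, 2.0); foo("x", :y)
nspecializations(@which foo(1, 2))   # the issue: more entries than @nospecialize suggests
```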