Extrema is slower than maximum + minimum #31442
@mbauman Can you provide the test case you used? I ran the following code:
and got this output:
julia version:
Yes, that test case reproduces the behavior on master for me. It won't on 1.1.0, as the PR that improved maximum and minimum came after 1.1.0. Of course, note that you'll also need a relatively recent computer to see the difference.
I got a slower
I would like to try working on this, and it would be great to hear what kind of patch is expected. One question I have is about the unrolling: ideally this should be adapted for each architecture and data type. How much of this should be up to LLVM, and how much to Julia? Should we prioritize generating code that LLVM can vectorize? The alternative I imagine would be controlling parameters like this unrolling depending on the architecture.
Thinking a bit more about this, maybe the best of both worlds would be if this function could simply be implemented with a general fold-left, and that should result in good vectorized machine code, relying on LLVM's automatic vectorization. Should we concentrate our efforts on making sure that a generic fold works better in more cases, or should we first make sure that a few specific functions have high-performance implementations?
I was thinking of the much simpler option of copy-pasting the techniques used in maximum/minimum (lines 481 to 518 in 2b49d26) into the current extrema implementation (lines 471 to 483 in 2b49d26).
Revamping mapreduce to support multiple functions is a much bigger task (and I'm not sure we want to add that complication to the already complex mapreduce machinery, either), and while improving LLVM's automatic vectorization is always a worthy goal, it's particularly challenging in cases like these.
I am still trying to understand why the specialization makes such a difference compared to the generic implementation. The idea I had at first of supporting multiple functions doesn't really make sense, sorry. I am thinking now of something much simpler, which would be to implement this function as a fold, which of course we can specialize if it makes sense. What I see now is that this doesn't seem to vectorize, and I wonder if just refactoring it would help. It doesn't seem to matter for this specific function, though, because from what I can tell the current implementation doesn't vectorize either.
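For concreteness, a fold-based version of that idea might look like the sketch below; this is my own sketch, not Base's implementation or the code that was originally posted:

```julia
# `extrema` expressed as a single left fold: map each element to a (lo, hi)
# pair and combine pairs with elementwise min/max.
extrema_fold(itr) = mapfoldl(x -> (x, x),
                             ((lo1, hi1), (lo2, hi2)) -> (min(lo1, lo2), max(hi1, hi2)),
                             itr)

extrema_fold([3, 1, 4, 1, 5])  # (1, 5)
```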
I quickly tried to use the
Note that just seeing vector instructions in the `@code_native` output isn't enough to conclude the loop was vectorized. For example, you can see that the instructions can still be operating on a single element at a time. Key things to see are the packed operations working across the full vector registers.
Alright, thanks! I need to get more used to reading the code_llvm output. What is a good reference for Intel instructions? And what does the 'v' mean in this case? Anyway, I guess it all means we really need to change this loop somehow to enable the vectorization. Will try something later.
I'm not an assembly expert, but here's my understanding: the v prefix is for a "VEX" (vector extensions) encoded assembly instruction; it's working on the vector registers, but only on a single value at a time. I find LLVM IR eminently more understandable and readable, and the giant language reference page is a good resource.
OK, it took me long enough to understand the whole thing and go through my grief process. It seems LLVM will really refuse to vectorize it unless we do this unrolling. I have prepared a first draft that shows a speedup. It is still missing many details like the chunky traversal. I hope I can prepare a PR by tomorrow, but it would be great to hear any comments. In particular, would this use of the Cartesian macros be acceptable?
I was checking how things are working on the LLVM side, and I see the relevant patch was merged months ago and made it into 8.0. Wouldn't this be a good candidate for the Julia LLVM patches, then? It wasn't the case before because the LLVM patch was still open. I'm proposing to fix the issue with a Julia LLVM patch plus an updated implementation.
I tried including the tiny D53680 LLVM patch into Julia; I'm not sure whether I did something wrong, but nothing came out of it. I would really like to understand better what could be happening, because I am not so sure there is just some small thing lacking in LLVM 6.0 that may soon just go away. Testing with clang 6.0.0 I can produce vectorized code as long as fast mode is used, nicely adapted to e.g. AVX-512, and it all works fine for integers. In Julia it works fine with integers, but not with floats. I was trying to use some LLVM intrinsics to see if they had any effect.
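As a side note, one way to call an LLVM intrinsic directly from Julia is `ccall` with the `llvmcall` calling convention. The comment doesn't say which intrinsics were tried, so `llvm.minnum.f32` below is only my example:

```julia
# Call the llvm.minnum.f32 intrinsic directly (sketch for illustration only).
llvm_minnum(a::Float32, b::Float32) =
    ccall("llvm.minnum.f32", llvmcall, Float32, (Float32, Float32), a, b)

llvm_minnum(1.5f0, -2.0f0)  # -2.0f0
```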
Wow, nice investigations! When I tagged this as looking for help, I was thinking it'd be a simple copy-pasting of the technique used in #30320, but you've definitely taken it to the next level. This is great. On the Julia side, yes, I think it'd be just fine to use the Cartesian macros since they're also used elsewhere in the multidimensional.jl file. We had been slowly replacing them as they're a bit hard to read, but they're here for now. You could also just write out (or get the macro to write out) the expression literally:

```julia
julia> @macroexpand Base.Cartesian.@ntuple 8 j->minornan(vmin[j], itr[i+j])
:((minornan(vmin[1], itr[i + 1]), minornan(vmin[2], itr[i + 2]), minornan(vmin[3], itr[i + 3]), minornan(vmin[4], itr[i + 4]), minornan(vmin[5], itr[i + 5]), minornan(vmin[6], itr[i + 6]), minornan(vmin[7], itr[i + 7]), minornan(vmin[8], itr[i + 8])))
```

A few other notes: you'll need to grab the final elements from the iterator (if its length is not an even multiple of the unroll width). I'm not as familiar with the LLVM side of things, but it may indeed be worth investigating what version 8 might bring. I don't know how far we are from version 8 compatibility; 7 was just added a few weeks ago. That patch might not stand alone as a patch to version 6.
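To make that concrete, an unrolled reduction in the spirit of the discussion might look like the sketch below; `minornan`, the unroll width of 8, and the scalar tail are my assumptions for illustration, not the eventual patch:

```julia
using Base.Cartesian: @nexprs, @ntuple

# NaN-propagating minimum, matching the `minornan` name used above (assumed definition).
minornan(a, b) = ifelse(isnan(a) | (a < b), a, b)

function unrolled_minimum(itr::Vector{Float32})
    n = length(itr)
    n >= 8 || return foldl(minornan, itr)   # fall back for short inputs
    @inbounds begin
        # Eight independent accumulators so the reduction can be vectorized/pipelined.
        vmin = @ntuple 8 j -> itr[j]
        i = 8
        while i + 8 <= n
            vmin = @ntuple 8 j -> minornan(vmin[j], itr[i + j])
            i += 8
        end
        # Combine the eight accumulators into one value.
        v = vmin[1]
        @nexprs 7 j -> (v = minornan(v, vmin[j + 1]))
        # Scalar tail for the remaining elements.
        while i < n
            i += 1
            v = minornan(v, itr[i])
        end
        return v
    end
end
```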
This code vectorizes on clang, using just comparisons and in fast mode: https://godbolt.org/z/tF8SSH
This (addition) vectorizes in Julia even without fast mode (fast mode is still being set in the IR, from what I understand), while the equivalent min loop does not vectorize in Julia even with fast mode.
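For reference, the two kinds of loops being contrasted look roughly like the sketch below; this is my reconstruction, since the exact snippets from the comment were lost in formatting:

```julia
# A sum reduction: reported above to vectorize even without fast math.
function mysum(itr::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(itr)
        s += itr[i]
    end
    return s
end

# The analogous min reduction: at the time of this issue it did not
# vectorize, even with @fastmath.
function mymin_loop(itr::Vector{Float32})
    v = itr[1]
    @inbounds @simd for i in 2:length(itr)
        @fastmath v = ifelse(v < itr[i], v, itr[i])
    end
    return v
end
```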
I did the following test today: first I produced logs from the LLVM passes for 4 functions, calculating the sum and the min of a vector of integers or floats (with fast-math). It is possible to see that the functions are very similar up to the point where the loop vectorizer runs, when it doesn't work for the case of the min on floats, as expected. I've put the logs here. I can't make much out of it, but maybe someone has an idea about the point where we see the first difference, at line 174, where the float version reads `br label %L88, !dbg !116` and the int version reads:

```llvm
%min.iters.check = icmp ult i64 %42, 16, !dbg !115
br i1 %min.iters.check, label %scalar.ph, label %vector.ph, !dbg !115
```

After that I also generated dumps from clang, where the vectorization happens. The dumped IR can be run with `llvmcall` with slight modifications, and then we can see that vectorization still doesn't happen for floats in Julia. I guess that means the Julia IR is fine and Clang isn't doing anything special there, so it may have something to do with the way LLVM is being configured by Julia?

```julia
using BenchmarkTools
using Test
# Hand-written LLVM IR: a scalar loop taking the minimum of 2^16 Float32 values.
mymin(itr::Vector{Float32})=
Base.llvmcall("""
%ptr = inttoptr i64 %0 to float*
%2 = load float, float* %ptr, align 4
br label %5
; <label>:3: ; preds = %5
%4 = phi float [ %11, %5 ]
ret float %4
; <label>:5: ; preds = %5, %1
%6 = phi i64 [ 1, %1 ], [ %12, %5 ]
%7 = phi float [ %2, %1 ], [ %11, %5 ]
%8 = getelementptr inbounds float, float* %ptr, i64 %6
%9 = load float, float* %8, align 4
%10 = fcmp fast olt float %7, %9
%11 = select i1 %10, float %7, float %9
%12 = add nuw nsw i64 %6, 1
%13 = icmp eq i64 %12, 65536
br i1 %13, label %3, label %5
""", Float32, Tuple{Ptr{Float32}}, pointer(itr))
# The same loop for Int32 values.
mymin(itr::Vector{Int32})=
Base.llvmcall("""
%ptr = inttoptr i64 %0 to i32*
%2 = load i32, i32* %ptr, align 4
br label %5
; <label>:3: ; preds = %5
%4 = phi i32 [ %11, %5 ]
ret i32 %4
; <label>:5: ; preds = %5, %1
%6 = phi i64 [ 1, %1 ], [ %12, %5 ]
%7 = phi i32 [ %2, %1 ], [ %11, %5 ]
%8 = getelementptr inbounds i32, i32* %ptr, i64 %6
%9 = load i32, i32* %8, align 4
%10 = icmp slt i32 %7, %9
%11 = select i1 %10, i32 %7, i32 %9
%12 = add nuw nsw i64 %6, 1
%13 = icmp eq i64 %12, 65536
br i1 %13, label %3, label %5
""", Int32, Tuple{Ptr{Int32}}, pointer(itr))
data = rand(Float32, 2^16)
# data = rand(Int32, 2^16)
@test mymin(data) == minimum(data)
@code_native mymin(data)
```

EDIT: running this code with
I have finally seen trouble while working strictly outside Julia: I produced the LLVM IR from the int and float versions, and I was not able to produce vectorized code with https://gist.github.com/nlw0/eb50a88845eba90b6b859154d2f2383d |
I finally started looking at the LLVM analysis log for the loop vectorizer, for the following test function:

```julia
function myfun(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:end]
            @fastmath v = ifelse(v < y, v, y)
        end
        v
    end
end

# a = collect(0x00000001:0x00010000)
a = collect(1f0:1f4)
myfun(a)
```

Straight away what you can see is that
I've run Julia under gdb and made a trace of the moment the loop vectorizer diverges between the integer and float versions of the min function: https://gist.github.com/nlw0/58ed9fda8e8944a9cb5e5a20f6038fcf
I'm writing a lot of stuff here, only some of which makes actual sense, so apologies for the noise. I believe I am finally getting close to the root cause of this issue, though. The LLVM loop vectorizer is expecting the function to carry a "no-nans-fp-math" attribute. The test is done here: https://github.com/llvm/llvm-project/blob/d359f2096850c68b708bc25a7baca4282945949f/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L5324

As far as I can tell Julia never sets this, and fast-math mode only affects instructions down at the leaves of the expression trees. This is why there is no optimization even though fast mode is active. I believe this is where this function attribute is set by clang: https://github.com/llvm-mirror/clang/blob/31c307cf96ef1b6a4d0c542d18ebfd7a307ae9b0/lib/CodeGen/CGCall.cpp#L1747-L1748

I tried hacking this attribute in, but there seems to be something else still missing. Hope I can make some more progress soon.
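One way to check whether the attribute shows up on a Julia function is to dump the whole module, since attribute groups are attached to the function definition rather than to individual instructions. This is only a sketch, relying on `code_llvm`'s `raw` and `dump_module` keyword arguments, and is not how the attribute hack itself was done:

```julia
using InteractiveUtils

fastmin(x::Float32, y::Float32) = @fastmath min(x, y)

# Dump the full module so the function-level attribute groups are included,
# then look for "no-nans-fp-math" in the printed IR.
buf = IOBuffer()
code_llvm(buf, fastmin, Tuple{Float32,Float32}; raw=true, dump_module=true)
ir = String(take!(buf))
println(occursin("no-nans-fp-math", ir) ? "attribute present" : "attribute absent")
```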
OK, there was still something missing in order to see it in `@code_native`. Now I can compile the function below and have `@code_native` show vectorized assembly for it.

```julia
function mymin(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:65536]
            @fastmath v = ifelse(v < y, v, y)
        end
        v
    end
end
```
With that I suppose we can start thinking of a proper solution, and since this involves code generation and maybe decisions about what
I did some tests in my hacked Julia compiler, benchmarking
Hi, during the development of my package I encountered a problem with `extrema`. I want to find the minimum and maximum in one run over a flatten iterator. I expect this not to allocate at all, but it allocates much more (610 MiB) than the size of both arrays summed (152 MiB). First I thought it was a problem with the flatten iterator, but that looks OK to me and causes problems only with the `extrema` function.
When I define my own extrema implementation, the problem does not occur:

```julia
function my_extrema(itr)
    vmin = vmax = first(itr)
    for item in itr
        vmax = max(item, vmax)
        vmin = min(item, vmin)
    end
    return (vmin, vmax)
end
```
I'm not sure if this is related to this issue, but clearly something is wrong with this function. I cannot work out what, and it happens only with a flatten iterator over a tuple of arrays.
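A minimal reproduction along these lines might look like the sketch below; the element type and array sizes are my assumptions, not the numbers from the report above:

```julia
using BenchmarkTools

a = rand(Float64, 10^7)
b = rand(Float64, 10^7)
flat = Iterators.flatten((a, b))

@btime extrema($flat)      # allocates far more than expected on the versions discussed here
@btime my_extrema($flat)   # the hand-written loop above stays allocation-free
```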
Can you file a new issue? The problem seems much more serious than this one if you can get better performance just with a simpler implementation.
@nalimilan, I think I am close to fixing that, so please give me some time and I will open a pull request instead of an issue, or at least an issue with more info.
Much more complicated than I expected and probably needs to be fixed after code lowering: #34385
Here you go with the simple and fast implementation I'm using routinely (if the NaN test is not applicable to your data, remove the extra isnan test).
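The posted code did not survive the formatting here, so the following is only a sketch of that kind of single-pass min/max with an isnan early exit, not the commenter's exact implementation:

```julia
function minmax_scan(x::AbstractVector{T}) where {T<:AbstractFloat}
    lo = hi = first(x)
    @inbounds for i in eachindex(x)
        xi = x[i]
        isnan(xi) && return (T(NaN), T(NaN))   # drop this check if NaNs cannot occur
        lo = ifelse(xi < lo, xi, lo)
        hi = ifelse(xi > hi, xi, hi)
    end
    return (lo, hi)
end
```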
Now let's benchmark the functions:
Without the isnan test:
Your extrema function works; however, from an algorithm-optimization perspective, why should you scan the array twice to get the min and max?
The problem I pointed out is not about this particular calculation, but rather the excessive memory allocation that shows up when using extrema with a flatten iterator, and potentially in other cases that follow a similar iteration pattern.
On AVX-512 systems, calling `extrema(x)` is slower than calling `maximum` and then `minimum`. Looks like we need to incorporate the same techniques from #30320 over to `extrema` in order to get it to vectorize in a similar manner.
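A quick way to reproduce the comparison (a sketch; timings depend on the machine and Julia version):

```julia
using BenchmarkTools

x = rand(Float32, 2^20)

@btime extrema($x)
@btime (minimum($x), maximum($x))
```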