Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

while i <= L makes function >5x slower the while i < (L+1) #31416

Closed
ndinsmore opened this issue Mar 20, 2019 · 27 comments
Closed

while i <= L makes function >5x slower the while i < (L+1) #31416

ndinsmore opened this issue Mar 20, 2019 · 27 comments

Comments

@ndinsmore
Copy link
Contributor

There seems to be a problem when <= is used in a while loop, where the entire function containing the while loop is >5x slower than the while loop that uses <.

I think that this might be preventing optimizations on inline iterators as well, I can give more details on that as well if needed.

I used the following benchmark functions to identify the problem:

using BenchmarkTools 
function while_vect_test_ne(vec::Vector{T}) where T
    ret=T(0)
    L=length(vec)
    i=1
    while  i != (L+1)
        @inbounds v=vec[i]
        ret+=v
        i += 1; 
    end
    return ret
end

function while_vect_test_lt(vec::Vector{T}) where T
    ret=T(0)
    L=length(vec)::Int
    i=1::Int
    while  i < (L+1)
        @inbounds v=vec[i]
        ret+=v
        i += 1; 
    end
    return ret
end
#this is the bad one
function while_vect_test_lte(vec::Vector{T}) where T
    ret=T(0)
    L=length(vec)::Int
    i=1::Int
    while  i <= L
        @inbounds v=vec[i]
        ret+=v
        i += 1; 
    end
    return ret
end

vec=collect(1:1000)

@benchmark while_vect_test_ne(vec)

@benchmark while_vect_test_lt(vec)

@benchmark while_vect_test_lte(vec)

Results

julia> @benchmark while_vect_test_ne(vec)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     66.301 ns (0.00% GC)
  median time:      70.757 ns (0.00% GC)
  mean time:        86.534 ns (9.57% GC)
  maximum time:     55.662 μs (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     976

julia> @benchmark while_vect_test_lt(vec)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     66.695 ns (0.00% GC)
  median time:      71.407 ns (0.00% GC)
  mean time:        94.916 ns (10.02% GC)
  maximum time:     67.625 μs (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     968

julia> @benchmark while_vect_test_lte(vec)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     416.070 ns (0.00% GC)
  median time:      436.558 ns (0.00% GC)
  mean time:        509.744 ns (2.33% GC)
  maximum time:     119.429 μs (99.48% GC)
  --------------
  samples:          10000
  evals/sample:     199
@StefanKarpinski
Copy link
Member

Note that the two conditions are not strictly identical because of integer wrap-around.

@vchuravy
Copy link
Member

vchuravy commented Mar 20, 2019

when L==typemax(Int), L+1 will overflow causing a infinite loop. LLVM (mostly SCEV) will determine that the loop-bounds are unknown and not vectorize this loop.

x-ref: #31011

the "right way" of writing the second loop is:

function while_vect_test_lte(vec::Vector{T}) where T
    ret=T(0)
    L=length(vec)::Int
    i=0::Int
    while  i < L
        i += 1
        @inbounds v=vec[i]
        ret+=v
    end
    return ret
end

@StefanKarpinski
Copy link
Member

The weird part is that I would expect the non-vectorized versions to be slower, not faster.

@ndinsmore
Copy link
Contributor Author

Updating

function while_vect_test_lt(vec::Vector{T}) where T
    ret=T(0)
    L=length(vec)::Int
    i=0::Int
    while  i < L
        i += 1;
        @inbounds v=vec[i]
        ret+=v
    end
    return ret
end

Yields:

julia> @benchmark while_vect_test_lt(vec)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     65.307 ns (0.00% GC)
  median time:      69.715 ns (0.00% GC)
  mean time:        84.524 ns (8.87% GC)
  maximum time:     55.671 μs (99.78% GC)
  --------------
  samples:          10000
  evals/sample:     976

@KristofferC
Copy link
Member

Both the first and second loop are vectorized and are fast. The third loop is not vectorized and is slow.

@StefanKarpinski
Copy link
Member

Then isn't the explanation backwards since those are the two that compare with L+1?

@KristofferC
Copy link
Member

KristofferC commented Mar 21, 2019

If L+1 overflows the first loop will loop until i overflows and the second one will loop 0 times.

@StefanKarpinski
Copy link
Member

Oh duh, they both terminate while i ≤ L is always true so it's an infinite loop.

@StefanKarpinski
Copy link
Member

Not sure there's anything to do here: Julia and LLVM seem to be handling this correctly.

@vchuravy
Copy link
Member

I agree that is the semantics that we have and LLVM handles them correctly. I will note that using add nsw (no-signed wrap) is an alternative, but avoiding the overflow is better in general.

@ndinsmore you mentioned:

I think that this might be preventing optimizations on inline iterators as well, I can give more details on that as well if needed.

Do you have more information on that?

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Mar 21, 2019

Also, note that if you use a range as is idiomatic, then this has good performance:

function vect_test_range(vec::Vector{T}) where T
    ret = T(0)
    L = length(vec)
    for i = 1:L
        @inbounds v = vec[i]
        ret += v
    end
    return ret
end

Unfortunately, the even more idiomatic version just doing iteration is less fast:

function vect_test_iterate(vec::Vector{T}) where T
    ret = T(0)
    for v in vec
        ret += v
    end
    return ret
end

Do we have an issue tracking the iteration protocol being able to automatically @inbounds things? cc @Keno, @mbauman

All my timings:

@belapsed(while_vect_test_ne($vec))  = 3.098590130916415e-8
@belapsed(while_vect_test_lt($vec))  = 3.104028197381672e-8
@belapsed(while_vect_test_lte($vec)) = 2.8979411764705885e-7
@belapsed(vect_test_range($vec))     = 3.093655589123867e-8
@belapsed(vect_test_iterate($vec))   = 5.7927624872579005e-8

@KristofferC
Copy link
Member

KristofferC commented Mar 21, 2019

Looks the same to me?

julia> @btime  vect_test_range(vec)
  538.303 ns (1 allocation: 16 bytes)
50005000

julia> @btime  vect_test_iterate(vec)
  545.455 ns (1 allocation: 16 bytes)
50005000

#27386 should have fixed that.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Mar 21, 2019

I'm getting consistently different timings:

julia> @btime  vect_test_range(vec)
  47.356 ns (1 allocation: 16 bytes)
500500

julia> @btime  vect_test_iterate(vec)
  73.695 ns (1 allocation: 16 bytes)
500500

julia> @btime  vect_test_range($vec)
  31.103 ns (0 allocations: 0 bytes)
500500

julia> @btime  vect_test_iterate($vec)
  57.703 ns (0 allocations: 0 bytes)
500500

Will build a fresh master and see if anything changes...

@KristofferC
Copy link
Member

KristofferC commented Mar 21, 2019

For small arrays I see some difference but it seems it goes away when increasing the size a bit:

julia> vec=collect(1:1000);

julia> @btime  vect_test_range(vec)
  42.380 ns (1 allocation: 16 bytes)
500500

julia> @btime  vect_test_iterate(vec)
  66.700 ns (1 allocation: 16 bytes)
500500

julia> vec=collect(1:5000);

julia> @btime  vect_test_range(vec)
  283.272 ns (1 allocation: 16 bytes)
12502500

julia> @btime  vect_test_iterate(vec)
  283.392 ns (1 allocation: 16 bytes)
12502500

@StefanKarpinski
Copy link
Member

Ok, that's interesting. Not sure how much we should be bothered by that... I am a little.

@ndinsmore
Copy link
Contributor Author

So how would you write a non-infinite loop using <= ? That seems like a huge performance pitfall.

@StefanKarpinski
Copy link
Member

This isn't a Julia-specific problem, it's a general "integers overflow" problem. C avoids this by making integer overflow UB, so the compiler can do anything it wants, but we don't want to introduce UB. You could check that L < typemax(T) beforehand which allows LLVM to prove that it won't overflow.

At a higher level, why are you writing while loops rather than iterating over a range?

@ndinsmore
Copy link
Contributor Author

ndinsmore commented Mar 21, 2019

@vchuravy

I think that this might be preventing optimizations on inline iterators as well, I can give more details on that as well if needed.

Do you have more information on that?
This is less compelling than it was when I first discovered it and is potential coincidental.

So it started with an obsession the performance of iteration on dicts:
Starting with a test function like:

using BenchmarkTools
function iterate_test_func(iteration_function,f,iter,N=1)
    ret=0
    for n in 1:N
        ret=0
        @inbounds next = iteration_function(iter)
        while next !== nothing
            (v, state) = next
            ret+=f(v)
            @inbounds next = iteration_function(iter, state)
        end
    end
    return ret
end

d=Dict(v=>v for v in 1:1000)
@benchmark iterate_test_func(iterate,v->v,values(d))
@benchmark iterate_test_func(iterate,v->v.second,d)

results

julia> @benchmark iterate_test_func(iterate,v->v,values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     4.223 μs (0.00% GC)
  median time:      4.953 μs (0.00% GC)
  mean time:        5.478 μs (0.00% GC)
  maximum time:     125.694 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark iterate_test_func(iterate,v->v.second,d)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     4.522 μs (0.00% GC)
  median time:      5.210 μs (0.00% GC)
  mean time:        5.675 μs (0.00% GC)
  maximum time:     135.197 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

Then to see if iterate could be faster I wrote something like:
(Which I don't think is an infinate loop)

 Base.@propagate_inbounds @inline function iterate_slim(v::T,  i = v.dict.idxfloor) where T <: Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}}
    d=v.dict
    slots=d.slots
    L = length(slots)
    while i <= L
            @inbounds if slots[i] == 0x1
                @inbounds return ((T <: Base.KeySet) ? d.keys[i] : d.vals[i], i+1)
            end
            i += 1
    end
    return nothing
end

julia> @benchmark iterate_test_func(iterate_slim,v->v,values(d),10)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     36.138 μs (0.00% GC)
  median time:      39.947 μs (0.00% GC)
  mean time:        42.563 μs (0.00% GC)
  maximum time:     827.562 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Now knowing about the pitfall of <=: (Very small code change)

 Base.@propagate_inbounds @inline function iterate_slim_lt(v::T,  i = v.dict.idxfloor-1) where T <: Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}}
    d=v.dict
    slots=d.slots
    L = length(slots)
    while i < L
            i += 1
            @inbounds if slots[i] == 0x1
                @inbounds return ((T <: Base.KeySet) ? d.keys[i] : d.vals[i], i+1)
            end
    end
    return nothing
end

julia>  @benchmark iterate_test_func(iterate_slim_lt,v->v,values(d),1)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.257 μs (0.00% GC)
  median time:      2.696 μs (0.00% GC)
  mean time:        3.100 μs (0.00% GC)
  maximum time:     132.669 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

Finally at one point I was trying to understand entitlement so I wrote:
(Which again I believe is a non-infinite loop). The coisidence being that this version using a <= is about the same order of magnitude as the vanilla iterate.

function slow_entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
    ret=0
    @inbounds for n in 1:N
        ret=0
        d=iter.dict
        slots=d.slots
        L=length(slots)
        vals=d.vals
        i=d.idxfloor
        while i <= L
            @inbounds if slots[i] === 0x1
                @inbounds v=vals[i]
                ret+=f(v)
            end
            i+=1
        end
    end
    return ret
end


julia> @benchmark slow_entitlement_test_func(v->v,values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     6.702 μs (0.00% GC)
  median time:      7.512 μs (0.00% GC)
  mean time:        8.123 μs (0.00% GC)
  maximum time:     128.705 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

Now get rid of <= we get blazing a fast loop:
(Which I am still trying to understand why doesn't the iterate_slim_lt version above get optimize to nearly the same code.)

function entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
    ret=0
    @inbounds for n in 1:N
        ret=0
        d=iter.dict
        slots=d.slots
        L=length(slots)
        vals=d.vals

        for i = d.idxfloor:L
            @inbounds if slots[i] === 0x1
                    @inbounds v=vals[i]
                    ret+=f(v)
            end
            i+=1
        end
    end
    return ret
end

julia> @benchmark entitlement_test_func(v->v,values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     979.353 ns (0.00% GC)
  median time:      1.005 μs (0.00% GC)
  mean time:        1.067 μs (0.00% GC)
  maximum time:     8.688 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     17

@ndinsmore
Copy link
Contributor Author

@StefanKarpinski I think that iteration through a Dict, pairs or values should be 10X faster on the order of entitlement above.
I think that the function-blocking the optimization is:

function skip_deleted(h::Dict, i)
    L = length(h.slots)
    @inbounds while i<=L && !isslotfilled(h,i)
        i += 1
    end
    return i
end

But have yet been able to be able to prove that.

@StefanKarpinski
Copy link
Member

Makes sense. It could be changed to

function skip_deleted(h::Dict, i)
    L = length(h.slots)
    @inbounds while true
        if i >= L || isslotfilled(h, i)
            return i
        end
        i += 1
    end
end

That should be provably non-infinite since i will eventually be at least as large as L.

@KristofferC
Copy link
Member

KristofferC commented Mar 21, 2019

The || !isslotfilled(h,i) and the early return should kill any vectorization attempts anyway, no?

@ndinsmore
Copy link
Contributor Author

This gives a 2x improvement:

julia> @benchmark iterate_test_func(iterate,v->v,values(d),1)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     4.329 μs (0.00% GC)
  median time:      4.949 μs (0.00% GC)
  mean time:        5.340 μs (0.00% GC)
  maximum time:     119.739 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark iterate_test_func(iterate,v->v.second,d,1)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     5.056 μs (0.00% GC)
  median time:      6.266 μs (0.00% GC)
  mean time:        6.722 μs (0.00% GC)
  maximum time:     446.593 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> function Base.skip_deleted(h::Dict, i)
           L = length(h.slots)
           for i= i:L 
               @inbounds if Base.isslotfilled(h,i)
                   return i
               end
           end
           return L+1
       end

julia> @benchmark iterate_test_func(iterate,v->v,values(d),1)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.623 μs (0.00% GC)
  median time:      2.839 μs (0.00% GC)
  mean time:        3.083 μs (0.00% GC)
  maximum time:     25.621 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark iterate_test_func(iterate,v->v.second,d,1)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     3.832 μs (0.00% GC)
  median time:      5.146 μs (0.00% GC)
  mean time:        5.316 μs (0.00% GC)
  maximum time:     51.718 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

@ndinsmore
Copy link
Contributor Author

is the if statement in the following killing vectorization?

 function Base.skip_deleted(h::Dict, i)
           L = length(h.slots)
           @inbounds while i<=L
               if Base.isslotfilled(h,i)
                   return i
               end
               i += 1
           end
           return i
       end

Because isn't it provably non-infinite?
It provides no real benefit.

@ndinsmore
Copy link
Contributor Author

ndinsmore commented Mar 21, 2019

@StefanKarpinski @KristofferC @vchuravy
Do you have any idea why the following loop is so much slower(8x) than the for loop? when I dug in it came down to the <= , but isn't this a non-infinite loop?

function slow_entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
    ret=0
    @inbounds for n in 1:N
        ret=0
        d=iter.dict
        slots=d.slots
        L=length(slots)
        vals=d.vals
        i=d.idxfloor
        while i <= L
         #for i = d.idxfloor:L
            @inbounds if slots[i] === 0x1
                @inbounds v=vals[i]
                ret+=f(v)
            end
            i+=1 # Also comment out for the for loop
        end
    end
    return ret
end

This actually the source of this whole issue.
For the performance, results see above.

@vchuravy
Copy link
Member

What happens if i=L=typemax(Int64)? You will enter the loop, and then i will overflow and wrap around.

@ndinsmore
Copy link
Contributor Author

@vchuravy is there a way to tell the optimizer that it can't happen even if it appears it could?
I tried @assert L < typemax(Int), because in this case seeing as is a linear array and it could not happen since most 64 bit systems limit memory to something less the 2^64 spaces.

@vchuravy
Copy link
Member

vchuravy commented Mar 21, 2019

While I do not recommend this, there is always a way.

    function unsafe_inc(i::Int64)
        Base.llvmcall("""
        %i = add nsw i64 %0, 1
        ret i64 %i
        """, Int64, Tuple{Int64}, i)
    end

function slow_entitlement_test_func(f,iter::Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}},N=1)
    ret=0
    @inbounds for n in 1:N
        ret=0
        d=iter.dict
        slots=d.slots
        L=length(slots)
        vals=d.vals
        i=d.idxfloor
        while i <= L
         #for i = d.idxfloor:L
            @inbounds if slots[i] === 0x1
                @inbounds v=vals[i]
                ret+=f(v)
            end
            i = unsafe_inc(i)
        end
    end
    return ret
end

Since this is not a bug in Julia, but due to the defined semantics I am going to close this issue.
Please continue discussion on discourse.julialang.org

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants