Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues #3

Closed
rdeits opened this issue Nov 29, 2016 · 7 comments
Closed

Performance issues #3

rdeits opened this issue Nov 29, 2016 · 7 comments

Comments

@rdeits
Copy link
Collaborator

rdeits commented Nov 29, 2016

I've started looking into what it will take to make MultivariatePolynomials fast as a general-purpose polynomial tool. Specifically, I'm trying to focus on variable substitution for now. Currently, it's pretty slow:

@polyvar x y
p = x + y + 2x^2 + y^3
vars = [x, y]
vals = [1.0, 2.0]

calling p(vals, vars) takes around 3 microseconds.

I've gotten that down to about 270 nanoseconds in #2, but doing any better will require some internal changes (by comparison, MultiPoly takes about 70 nanoseconds on the same test).

I think the fundamental problem is that we evaluate polynomials by iterating over the Terms in the polynomial. But that iteration requires constructing a new Term type for each element, which in turn has to allocate new Vectors of monomials and powers. That gets expensive and slow. Ideally, we should be able to evaluate a polynomial with zero allocations at all.

One way we could possibly achieve this is by changing the polynomial types to just store a vector of Terms, rather than constructing a new Term when we iterate. Is there a reason that wasn't done already? Does it make some other part of the code more difficult?

@blegat
Copy link
Member

blegat commented Nov 29, 2016

Storing a vector of Term would store duplicated instances of the Vector{Polyvar}. The solution would be to not create any Term

@blegat
Copy link
Member

blegat commented Nov 30, 2016

I think that the fact that iterating over a polynomials gives it terms is rather user friendly it might be ok if it is the the most efficient way to iterate over the term if this spares the copying of the Vector{PolyVar}.
Now the fact that I use this user friendly construct in critical places in MultivariatePolynomials is clearly inefficient and was only temporary.

@blegat
Copy link
Member

blegat commented Nov 30, 2016

I have replaced the for t in p by a more efficient version for subs in dbf3a3a :)

@rdeits
Copy link
Collaborator Author

rdeits commented Nov 30, 2016

Great! I may look into this more in the future, but for now I'll close the issue.

@rdeits rdeits closed this as completed Nov 30, 2016
@blegat
Copy link
Member

blegat commented Nov 30, 2016

What is the performance now ? I get the following for MultiPoly

BenchmarkTools.Trial: 
  memory estimate:  144.00 bytes
  allocs estimate:  5
  --------------
  minimum time:     80.256 ns (0.00% GC)
  median time:      83.010 ns (0.00% GC)
  mean time:        96.295 ns (11.78% GC)
  maximum time:     2.060 μs (94.59% GC)
  --------------
  samples:          10000
  evals/sample:     968
  time tolerance:   5.00%
  memory tolerance: 1.00%

and the following for this package

BenchmarkTools.Trial: 
  memory estimate:  96.00 bytes
  allocs estimate:  3
  --------------
  minimum time:     108.910 ns (0.00% GC)
  median time:      110.604 ns (0.00% GC)
  mean time:        117.969 ns (4.26% GC)
  maximum time:     1.934 μs (92.85% GC)
  --------------
  samples:          10000
  evals/sample:     932
  time tolerance:   5.00%
  memory tolerance: 1.00%

The performance difference seems to be due to the comparison vars == varorder since when I replace it with true I get

BenchmarkTools.Trial: 
  memory estimate:  32.00 bytes
  allocs estimate:  1
  --------------
  minimum time:     71.542 ns (0.00% GC)
  median time:      72.920 ns (0.00% GC)
  mean time:        75.062 ns (2.10% GC)
  maximum time:     1.383 μs (92.97% GC)
  --------------
  samples:          10000
  evals/sample:     974
  time tolerance:   5.00%
  memory tolerance: 1.00%

I find it rather dangerous to pass values for the variables without providing the variable order that is meant like MultiPoly does except if we are very clear about what is the default ordering.
IIRC we decided that the PolyVar order would not be between name but rather with their creation order which would be stored as an Int.

Indeed comparison between int is much faster than between symbols:

julia> @benchmark(:a < :b)
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.473 ns (0.00% GC)
  median time:      7.480 ns (0.00% GC)
  mean time:        7.621 ns (0.00% GC)
  maximum time:     23.785 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark(1 < 2)
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.793 ns (0.00% GC)
  median time:      1.850 ns (0.00% GC)
  mean time:        1.853 ns (0.00% GC)
  maximum time:     9.328 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%

The Int could be retrieved like so parse(Int, string(gensym())[3:end]). Even if we were clear about the fact that we treat the default ordering as the creation order, I still find it dangerous to encourage use to not specify the order since with parallel computing this ordering can become undeterministic...
What do you think ?

  • What is the best ordering ? Symbol or Int ?
  • Should we add the possibility to use substitution without giving the ordering or is it too dangerous ?

@blegat
Copy link
Member

blegat commented Nov 30, 2016

One cause of the performace issue is the inefficiency of == in Julia Base. There is no specific method for Vector so it calls the method

function (==)(A::AbstractArray, B::AbstractArray)
    if indices(A) != indices(B)
        return false
    end
    if isa(A,Range) != isa(B,Range)
        return false
    end
    for (a, b) in zip(A, B)
        if !(a == b)
            return false
        end
    end
    return true
end

which is slower than doing

function myeq(A::Vector, B::Vector)
    len = length(A)
    if length(B) != len
        return false
    end
    for i in 1:len
        if !(A[i] == B[i])
            return false
        end
    end
    return true
end

@blegat
Copy link
Member

blegat commented Nov 30, 2016

More details, on Julia v0.5, if I define:

type A
    a
end
import Base.==
(==)(a::A, b::A) = a.a == b.a

function myeq00(A::Vector, B::Vector)
    if indices(A) != indices(B)
        return false
    end
    for (a, b) in zip(A, B)
        if !(a == b)
            return false
        end
    end
    return true
end
function myeq01(A::Vector, B::Vector)
    if indices(A) != indices(B)
        return false
    end
    for i in 1:length(A)
        if !(A[i] == B[i])
            return false
        end
    end
    return true
end
function myeq10(A::Vector, B::Vector)
    if length(B) != length(A)
        return false
    end
    for (a, b) in zip(A, B)
        if !(a == b)
            return false
        end
    end
    return true
end
function myeq11(A::Vector, B::Vector)
    if length(B) != length(A)
        return false
    end
    for i in 1:length(A)
        if !(A[i] == B[i])
            return false
        end
    end
    return true
end
function myeqfast(A::Vector, B::Vector)
    len = length(A)
    if length(B) != len
        return false
    end
    for i in 1:len
        if !(A[i] == B[i])
            return false
        end
    end
    return true
end

f = [((==), "=="), (myeq00, "00"), (myeq01, "01"), (myeq10, "10"), (myeq11, "11"), (myeqfast, "fast")]

using BenchmarkTools
function st(x)
    string(BenchmarkTools.prettytime(time(x)), " (", BenchmarkTools.prettypercent(BenchmarkTools.gcratio(x)), ")")
end
function bench(v)
    for (eq, name) in f
        @assert eq(v, v)
        print("$name")
        b = @benchmark($eq($v, $v))
        println(" | $(st(minimum(b))) | $(st(median(b))) | $(st(mean(b))) | $(st(maximum(b)))")
    end
end

Then for bitstype, I get:

for n in [1, 10, 100, 1000]
    bench(ones(Int, n))
end

n=1:

eq min (GC) median (GC) mean (GC) max (GC)
== 8.620 ns (0.00%) 8.627 ns (0.00%) 8.898 ns (0.00%) 37.732 ns (0.00%)
00 9.192 ns (0.00%) 9.198 ns (0.00%) 9.341 ns (0.00%) 39.032 ns (0.00%)
01 8.046 ns (0.00%) 8.053 ns (0.00%) 8.216 ns (0.00%) 42.357 ns (0.00%)
10 4.320 ns (0.00%) 4.325 ns (0.00%) 4.343 ns (0.00%) 36.473 ns (0.00%)
11 4.606 ns (0.00%) 4.612 ns (0.00%) 4.648 ns (0.00%) 35.283 ns (0.00%)
fast 4.751 ns (0.00%) 4.898 ns (0.00%) 4.994 ns (0.00%) 18.439 ns (0.00%)

n=10:

eq min (GC) median (GC) mean (GC) max (GC)
== 21.618 ns (0.00%) 23.195 ns (0.00%) 24.380 ns (0.00%) 68.913 ns (0.00%)
00 26.562 ns (0.00%) 26.571 ns (0.00%) 27.298 ns (0.00%) 60.414 ns (0.00%)
01 18.356 ns (0.00%) 18.362 ns (0.00%) 18.469 ns (0.00%) 58.845 ns (0.00%)
10 12.706 ns (0.00%) 13.093 ns (0.00%) 13.248 ns (0.00%) 48.511 ns (0.00%)
11 12.998 ns (0.00%) 13.398 ns (0.00%) 14.616 ns (0.00%) 50.421 ns (0.00%)
fast 12.114 ns (0.00%) 12.120 ns (0.00%) 12.934 ns (0.00%) 47.351 ns (0.00%)

n=100:

eq min (GC) median (GC) mean (GC) max (GC)
== 162.715 ns (0.00%) 165.295 ns (0.00%) 170.022 ns (0.00%) 328.290 ns (0.00%)
00 172.337 ns (0.00%) 173.652 ns (0.00%) 175.903 ns (0.00%) 298.930 ns (0.00%)
01 128.699 ns (0.00%) 129.279 ns (0.00%) 131.219 ns (0.00%) 203.384 ns (0.00%)
10 95.300 ns (0.00%) 95.407 ns (0.00%) 102.318 ns (0.00%) 440.289 ns (0.00%)
11 95.399 ns (0.00%) 95.689 ns (0.00%) 97.432 ns (0.00%) 279.149 ns (0.00%)
fast 95.681 ns (0.00%) 101.482 ns (0.00%) 100.489 ns (0.00%) 246.303 ns (0.00%)

n=1000:

eq min (GC) median (GC) mean (GC) max (GC)
== 1.529 μs (0.00%) 1.546 μs (0.00%) 1.602 μs (0.00%) 7.085 μs (0.00%)
00 1.476 μs (0.00%) 1.480 μs (0.00%) 1.507 μs (0.00%) 2.905 μs (0.00%)
01 1.162 μs (0.00%) 1.164 μs (0.00%) 1.187 μs (0.00%) 2.830 μs (0.00%)
10 868.431 ns (0.00%) 870.000 ns (0.00%) 884.282 ns (0.00%) 1.329 μs (0.00%)
11 868.935 ns (0.00%) 894.696 ns (0.00%) 923.376 ns (0.00%) 1.768 μs (0.00%)
fast 868.789 ns (0.00%) 870.386 ns (0.00%) 895.572 ns (0.00%) 1.801 μs (0.00%)

And for non-bitstype I get

for n in [1, 10, 100, 1000]
    bench([A(i) for i in 1:n])
end

n=1:

eq min (GC) median (GC) mean (GC) max (GC)
== 42.595 ns (0.00%) 43.296 ns (0.00%) 46.102 ns (4.14%) 1.569 μs (95.55%)
00 41.592 ns (0.00%) 42.390 ns (0.00%) 45.328 ns (4.24%) 1.591 μs (95.95%)
01 36.266 ns (0.00%) 36.728 ns (0.00%) 37.230 ns (0.00%) 77.842 ns (0.00%)
10 39.514 ns (0.00%) 41.865 ns (0.00%) 44.119 ns (4.41%) 1.648 μs (95.56%)
11 32.403 ns (0.00%) 34.176 ns (0.00%) 34.544 ns (0.00%) 78.238 ns (0.00%)
fast 35.029 ns (0.00%) 37.749 ns (0.00%) 37.565 ns (0.00%) 72.412 ns (0.00%)

n=10:

eq min (GC) median (GC) mean (GC) max (GC)
== 372.005 ns (0.00%) 375.915 ns (0.00%) 403.080 ns (5.08%) 7.928 μs (92.97%)
00 350.661 ns (0.00%) 363.692 ns (0.00%) 390.254 ns (5.28%) 7.785 μs (94.13%)
01 302.255 ns (0.00%) 311.359 ns (0.00%) 311.790 ns (0.00%) 457.876 ns (0.00%)
10 345.910 ns (0.00%) 349.608 ns (0.00%) 379.283 ns (5.59%) 7.838 μs (90.72%)
11 300.080 ns (0.00%) 305.568 ns (0.00%) 311.262 ns (0.00%) 557.504 ns (0.00%)
fast 276.544 ns (0.00%) 281.350 ns (0.00%) 289.757 ns (0.00%) 502.327 ns (0.00%)

n=100:

eq min (GC) median (GC) mean (GC) max (GC)
== 3.317 μs (0.00%) 3.388 μs (0.00%) 3.698 μs (5.81%) 206.631 μs (96.86%)
00 3.189 μs (0.00%) 3.248 μs (0.00%) 3.561 μs (6.05%) 205.103 μs (96.81%)
01 2.584 μs (0.00%) 2.623 μs (0.00%) 2.739 μs (0.00%) 6.641 μs (0.00%)
10 3.224 μs (0.00%) 3.448 μs (0.00%) 3.793 μs (6.57%) 392.925 μs (97.98%)
11 2.902 μs (0.00%) 2.928 μs (0.00%) 2.951 μs (0.00%) 4.961 μs (0.00%)
fast 2.674 μs (0.00%) 2.716 μs (0.00%) 2.733 μs (0.00%) 4.440 μs (0.00%)

n=1000:

eq min (GC) median (GC) mean (GC) max (GC)
== 32.989 μs (0.00%) 33.487 μs (0.00%) 36.599 μs (6.12%) 2.060 ms (96.45%)
00 31.869 μs (0.00%) 33.983 μs (0.00%) 36.519 μs (6.10%) 1.762 ms (96.61%)
01 25.753 μs (0.00%) 26.009 μs (0.00%) 26.308 μs (0.00%) 44.678 μs (0.00%)
10 32.210 μs (0.00%) 32.771 μs (0.00%) 35.745 μs (6.19%) 1.676 ms (96.65%)
11 28.866 μs (0.00%) 29.174 μs (0.00%) 29.260 μs (0.00%) 88.830 μs (0.00%)
fast 26.699 μs (0.00%) 27.085 μs (0.00%) 27.217 μs (0.00%) 69.038 μs (0.00%)

We see that there is some GC time because of the zip since it is not present in 01, 11 and fast and this is exactly the tests without GC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants