
eigen is 20% slower on 1.8-beta and nightly than 1.7 release #915

Closed
Roger-luo opened this issue Mar 4, 2022 · 10 comments · Fixed by JuliaLang/julia#45412
Labels
performance (Must go faster) · regression (Regression in behavior compared to a previous version)

Comments

@Roger-luo (Contributor)

I'm not sure what has changed, but eigen seems to have a performance regression; I also didn't find any other issue mentioning this.

on 1.7.2

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

julia> using LinearAlgebra, BenchmarkTools

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  767.383 ms … 783.423 ms  ┊ GC (min … max): 0.00% … 0.04%
 Time  (median):     774.151 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   774.743 ms ±   5.491 ms  ┊ GC (mean ± σ):  0.02% ± 0.02%

  █           █    █       █    █                █            █
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  767 ms           Histogram: frequency by time          783 ms <

 Memory estimate: 31.58 MiB, allocs estimate: 21.

on 1.8-beta1

julia> versioninfo()
Julia Version 1.8.0-beta1
Commit 7b711ce699 (2022-02-23 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 24 virtual cores

julia> using LinearAlgebra, BenchmarkTools

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.067 s …  1.109 s  ┊ GC (min … max): 0.04% … 0.04%
 Time  (median):     1.095 s              ┊ GC (median):    0.04%
 Time  (mean ± σ):   1.089 s ± 17.117 ms  ┊ GC (mean ± σ):  0.02% ± 0.02%

  █         █                           █   █             █
  █▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.07 s         Histogram: frequency by time        1.11 s <

 Memory estimate: 31.58 MiB, allocs estimate: 21.

on master branch

julia> versioninfo()
Julia Version 1.9.0-DEV.118
Commit 15b5df4633 (2022-03-02 18:30 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 24 virtual cores

julia> using LinearAlgebra, BenchmarkTools

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.126 s …  1.164 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.134 s              ┊ GC (median):    0.04%
 Time  (mean ± σ):   1.139 s ± 15.033 ms  ┊ GC (mean ± σ):  0.03% ± 0.02%

  █     █     █          █                                █
  █▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.13 s         Histogram: frequency by time        1.16 s <

 Memory estimate: 31.58 MiB, allocs estimate: 21.
@Roger-luo changed the title to eigen is 20% slower on 1.8-beta and nightly than 1.7 release Mar 4, 2022
@dkarrasch dkarrasch added the regression Regression in behavior compared to a previous version label Mar 7, 2022
@maleadt (Member) commented Mar 8, 2022

Bisected here (700 ms vs 400 ms) to: JuliaLang/julia#42442

65eff6da537b3eb0af2759ed678bd2aaefcb314f is the first bad commit
commit 65eff6da537b3eb0af2759ed678bd2aaefcb314f
Author: Viral B. Shah <[email protected]>
Date:   Sat Oct 2 07:48:39 2021 -0400

    Remove openblas set_num_threads in julia __init__ (#42442)

    * Remove openblas_set_num_threads in julia __init__

    * Remove test no longer needed.

@KristofferC (Member)

(@ViralBShah)

@ViralBShah ViralBShah added the performance Must go faster label Mar 10, 2022
@ViralBShah (Member)

BTW, I just tried on 1.7.2 and master, and I get the same minimum execution time (on my 2017 MacBook Pro, which is basically a Core i5). I'll try it on a few different machines.

@Roger-luo do you have any environment variables set? That PR really shouldn't have much to do with this, but you never know!

@KristofferC (Member) commented Mar 10, 2022

I can repro. 684 ms on 1.7 and 915 ms on nightly. Maybe you have some environment variables set :P.

@ViralBShah (Member)

That PR of mine just removes some environment variables that reduce the thread count. Can you recover the performance if you set them?
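A quick way to test that hypothesis is a sketch like the following: time eigen once at the default thread count and once with it pinned. BLAS.set_num_threads/BLAS.get_num_threads are the standard LinearAlgebra API; the OPENBLAS_NUM_THREADS environment variable would have to be set before Julia starts, so it isn't used here.

```julia
# Sketch: compare eigen timings at the default vs. a pinned OpenBLAS
# thread count. BLAS.set_num_threads is the LinearAlgebra API for this;
# OPENBLAS_NUM_THREADS would have to be exported before Julia starts.
using LinearAlgebra

M = rand(1000, 1000)

default_n = BLAS.get_num_threads()
eigen(M)                           # warm up / compile first
t_default = @elapsed eigen(M)

BLAS.set_num_threads(8)            # the cap Julia applied before #42442
t_pinned = @elapsed eigen(M)

println("default ($default_n threads): $(round(t_default; digits = 3)) s")
println("pinned  (8 threads): $(round(t_pinned; digits = 3)) s")
```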

@KristofferC (Member)

I recompiled with a flag that can enable/disable that env var, but now I can't really see a difference with it set or not, not even vs. 1.7.

@ViralBShah (Member) commented Mar 11, 2022

I do notice that on master, if we reduce the number of OpenBLAS threads (to the 8 we used to set before), OpenBLAS does get faster. Using fewer (4 threads) takes longer, and the default setting is really slow. The issue occurs only on Linux, where OpenBLAS picks the total number of hyperthreads as the number of available threads (#43692). Also, on this 32-core system, 8-16 threads gives the best performance on this problem and everything else is slower.

I thought MKL might be better at picking the number of physical cores when setting threads, but I see similar performance issues with MKL as well (though this is an AMD Epyc system) at the default number of threads, and get the best performance at 8-16 threads.

Cc @chriselrod

@chriselrod (Contributor) commented Mar 11, 2022

The AMD Epyc system probably has scaling issues beyond 8 threads, given that it has split L3 caches.

I do, however, see the same performance on an Intel system with MKL whether using BLAS.set_num_threads(8) or BLAS.set_num_threads(14).
Unfortunately, I don't know how many threads MKL is actually using.
I suspect it is using around 8. htop shows 1400% CPU usage even for a 100x100 matrix; I suspect all the extra threads are busy waiting in a spin lock.

OpenBLAS also performs about the same whether I use 8 threads or 14 on the Intel system. So maybe for a 1000x1000 matrix, they're at the point of approximately zero return on extra threads (but also not being pessimized by extra threads).

I'm currently having trouble getting onto deepsea.

EDIT:
I do get better performance from 16 threads than 8 on the Epyc:

julia> using MKL

julia> M = rand(1_000, 1_000);

julia> BLAS.get_num_threads()
64

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libmkl_rt.so

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.269 s …  1.365 s  ┊ GC (min … max): 0.00% … 0.03%
 Time  (median):     1.314 s              ┊ GC (median):    0.02%
 Time  (mean ± σ):   1.315 s ± 46.606 ms  ┊ GC (mean ± σ):  0.02% ± 0.02%

  █       █                                   █           █
  █▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.27 s         Histogram: frequency by time        1.37 s <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> BLAS.set_num_threads(16)

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.015 s …   1.476 s  ┊ GC (min … max): 0.00% … 0.05%
 Time  (median):     1.108 s               ┊ GC (median):    0.05%
 Time  (mean ± σ):   1.216 s ± 224.492 ms  ┊ GC (mean ± σ):  0.03% ± 0.03%

  █  █       █                                         █   █
  █▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
  1.02 s         Histogram: frequency by time         1.48 s <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> BLAS.set_num_threads(8)

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.235 s …  1.356 s  ┊ GC (min … max): 0.00% … 0.04%
 Time  (median):     1.279 s              ┊ GC (median):    0.02%
 Time  (mean ± σ):   1.287 s ± 61.936 ms  ┊ GC (mean ± σ):  0.11% ± 0.18%

  █                                        ▁              ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.23 s         Histogram: frequency by time        1.36 s <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> versioninfo()
Julia Version 1.9.0-DEV.167
Commit f5d15571b3 (2022-03-11 17:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor

Still a lot worse than on a 14-core Intel system:

julia> using MKL

julia> BLAS.get_num_threads()
14

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  421.436 ms … 429.722 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     426.406 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   425.797 ms ±   2.595 ms  ┊ GC (mean ± σ):  0.18% ± 0.36%

  ▁   ▁    ▁              ▁         ▁ ▁█ ▁     ▁         ▁    ▁
  █▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁██▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁█ ▁
  421 ms           Histogram: frequency by time          430 ms <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  409.848 ms … 422.491 ms  ┊ GC (min … max): 0.00% … 0.11%
 Time  (median):     413.781 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   414.762 ms ±   3.597 ms  ┊ GC (mean ± σ):  0.10% ± 0.21%

  █    █   █   █ █ ██     ██     █       █       █            █
  █▁▁▁▁█▁▁▁█▁▁▁█▁█▁██▁▁▁▁▁██▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  410 ms           Histogram: frequency by time          422 ms <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> versioninfo()
Julia Version 1.9.0-DEV.151
Commit b7b46afcf2* (2022-03-07 22:01 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 4 on 28 virtual cores

Although this is roughly in line with AVX512 vs AVX2.

@dmicsa commented May 9, 2022

I have a suggestion: create one or more benchmarks, cache the optimal set_num_threads result for the local machine for various cases, and use those numbers by default.

I use this strategy at runtime in C++ for both CPU and GPU, and it works reasonably well.
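In Julia, that calibrate-once strategy could look roughly like the sketch below. best_blas_threads is a hypothetical helper name, not an existing API, and a real implementation would persist the result per machine rather than recomputing it.

```julia
# Sketch of the calibrate-once idea: time a representative problem at a
# few candidate thread counts, then reuse the winner. `best_blas_threads`
# is a hypothetical name, not an existing API.
using LinearAlgebra

function best_blas_threads(n = 1000; candidates = (1, 2, 4, 8, 16))
    M = rand(n, n)
    saved = BLAS.get_num_threads()
    timings = map(candidates) do t
        BLAS.set_num_threads(t)
        eigen(M)                     # warm up at this thread count
        @elapsed eigen(M)
    end
    BLAS.set_num_threads(saved)      # restore the caller's setting
    candidates[argmin(collect(timings))]
end

# One-time calibration; the result could be cached to disk per machine.
BLAS.set_num_threads(best_blas_threads())
```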

@JeffBezanson (Member)

Maybe the right thing is to let you set any number explicitly as JULIA_NUM_THREADS, but if you don't specify one, cap at 8 by default like we did before? That way you can at least ask for 16 cores via the env var if you want to.
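Concretely, that default could look something like the sketch below. Two assumptions on my part: the cap applies to the OpenBLAS thread count chosen at startup (so the sketch reads OPENBLAS_NUM_THREADS rather than JULIA_NUM_THREADS), and default_blas_threads is a hypothetical name, not an existing function.

```julia
# Sketch of the proposed default: honor an explicit request, otherwise
# cap the startup BLAS thread count at 8 like pre-#42442 Julia did.
# `default_blas_threads` is a hypothetical name, not an existing API.
function default_blas_threads(env = ENV)
    if haskey(env, "OPENBLAS_NUM_THREADS")
        parse(Int, env["OPENBLAS_NUM_THREADS"])   # explicit setting wins
    else
        min(Sys.CPU_THREADS, 8)                   # conservative default
    end
end
```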

8 participants