eigen is 20% slower on 1.8-beta and nightly than 1.7 release #915
Bisected here (700 ms vs 400 ms) to: JuliaLang/julia#42442
(@ViralBShah)
BTW, I just tried on 1.7.2 and master, and I get the same minimum execution time (on my 2017 MacBook Pro, which is basically a Core i5). I'll try it on a few different machines. @Roger-luo do you have any environment variables set? That PR should really not have much to do with this, but you never know!
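For what it's worth, a quick way to check the usual thread-related environment variables from within Julia (this list of names is just the common suspects, not exhaustive):
julia> for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS", "JULIA_NUM_THREADS")
           println(var, " => ", get(ENV, var, "<unset>"))
       end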
I can repro. 684 ms on 1.7 and 915 ms on nightly. Maybe you have some environment variables set :P.
That PR of mine just removes some environment variables that reduce the thread count. Can you recover the performance if you set them?
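One way to test that (assuming OPENBLAS_NUM_THREADS is the variable in question, which is a guess, not a statement about what the PR touched) is to launch a fresh Julia process with it set and see what OpenBLAS picks up:
julia> cmd = addenv(`$(Base.julia_cmd()) -e 'using LinearAlgebra; @show BLAS.get_num_threads()'`, "OPENBLAS_NUM_THREADS" => "8");
julia> run(cmd);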
I recompiled with a flag that can enable/disable that env var, but now I can't really see a difference with it set or not, not even vs. 1.7.
I do notice that on master, if we reduce the number of OpenBLAS threads (to the 8 that we used to use before), OpenBLAS does get faster. Using fewer (4 threads) takes longer, and the default setting is really slow. The issue only shows up on Linux, where OpenBLAS picks the total number of hyperthreads as the number of available threads (#43692). Also, on this 32-core system, using 8-16 threads gives the best performance on this problem and everything else is slower. I thought MKL might be better at picking the number of physical cores when setting threads, but I see similar performance issues with MKL as well at the default thread count (though this is an AMD Epyc system), and get the best performance at 8-16 threads. Cc @chriselrod
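A rough sketch of that kind of thread-count sweep (the candidate counts are arbitrary; @belapsed is from BenchmarkTools):
julia> using LinearAlgebra, BenchmarkTools
julia> M = rand(1_000, 1_000);
julia> for n in (4, 8, 16, 32, Sys.CPU_THREADS)
           BLAS.set_num_threads(n)
           t = @belapsed eigen($M)
           println(n, " threads: ", round(t; digits = 3), " s")
       end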
The AMD Epyc system probably has scaling issues beyond 8 threads, given that it has split L3 caches. I do, however, see the same performance on an Intel system with MKL whether I use 8 threads or 14; OpenBLAS also performs about the same whether I use 8 threads or 14 on the Intel system. So maybe for a 1000x1000 matrix, they're at the point of approximately zero return on extra threads (but also not being pessimized by extra threads). I'm currently having trouble getting onto deepsea.
EDIT:
julia> using MKL
julia> M = rand(1_000, 1_000);
julia> BLAS.get_num_threads()
64
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libmkl_rt.so
julia> @benchmark eigen($M)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.269 s … 1.365 s ┊ GC (min … max): 0.00% … 0.03%
Time (median): 1.314 s ┊ GC (median): 0.02%
Time (mean ± σ): 1.315 s ± 46.606 ms ┊ GC (mean ± σ): 0.02% ± 0.02%
█ █ █ █
█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.27 s Histogram: frequency by time 1.37 s <
Memory estimate: 33.53 MiB, allocs estimate: 21.
julia> BLAS.set_num_threads(16)
julia> @benchmark eigen($M)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
Range (min … max): 1.015 s … 1.476 s ┊ GC (min … max): 0.00% … 0.05%
Time (median): 1.108 s ┊ GC (median): 0.05%
Time (mean ± σ): 1.216 s ± 224.492 ms ┊ GC (mean ± σ): 0.03% ± 0.03%
█ █ █ █ █
█▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
1.02 s Histogram: frequency by time 1.48 s <
Memory estimate: 33.53 MiB, allocs estimate: 21.
julia> BLAS.set_num_threads(8)
julia> @benchmark eigen($M)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.235 s … 1.356 s ┊ GC (min … max): 0.00% … 0.04%
Time (median): 1.279 s ┊ GC (median): 0.02%
Time (mean ± σ): 1.287 s ± 61.936 ms ┊ GC (mean ± σ): 0.11% ± 0.18%
█ ▁ ▁
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.23 s Histogram: frequency by time 1.36 s <
Memory estimate: 33.53 MiB, allocs estimate: 21.
julia> versioninfo()
Julia Version 1.9.0-DEV.167
Commit f5d15571b3 (2022-03-11 17:10 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD EPYC 7513 32-Core Processor
Still a lot worse than a 14 core Intel system:
julia> using MKL
julia> BLAS.get_num_threads()
14
julia> M = rand(1000, 1000);
julia> @benchmark eigen($M)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 421.436 ms … 429.722 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 426.406 ms ┊ GC (median): 0.04%
Time (mean ± σ): 425.797 ms ± 2.595 ms ┊ GC (mean ± σ): 0.18% ± 0.36%
▁ ▁ ▁ ▁ ▁ ▁█ ▁ ▁ ▁ ▁
█▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁██▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁█ ▁
421 ms Histogram: frequency by time 430 ms <
Memory estimate: 33.53 MiB, allocs estimate: 21.
julia> @benchmark eigen($M)
BenchmarkTools.Trial: 13 samples with 1 evaluation.
Range (min … max): 409.848 ms … 422.491 ms ┊ GC (min … max): 0.00% … 0.11%
Time (median): 413.781 ms ┊ GC (median): 0.00%
Time (mean ± σ): 414.762 ms ± 3.597 ms ┊ GC (mean ± σ): 0.10% ± 0.21%
█ █ █ █ █ ██ ██ █ █ █ █
█▁▁▁▁█▁▁▁█▁▁▁█▁█▁██▁▁▁▁▁██▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
410 ms Histogram: frequency by time 422 ms <
Memory estimate: 33.53 MiB, allocs estimate: 21.
julia> versioninfo()
Julia Version 1.9.0-DEV.151
Commit b7b46afcf2* (2022-03-07 22:01 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
Threads: 4 on 28 virtual cores
Although this is roughly in line with AVX512 vs AVX2.
I have a suggestion: create one or more benchmarks, cache the optimal set_num_threads value for the local machine for various cases, and use those numbers by default. I use this strategy at runtime in C++ for both CPU and GPU, and it works reasonably well.
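A minimal sketch of that idea in Julia (the function name and candidate list are made up for illustration; a real version would persist the chosen value and key it on the machine and problem size):
julia> using LinearAlgebra, BenchmarkTools
julia> function tune_blas_threads(n = 1_000; candidates = (1, 2, 4, 8, 16, 32))
           M = rand(n, n)
           best, best_time = first(candidates), Inf
           for c in candidates
               c > Sys.CPU_THREADS && continue   # skip counts the machine doesn't have
               BLAS.set_num_threads(c)
               t = @belapsed eigen($M)
               if t < best_time
                   best, best_time = c, t
               end
           end
           BLAS.set_num_threads(best)   # leave the best setting active; caching it to disk is omitted here
           return best
       end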
Maybe the right thing is to let you set any number explicitly via JULIA_NUM_THREADS, but if you don't specify one, cap at 8 by default like we did before? That way you can at least ask for 16 cores via the env var if you want to.
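Illustratively, that policy might look something like this (the helper name is made up and this is not Base's actual logic; OPENBLAS_NUM_THREADS is used as the explicit override purely for illustration):
julia> using LinearAlgebra
julia> function default_blas_threads(cap = 8)
           requested = get(ENV, "OPENBLAS_NUM_THREADS", nothing)   # assumed override variable
           requested === nothing ? min(Sys.CPU_THREADS, cap) : parse(Int, requested)
       end;
julia> BLAS.set_num_threads(default_blas_threads());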
I'm not sure what has changed, but it seems eigen has a performance regression; I also didn't find any other issue mentioning this.
on 1.7.2
on 1.8-beta1
on master branch
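For reference, a minimal way to run the measurement discussed in this thread (a 1000x1000 real matrix, as in the benchmarks above):
julia> using LinearAlgebra, BenchmarkTools
julia> M = rand(1_000, 1_000);
julia> @benchmark eigen($M)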