
eigen is 20% slower on 1.8-beta and nightly than 1.7 release #915

Closed
Roger-luo opened this issue Mar 4, 2022 · 10 comments · Fixed by JuliaLang/julia#45412
Labels
performance (Must go faster) · regression (Regression in behavior compared to a previous version)

Comments

@Roger-luo (Contributor)

I'm not sure what has changed, but eigen seems to have a performance regression; I also didn't find any other issue mentioning this.

on 1.7.2

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

julia> using LinearAlgebra, BenchmarkTools

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  767.383 ms … 783.423 ms  ┊ GC (min … max): 0.00% … 0.04%
 Time  (median):     774.151 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   774.743 ms ±   5.491 ms  ┊ GC (mean ± σ):  0.02% ± 0.02%

  █           █    █       █    █                █            █
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  767 ms           Histogram: frequency by time          783 ms <

 Memory estimate: 31.58 MiB, allocs estimate: 21.

on 1.8-beta1

julia> versioninfo()
Julia Version 1.8.0-beta1
Commit 7b711ce699 (2022-02-23 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 24 virtual cores

julia> using LinearAlgebra, BenchmarkTools

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.067 s …  1.109 s  ┊ GC (min … max): 0.04% … 0.04%
 Time  (median):     1.095 s              ┊ GC (median):    0.04%
 Time  (mean ± σ):   1.089 s ± 17.117 ms  ┊ GC (mean ± σ):  0.02% ± 0.02%

  █         █                           █   █             █
  █▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.07 s         Histogram: frequency by time        1.11 s <

 Memory estimate: 31.58 MiB, allocs estimate: 21.

on master branch

julia> versioninfo()
Julia Version 1.9.0-DEV.118
Commit 15b5df4633 (2022-03-02 18:30 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 24 virtual cores

julia> using LinearAlgebra, BenchmarkTools

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.126 s …  1.164 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.134 s              ┊ GC (median):    0.04%
 Time  (mean ± σ):   1.139 s ± 15.033 ms  ┊ GC (mean ± σ):  0.03% ± 0.02%

  █     █     █          █                                █
  █▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.13 s         Histogram: frequency by time        1.16 s <

 Memory estimate: 31.58 MiB, allocs estimate: 21.
@Roger-luo changed the title to eigen is 20% slower on 1.8-beta and nightly than 1.7 release Mar 4, 2022
@dkarrasch dkarrasch added the regression Regression in behavior compared to a previous version label Mar 7, 2022
@maleadt (Member) commented Mar 8, 2022

Bisected here (700 ms vs 400 ms) to: JuliaLang/julia#42442

65eff6da537b3eb0af2759ed678bd2aaefcb314f is the first bad commit
commit 65eff6da537b3eb0af2759ed678bd2aaefcb314f
Author: Viral B. Shah <[email protected]>
Date:   Sat Oct 2 07:48:39 2021 -0400

    Remove openblas set_num_threads in julia __init__ (#42442)

    * Remove openblas_set_num_threads in julia __init__

    * Remove test no longer needed.

@KristofferC (Member)

(@ViralBShah)

@ViralBShah ViralBShah added the performance Must go faster label Mar 10, 2022
@ViralBShah (Member)

BTW, I just tried on 1.7.2 and master, and I get the same minimum execution time (on my 2017 MacBook Pro, which is basically a Core i5). I'll try it on a few different machines.

@Roger-luo do you have any environment variables set? That PR really shouldn't have much to do with this, but you never know!

@KristofferC (Member) commented Mar 10, 2022

I can repro. 684 ms on 1.7 and 915 ms on nightly. Maybe you have some environment variables set :P.

@ViralBShah (Member)

That PR of mine just removes some environment variables that reduce the thread count. Can you recover the performance if you set them?
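A quick way to test that hypothesis is a sketch like the following: time eigen once at the default thread count and once with it pinned. BLAS.set_num_threads/BLAS.get_num_threads are the standard LinearAlgebra API; the OPENBLAS_NUM_THREADS environment variable would have to be set before Julia starts, so it isn't used here.

```julia
# Sketch: compare eigen timings at the default vs. a pinned OpenBLAS
# thread count. BLAS.set_num_threads is the LinearAlgebra API for this;
# OPENBLAS_NUM_THREADS would have to be exported before Julia starts.
using LinearAlgebra

M = rand(1000, 1000)

default_n = BLAS.get_num_threads()
eigen(M)                           # warm up / compile first
t_default = @elapsed eigen(M)

BLAS.set_num_threads(8)            # the cap Julia applied before #42442
t_pinned = @elapsed eigen(M)

println("default ($default_n threads): $(round(t_default; digits = 3)) s")
println("pinned  (8 threads): $(round(t_pinned; digits = 3)) s")
```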

@KristofferC (Member)

I recompiled with a flag that can enable/disable that env var, but now I can't really see a difference with it set or not, not even vs. 1.7.

@ViralBShah (Member) commented Mar 11, 2022

I do notice that on master, if we reduce the number of OpenBLAS threads (to the 8 we used to set before), OpenBLAS does get faster. Using fewer (4 threads) takes longer, and the default setting is really slow. The issue occurs only on Linux, where OpenBLAS picks the total number of hyperthreads as the number of available threads (#43692). Also, on this 32-core system, 8-16 threads gives the best performance on this problem and everything else is slower.

I thought MKL might be better at picking the number of physical cores when setting threads, but I see similar performance issues with MKL as well (though this is an AMD Epyc system) at the default number of threads, and get the best performance at 8-16 threads.

Cc @chriselrod

@chriselrod (Contributor) commented Mar 11, 2022

The AMD Epyc system probably has scaling issues beyond 8 threads, given that it has split L3 caches.

I do, however, see the same performance on an Intel system with MKL whether using BLAS.set_num_threads(8) or BLAS.set_num_threads(14).
Unfortunately, I don't know how many threads MKL is actually using.
I suspect it is using around 8. htop shows 1400% CPU usage even for a 100x100 matrix; I suspect all the extra threads are busy waiting in a spin lock.

OpenBLAS also performs about the same whether I use 8 threads or 14 on the Intel system. So maybe for a 1000x1000 matrix, they're at the point of approximately zero return on extra threads (but also not being pessimized by extra threads).

I'm currently having trouble getting onto deepsea.

EDIT:
I do get better performance from 16 threads than 8 on the Epyc:

julia> using MKL

julia> M = rand(1_000, 1_000);

julia> BLAS.get_num_threads()
64

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libmkl_rt.so

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.269 s …  1.365 s  ┊ GC (min … max): 0.00% … 0.03%
 Time  (median):     1.314 s              ┊ GC (median):    0.02%
 Time  (mean ± σ):   1.315 s ± 46.606 ms  ┊ GC (mean ± σ):  0.02% ± 0.02%

  █       █                                   █           █
  █▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.27 s         Histogram: frequency by time        1.37 s <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> BLAS.set_num_threads(16)

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.015 s …   1.476 s  ┊ GC (min … max): 0.00% … 0.05%
 Time  (median):     1.108 s               ┊ GC (median):    0.05%
 Time  (mean ± σ):   1.216 s ± 224.492 ms  ┊ GC (mean ± σ):  0.03% ± 0.03%

  █  █       █                                         █   █
  █▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
  1.02 s         Histogram: frequency by time         1.48 s <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> BLAS.set_num_threads(8)

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.235 s …  1.356 s  ┊ GC (min … max): 0.00% … 0.04%
 Time  (median):     1.279 s              ┊ GC (median):    0.02%
 Time  (mean ± σ):   1.287 s ± 61.936 ms  ┊ GC (mean ± σ):  0.11% ± 0.18%

  █                                        ▁              ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.23 s         Histogram: frequency by time        1.36 s <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> versioninfo()
Julia Version 1.9.0-DEV.167
Commit f5d15571b3 (2022-03-11 17:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor

Still a lot worse than on a 14-core Intel system:

julia> using MKL

julia> BLAS.get_num_threads()
14

julia> M = rand(1000, 1000);

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  421.436 ms … 429.722 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     426.406 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   425.797 ms ±   2.595 ms  ┊ GC (mean ± σ):  0.18% ± 0.36%

  ▁   ▁    ▁              ▁         ▁ ▁█ ▁     ▁         ▁    ▁
  █▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁██▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁█ ▁
  421 ms           Histogram: frequency by time          430 ms <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> @benchmark eigen($M)
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  409.848 ms … 422.491 ms  ┊ GC (min … max): 0.00% … 0.11%
 Time  (median):     413.781 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   414.762 ms ±   3.597 ms  ┊ GC (mean ± σ):  0.10% ± 0.21%

  █    █   █   █ █ ██     ██     █       █       █            █
  █▁▁▁▁█▁▁▁█▁▁▁█▁█▁██▁▁▁▁▁██▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  410 ms           Histogram: frequency by time          422 ms <

 Memory estimate: 33.53 MiB, allocs estimate: 21.

julia> versioninfo()
Julia Version 1.9.0-DEV.151
Commit b7b46afcf2* (2022-03-07 22:01 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 4 on 28 virtual cores

Although this is roughly in line with AVX512 vs AVX2.

@dmicsa commented May 9, 2022

I have a suggestion: create one or more benchmarks, cache the optimal set_num_threads result for the local machine for various cases, and use those numbers by default.

I use this strategy at runtime in C++ for both CPU and GPU, and it works reasonably well.
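In Julia, that calibrate-once strategy could look roughly like the sketch below. best_blas_threads is a hypothetical helper name, not an existing API, and a real implementation would persist the result per machine rather than recomputing it.

```julia
# Sketch of the calibrate-once idea: time a representative problem at a
# few candidate thread counts, then reuse the winner. `best_blas_threads`
# is a hypothetical name, not an existing API.
using LinearAlgebra

function best_blas_threads(n = 1000; candidates = (1, 2, 4, 8, 16))
    M = rand(n, n)
    saved = BLAS.get_num_threads()
    timings = map(candidates) do t
        BLAS.set_num_threads(t)
        eigen(M)                     # warm up at this thread count
        @elapsed eigen(M)
    end
    BLAS.set_num_threads(saved)      # restore the caller's setting
    candidates[argmin(collect(timings))]
end

# One-time calibration; the result could be cached to disk per machine.
BLAS.set_num_threads(best_blas_threads())
```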

@JeffBezanson (Member)

Maybe the right thing is to let you set any number explicitly as JULIA_NUM_THREADS, but if you don't specify one, cap at 8 by default like we did before? That way you can at least ask for 16 cores via the env var if you want to.
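Concretely, that default could look something like the sketch below. Two assumptions on my part: the cap applies to the OpenBLAS thread count chosen at startup (so the sketch reads OPENBLAS_NUM_THREADS rather than JULIA_NUM_THREADS), and default_blas_threads is a hypothetical name, not an existing function.

```julia
# Sketch of the proposed default: honor an explicit request, otherwise
# cap the startup BLAS thread count at 8 like pre-#42442 Julia did.
# `default_blas_threads` is a hypothetical name, not an existing API.
function default_blas_threads(env = ENV)
    if haskey(env, "OPENBLAS_NUM_THREADS")
        parse(Int, env["OPENBLAS_NUM_THREADS"])   # explicit setting wins
    else
        min(Sys.CPU_THREADS, 8)                   # conservative default
    end
end
```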

8 participants