-
-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changing the default LU? #357
Comments
OpenBLAS looks drunk. |
openBLAS is just multithreading way too small. RLFU looks like a good option. Is the (pre)compile time impact reasonable? |
A Mac version: using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5
function luflop(m, n = m; innerflop = 2)
sum(1:min(m, n)) do k
invflop = 1
scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
updateflop = isempty((k + 1):n) ? 0 :
sum((k + 1):n) do j
isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
innerflop
end
end
invflop + scaleflop + updateflop
end
end
algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(), AppleAccelerateLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]
ns = 4:8:500
for i in 1:length(ns)
n = ns[i]
@info "$n × $n"
rng = MersenneTwister(123)
global A = rand(rng, n, n)
global b = rand(rng, n)
global u0= rand(rng, n)
for j in 1:length(algs)
bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
push!(res[j], luflop(n) / bt / 1e9)
end
end
using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)
p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p
savefig("lubench.png")
savefig("lubench.pdf")
|
It's interesting that RFLU does poorly on mac. Does it not know about the lower vectorization width? |
Ideally you should have a |
Running the Mac specific version from above comment julia +beta --startup-file=no --proj
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.10.0-beta1 (2023-07-25)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
(tmp) pkg> st
Status `~/.julia/dev/tmp/Project.toml`
[13e28ba4] AppleAccelerate v0.4.0
[6e4b80f9] BenchmarkTools v1.3.2
[7ed4a6bd] LinearSolve v2.5.0
[dde4c033] Metal v0.5.0
[91a5bcdd] Plots v1.38.17
[3d5dd08c] VectorizationBase v0.21.64
julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 8 × Apple M2
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 on 4 virtual cores
Environment:
JULIA_NUM_PRECOMPILE_TASKS = 8
JULIA_DEPOT_PATH = /Users/vp/.julia
JULIA_PKG_DEVDIR = /Users/vp/.julia/dev
[lubench.pdf](https://github.com/SciML/LinearSolve.jl/files/12 |
I wonder if that's a trend of Intel vs AMD. Need more data. |
Here is:
Might make sense to remove the two lowest performing methods here to speed up benchmarks given they probably won't be chosen? |
yeah maybe though it doesn't change the plot much and confirms at what point the BLAS matters |
I'm pretty sure it's number of threads. Openblas generally is pretty bad at ramping up the number of threads as size increases and often just goes straight from 1 to full multi-threading. As such on CPUs with lots of cores it performs incredibly badly in the region where it should be using 2-4 cores and is instead using 16. |
But that gives no explanation to what was actually mentioned, which has no OpenBLAS but is RecursiveFactorization vs MKL and where the cutoff is. From the looks so far, I'd say:
and never doing OpenBLAS. |
When I tried to run this the code errored on the 348 x 348 problem size (twice). Not sure what's up with that since I made a clean tmp environment. Could just be me but thought I'd share.
|
that's interesting. There are now two of us that have a fairly noticeable performance bump in the 350 to 380 region for RFLU. Any ideas as to what could be causing it? It looks like we aren't using more threads soon enough maybe? |
Considering mine crashes in the RFLU at 350, something definitely going on there. |
Multithreaded and singlethreaded are going to look very different. MKL does very well multithreaded, while OpenBLAS is awful (as has been discussed). RF does not scale well with multiple threads. That is a known issue, but Yingbo and I never had time to address it. |
@ejmeitz, you aren't doing something weird like not using precompiled modules, are you? |
When I added the packages it spammed the message below. I started from a clean env so I just kind of ignored the messages, but that is probably the issue. I did run
|
Nuke your precompile cache and try again. $ rm -rf ~/.julia/compiled/ |
That fixed it, thanks! RFLU seems better on my machine for longer than some of the others.
|
@ejmeitz your GFLOPs are generally very low, getting stomped by much cheaper CPUs. Maybe it's the clock rate? |
Likely, because of the poor multithreaded scaling. The big winner here is the Ryzen 7800X3D. It probably wants more multithreading to kick in at a much smaller size. |
I noticed that too. I also thought it would be the clocks (max turbo is 4 GHz) but it still felt low to me. Probably a combo of poor multithread scaling and the clocks being low. I can run it on 128 core AMD CPU if you'd be curious to see that data. |
Definitely curious now. |
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 16 on 16 virtual cores |
Using single julia thread julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores Using julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 11 on 8 virtual cores |
single thread julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 8 on 8 virtual cores |
@ChrisRackauckas Where can I find out what each of the factorizations does? |
I think we may want @static if Sys.ARCH === :x86_64
const libMKL = MKL_jll.libmkl_rt # more convenient name
function mkl_set_num_threads(N::Integer)
ccall((:MKL_Set_Num_Threads,libMKL), Cvoid, (Int32,), N % Int32)
end
mkl_set_num_threads(Threads.nthreads())
end for single threaded comparisons. Single threaded performance where we actually restrict the solves to a single thread would likely be useful for ensemble solves. |
@ViralBShah just the standard docs https://docs.sciml.ai/LinearSolve/stable/solvers/solvers/. Accelerate isn't in there yet though. I'll PR with Metal.jl as an option too when I land. |
Note that the way to get/set MKL threads should be through the domain API: JuliaLinearAlgebra/libblastrampoline#74 LBT doesn't do that but should be easy to do it here. |
Okay, so we should use something like using MKL_jll
mkl_blas_set_num_threads(numthreads::Int) =
Bool(ccall((:MKL_Domain_Set_Num_Threads, MKL_jll.libmkl_rt),
Cuint, (Cint,Cint), numthreads, 1)) Or, more elaborately using MKL_jll
mkl_set_num_threads(numthreads::Int, domain::Cint = zero(Cint)) =
Bool(ccall((:MKL_Domain_Set_Num_Threads, MKL_jll.libmkl_rt),
Cuint, (Cint,Cint), numthreads, domain))
const MKL_DOMAIN_ALL = Cint(0)
const MKL_DOMAIN_BLAS = Cint(1)
const MKL_DOMAIN_FFT = Cint(2)
const MKL_DOMAIN_VML = Cint(3)
const MKL_DOMAIN_PARDISO = Cint(4)
mkl_set_num_blas_threads(numthreads) =
mkl_set_num_threads(numthreads, MKL_DOMAIN_BLAS)
mkl_set_num_fft_threads(numthreads) =
mkl_set_num_threads(numthreads, MKL_DOMAIN_FFT)
mkl_set_num_vml_threads(numthreads) =
mkl_set_num_threads(numthreads, MKL_DOMAIN_VML)
mkl_set_num_pardiso_threads(numthreads) =
mkl_set_num_threads(numthreads, MKL_DOMAIN_PARDISO) |
Note that RFLU did not benefit from multithreading until like 450x450. It was hurt below that. |
I setup Metal.jl: using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, Metal
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5
function luflop(m, n = m; innerflop = 2)
sum(1:min(m, n)) do k
invflop = 1
scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
updateflop = isempty((k + 1):n) ? 0 :
sum((k + 1):n) do j
isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
innerflop
end
end
invflop + scaleflop + updateflop
end
end
algs = [AppleAccelerateLUFactorization(), MetalLUFactorization()]
res = [Float32[] for i in 1:length(algs)]
ns = 200:600:15000
for i in 1:length(ns)
n = ns[i]
@info "$n × $n"
rng = MersenneTwister(123)
global A = rand(rng, Float32, n, n)
global b = rand(rng, Float32, n)
global u0= rand(rng, Float32, n)
for j in 1:length(algs)
bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
GC.gc()
push!(res[j], luflop(n) / bt / 1e9)
end
end
using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)
p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p
savefig("metal_large_lubench.png")
savefig("metal_large_lubench.pdf") |
Can I get some results of folks doing CUDA offloading with the following script? using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, CUDA, MKL_jll
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5
function luflop(m, n = m; innerflop = 2)
sum(1:min(m, n)) do k
invflop = 1
scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
updateflop = isempty((k + 1):n) ? 0 :
sum((k + 1):n) do j
isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
innerflop
end
end
invflop + scaleflop + updateflop
end
end
algs = [MKLLUFactorization(), CUDAOffloadFactorization()]
res = [Float32[] for i in 1:length(algs)]
ns = 200:400:10000
for i in 1:length(ns)
n = ns[i]
@info "$n × $n"
rng = MersenneTwister(123)
global A = rand(rng, Float32, n, n)
global b = rand(rng, Float32, n)
global u0= rand(rng, Float32, n)
for j in 1:length(algs)
bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
push!(res[j], luflop(n) / bt / 1e9)
end
end
using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)
p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p
savefig("cudaoffloadlubench.png")
savefig("cudaoffloadlubench.pdf") |
julia> algs = [MKLLUFactorization(), CUDAOffloadFactorization()] (@v1.9) pkg> st CUDA Is this CUDA.jl from the repo tip? |
It should load when you do |
@joelandman @ChrisRackauckas There's a capitalization typo in the benchmark it should be |
[ Info: 200 × 200
ERROR: MethodError: no method matching getrf!(::Matrix{Float32}; ipiv::Vector{Int64}, info::Base.RefValue{Int64})
Closest candidates are:
getrf!(::AbstractMatrix{<:Float64}; ipiv, info, check)
@ LinearSolveMKLExt ~/.julia/packages/LinearSolve/Tcmzb/ext/LinearSolveMKLExt.jl:13 |
I can run the snippet on A100 and A40. However, I get ERROR: LoadError: UndefVarError: `MKLLUFactorization` not defined
Stacktrace:
[1] top-level scope
@ /scratch/pc2-mitarbeiter/bauerc/playground/linearsolvetest/script.jl:21
[2] include(fname::String)
@ Base.MainInclude ./client.jl:478
[3] top-level scope
@ REPL[1]:1
in expression starting at /scratch/pc2-mitarbeiter/bauerc/playground/linearsolvetest/script.jl:21 Update: With LinearSolve#main I get the same error as @chriselrod above: julia> include("script.jl")
[ Info: 200 × 200
ERROR: LoadError: MethodError: no method matching getrf!(::Matrix{Float32}; ipiv::Vector{Int64}, info::Base.RefValue{Int64})
Closest candidates are:
getrf!(::AbstractMatrix{<:Float64}; ipiv, info, check)
@ LinearSolveMKLExt /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:13
Stacktrace:
[1] #solve!#2
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:45 [inlined]
[2] solve!
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:39 [inlined]
[3] #solve!#6
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:197 [inlined]
[4] solve!
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:196 [inlined]
[5] #solve#5
@ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:193 [inlined]
[6] solve
[...] |
I threw together a quick patch for LinearSolverMKLExt.jl to accomodate the Float32 version. @ChrisRackauckas please let me know if you want a PR or a patch for it. Results incoming (running now) |
I put an MKL 32-bit patch into the MKL PR #361. I noticed that it's not using the MKL backsolve so that could potentially make that a bit faster, but it shouldn't effect the CUDA cutoff point. |
joe@zap:~ $ nvidia-smi +-----------------------------------------------------------------------------+ |
Same zen2 laptop julia> versioninfo() |
RTX 3090 + 5950xjulia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 1 on 32 virtual cores
A6000 ADA + EPYC 7713julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 256 × AMD EPYC 7713 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 1 on 256 virtual cores
|
i9-13900K, RTX 4090Versioninfo: Details``` julia> versioninfo() Julia Version 1.9.2 Commit e4ee485e909 (2023-07-05 09:39 UTC) Platform Info: OS: Linux (x86_64-linux-gnu) CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-14.0.6 (ORCJIT, goldmont) Threads: 1 on 32 virtual cores ``` ``` julia> CUDA.versioninfo() CUDA runtime 12.1, artifact installation CUDA driver 12.2 NVIDIA driver 535.86.5 CUDA libraries:
Julia packages:
Toolchain:
1 device:
|
Another 8% improvement in RF coming up: |
It looks like GPU offloading doesn't make sense once things are using MKL. Cutoff is >1000 |
For what it's worth, a similar system to the given example but a generation older and apparently running fewer threads (16 vs 32).
|
Thanks everyone, the defaults take these into account. Now MKL is defaulted to in many scenarios (along with AppleAccelerate on macs), with a switch at 200 which seems to be a roughly optimal spot to go from RFLU to MKL |
This is a thread for investigating changes to the LU defaults, based off of benchmarks like #356 .
(Note: there's a Mac-specific version 3 posts down)
lubench.pdf
The justification for RecursiveFactorization.jl still looks very strong from the looks of this.
Needs examples on other systems.
The text was updated successfully, but these errors were encountered: