Bug in mkl_avx2.1.dll after a partial QR decomposition from LowRankApprox.jl ? #75

Closed
BambOoxX opened this issue Apr 20, 2021 · 10 comments

BambOoxX commented Apr 20, 2021

Note: related to this Discourse thread and JuliaLinearAlgebra/LowRankApprox.jl#35

After performing a partial QR decomposition with LowRankApprox, Julia crashes with the error below. This is on Julia 1.5.4 with MKL.jl installed.
[Screenshot: 2021-04-19 18_03_33-Window]

The error disappears with julia 1.6.0 using OpenBLAS.

I could not extract an MWE showing the problem from my original code, but the error seems to show up as soon as I try to access the factors of the partial QR.

I am opening this issue hoping for some pointers, as the problem does not seem to stem from LowRankApprox and is maybe a bit too specific for the Discourse forum.

Note, however, that the conventional QR works fine with otherwise identical code.

jarlebring commented Apr 23, 2021

If you provide an MWE you are more likely to get help. Have you ruled out differences between Julia versions? Here is a starting point for tracking down the problem.

If it happens during Q-access, you may want to first trace the code:

julia> using LinearAlgebra
julia> QRfact = qr(randn(5,4));
julia> @which(QRfact.Q)
getproperty(F::LinearAlgebra.QRCompactWY, d::Symbol) in LinearAlgebra at /home/jarl/archive/src/julia_latest_1.7/julia-29d5158d27/share/julia/stdlib/v1.7/LinearAlgebra/src/qr.jl:432

and then https://github.com/JuliaLang/julia/blob/fc02458492c60f6527245c6991f729c2a986f666/stdlib/LinearAlgebra/src/qr.jl#L437

Based on the stack trace, you may also want to look for where ztpsv (triangular system solve) is used. If the problem is due to memory allocation or deallocation (which is likely, given the "access violation" message), its root cause is probably earlier than that call.

BambOoxX commented Apr 23, 2021

Well, thanks for the attention, but

  1. As indicated, while I certainly tried my best, I could not isolate an MWE from my original code.
  2. The problem does not show up when using a standard QR decomposition, so I fail to see how studying the behavior with qr will help.

After further research, and though I cannot be entirely sure, this issue may be caused by an uncaught LAPACK error, so it is probably not specific to MKL.
This issue was posted here because the only obvious error message is related to MKL.

jarlebring commented Apr 24, 2021

You need to provide more details in any case. Exactly which Julia function call generates this error message? What happens when you switch Julia version? What do you mean by conventional QR? The Julia function qr makes a BLAS call, so it also ends up in MKL.

In my experience this package works best on julia 1.7 - so nightly build.
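When comparing behavior across Julia versions and BLAS backends, it can help to record exactly which library is loaded. A minimal sketch, assuming only the LinearAlgebra standard library: `blas_report` is a hypothetical helper name, `BLAS.get_config` is available from Julia 1.7 on, and older versions expose `BLAS.vendor` instead.

```julia
using LinearAlgebra

# Hypothetical helper: report the Julia version and the BLAS/LAPACK backend in use.
function blas_report()
    backend = if isdefined(BLAS, :get_config)
        string(BLAS.get_config())   # Julia >= 1.7 (libblastrampoline)
    else
        string(BLAS.vendor())       # Julia <= 1.6, e.g. openblas64 or mkl
    end
    return (julia = VERSION, backend = backend)
end

println(blas_report())
```

Pasting this output next to a crash report makes it clear whether MKL or OpenBLAS was active in the failing session.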

@jarlebring

Considering the error message is in ztpsv, I would start looking at ldiv!-calls with triangular matrices in LowRankApprox.jl assuming you get the error when calling a function from that package.
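For reference, this is the kind of call to search for: a triangular solve through `ldiv!`. The sketch below is only an illustration using the LinearAlgebra standard library; `checked_tri_solve` is a hypothetical helper, not part of LowRankApprox.jl, and which BLAS/LAPACK routine such a call ends up in depends on the Julia version and the matrix storage. Validating the inputs before the call is one way to tell bad arguments apart from a fault inside the library.

```julia
using LinearAlgebra

# Hypothetical guard: validate the inputs of a triangular solve before
# handing them to BLAS/LAPACK.
function checked_tri_solve(R::UpperTriangular, b::AbstractVector)
    size(R, 1) == length(b) ||
        throw(DimensionMismatch("R is $(size(R)), b has length $(length(b))"))
    all(isfinite, R) || error("R contains NaN or Inf")
    all(isfinite, b) || error("b contains NaN or Inf")
    return ldiv!(R, copy(b))   # solves R * x = b without overwriting b
end

R = UpperTriangular(randn(ComplexF64, 4, 4) + 4I)
b = randn(ComplexF64, 4)
x = checked_tri_solve(R, b)
println(norm(R * x - b))   # residual of the solve
```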

@BambOoxX

> You need to provide more details in any case. What julia-function call exactly is generating this error message? What happens when you switch julia version? What do you mean by conventional QR? The julia function qr makes a blas call, so it also ends up in MKL.
>
> In my experience this package works best on julia 1.7 - so nightly build.

I am sorry to repeat myself, but as I said, I could not isolate this issue more precisely. When it appears, I only get about 0.5 seconds to react: the terminal prints only this message and then closes. I will not invent information I do not have.
Running different versions does not change the issue either, whether with 1.5.4, 1.6.0 or 1.6.1.
"Conventional QR" is meant to distinguish it from the "partial QR" exposed in LowRankApprox.

jarlebring commented May 20, 2021

That's a tough debugging environment. What command do you use to access the factors? I don't see why that would not be available even in your restricted setting. You seem to have been able to avoid the crash by replacing some code with "conventional QR". Which code did you replace?

Edit: I now see that you have provided more info on that in the Discourse forum...

@jarlebring

I know you said you tried your best to zoom in on the problem. This is how I would proceed. I would edit ~/.julia/packages/LowRankApprox/2wpw4/src/pqr.jl and add code to write a subset of parameters to file before and after every LAPACK-call.
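A minimal sketch of that logging idea, using the Serialization standard library. `logged_call` and the file names are hypothetical, and `geqp3!` (LAPACK's column-pivoted QR) is used only as a stand-in; which LAPACK routines pqr.jl actually calls is an assumption to check against the source.

```julia
using LinearAlgebra
using Serialization

# Hypothetical helper: save the arguments to disk, run the call, then save the result.
# If the process dies inside `f`, the "_before" file survives and holds the failing input.
function logged_call(f, name::AbstractString, args...; dir::AbstractString = pwd())
    serialize(joinpath(dir, name * "_before.jls"), args)
    result = f(args...)
    serialize(joinpath(dir, name * "_after.jls"), result)
    return result
end

logdir = mktempdir()
A = randn(5, 4)
# geqp3! overwrites its argument, so a copy is passed in.
Afact, tau, jpvt = logged_call(LinearAlgebra.LAPACK.geqp3!, "geqp3", copy(A); dir = logdir)

# The saved input can be reloaded in a fresh session to replay the call.
saved_input = deserialize(joinpath(logdir, "geqp3_before.jls"))[1]
println(saved_input == A)
```

Because each file is written and closed before the next step runs, the last "_before" file without a matching "_after" file points at the call that crashed, and its contents can serve as the starting point for an MWE.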

@ViralBShah

This package just replaces Julia's OpenBLAS with MKL under the hood. We have no way to address any issues in MKL ourselves; those need to be reported directly to Intel.

Perhaps some of this discussion and troubleshooting is better suited for Discourse.

@BambOoxX

@ViralBShah yeah, sorry for that. I was a bit lost with this issue, and I opened it during my first days with Julia. In the end, and with some more insight into it, it seems it is not specific to MKL, so it is better to close this issue. Thanks for the information.

@ViralBShah

No worries at all!
