Segmentation fault: 11 #17043
Comments
@tranvanhoa533 I'm facing the same problem; did you solve it?
You can build MXNet from source with debug info enabled. Then the backtrace above will be more meaningful.
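For reference, a from-source debug build can be sketched roughly as follows. The exact CMake option names vary between MXNet versions, so treat the flags below as assumptions and check the install docs for your checkout:

```shell
# Sketch only: clone and build MXNet with debug symbols.
# Flag names are assumptions; consult the build docs for your version.
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j"$(nproc)"
```

With `-DCMAKE_BUILD_TYPE=Debug`, symbols and line information survive into the shared library, so a gdb backtrace points at source lines instead of raw addresses.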
Seeing another segfault here:
This doesn't seem to happen when I don't link with the LLVM OpenMP in 3rdparty. It happens approximately 50% of the time with this test.
Looks like it might be related to mixing OpenMP implementations, even though there were several recent fixes for related issues such as #14979.
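A quick way to check whether a process has actually loaded more than one OpenMP runtime (the mixing suspected above) is to scan its `/proc/<pid>/maps` for the well-known library names. This is a generic diagnostic sketch, not code from MXNet, and the sample maps text is made up for illustration:

```python
import re

def find_openmp_runtimes(maps_text):
    """Return the set of OpenMP runtime names found in a /proc/<pid>/maps
    listing: libgomp (GNU), libomp (LLVM), libiomp (Intel).
    More than one entry suggests mixed OpenMP runtimes in the process."""
    pattern = re.compile(r"(libgomp|libomp|libiomp)[^/\s]*\.so")
    return {m.group(1) for m in pattern.finditer(maps_text)}

# Made-up sample of two mapped libraries in one process.
sample = """
7f2a00000000-7f2a00100000 r-xp 00000000 08:01 123 /usr/lib/libgomp.so.1
7f2a00200000-7f2a00300000 r-xp 00000000 08:01 456 /opt/llvm/lib/libomp.so
"""
print(find_openmp_runtimes(sample))  # both GNU and LLVM runtimes present
```

On a live process you would read `open(f"/proc/{pid}/maps").read()` instead of the sample string; two different runtimes in the output is a red flag for exactly the kind of crash discussed here.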
@larroy the issue you report is due to an incompatibility between the jemalloc version used and LLVM OpenMP. Reproducer:
If you build with USE_JEMALLOC=OFF, it will work.
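Concretely, the workaround amounts to flipping one build option. The invocation below is a sketch; the CMake-style spelling of the flag is assumed from the comment above:

```shell
# Sketch: rebuild with jemalloc disabled to avoid the crash described above.
cmake -B build -DUSE_JEMALLOC=OFF
cmake --build build -j"$(nproc)"
```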
Thanks, should we add some kind of check to prevent this from happening? Do you have more info about this incompatibility?
For now we can't recommend compiling with jemalloc anyway. See the reasoning at #17324.
No. It's just an empirical observation.
Seeing the issue again in http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17751/4/pipeline. It's the same pipeline @szha reported as failing above. That pipeline runs the following build. The build log associated with the build used for the failing pipeline is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17751/4/pipeline/51, specifically http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-17751/runs/4/nodes/51/steps/294/log/?start=0. There are a couple of interesting points about this build and failure:
So I think we can conclude that the issue is not with jemalloc itself, but that there is an underlying MXNet bug, and building with jemalloc and OpenMP makes the bug much easier to reproduce.
Somebody needs a stack trace. This kind of thing always reminds me of the Pontiac that was allergic to vanilla ice cream.
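Short of a full native backtrace, one low-effort way to get at least a Python-level trace out of a segfaulting process is the standard library's `faulthandler` module. This is a general diagnostic sketch, not something from this thread:

```python
import faulthandler

# Once enabled, a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS)
# dumps the Python traceback of every thread to stderr before the
# process dies, which at least shows which call triggered the crash.
faulthandler.enable()

print(faulthandler.is_enabled())  # True
```

Equivalently, running the script with `python -X faulthandler train.py` (or `PYTHONFAULTHANDLER=1`) enables it without code changes. For the native frames, a gdb backtrace from a debug build is still needed.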
I solved this problem by using the mxnet-cuda90 version.
Description
I trained ArcFace with 8 GPUs and hit a segmentation fault after some iterations.
Error Message
To Reproduce
I used code from the insightface repo and ran train_parall.py with per-batch-size 50.
What have you tried to solve it?
I tried installing different MXNet versions (1.4.0, 1.4.1, 1.5.0, 1.5.1) via pip.
Environment