[BUG] With a package that has both MKL and GPU enabled, opening a new process from Python causes an error #14979
Hey, this is the MXNet Label Bot.
@fierceX I'm not sure and don't know why, but can you try the magic below? ;)
@TaoLv Yes, with that change there is no error, but this still looks like a bug. What is the specific reason?
@mxnet-label-bot add [question, MKL]
Looks like an OpenMP-related problem. Since the stack trace has libc in it, I suspect we are re-entering MXNet in pthread_atfork handlers due to the Python multiprocessing interaction. Since you are using multiprocessing, the process creation could be done above the Python level to avoid this situation.
I would suggest reproducing with debug symbols, as the stack trace does not include the function names.
ping
I tried the GPU version as well; no crash in debug mode.
Revision 9d7fc7c |
I could reproduce it with the binary distribution cu101mkl.
Reproduced with a release CMake build.
Flags:
Can't reproduce with Debug builds.
RelWithDebSymbols:
No crash.
@fierceX forking an already-initialized process is not supported in MXNet. The first Process creation should not be done, as the state of the library after the fork is inconsistent. The code in the train function is never executed. With respect to the crash, after investigating this I believe it is caused by calling setenv in a pthread_atfork handler. I will refactor this code so that unsafe calls to setenv are not made during forking. Additionally, we can detect that we are in a forked state and emit additional errors in MXNet, for example during the use of DataLoader.
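Following up on that advice, here is a minimal sketch of how the child could be given a fresh MXNet instead of a forked one. This is my own illustration, not code from the issue: the dataset, shapes and batch size are made up, and it assumes the Gluon DataLoader API.

import multiprocessing as mp

def train():
    # Import mxnet only inside the freshly spawned child, so no forked
    # library state (threads, OpenMP runtime, engine) is inherited.
    import mxnet as mx
    from mxnet.gluon.data import ArrayDataset, DataLoader

    dataset = ArrayDataset(mx.nd.zeros((100, 10)), mx.nd.zeros((100,)))
    loader = DataLoader(dataset, batch_size=10, num_workers=2)
    for data, label in loader:
        pass

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # start a new interpreter instead of fork()
    p = ctx.Process(target=train)
    p.start()
    p.join()

With 'spawn' the child never sees the parent's post-import state, which sidesteps the inconsistent-after-fork situation described above.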
…cal concurrency crashes. (apache#15762)
* Refactor LibraryInitializer so it's thread safe. Fixes apache#13438 Fixes apache#14979
* Refactor around lib loading
* Fix lint
* CR
* Add option to choose between OMP implementations
* Fix bug
* Fix from CR
There are currently two hypotheses about the root cause of this error (#14979 (comment)): a) a bug in llvm / intel openmp, or b) an interaction between gomp and llvm / intel openmp. I did some more investigation and conclude that we can rule out option b. In particular, I compile MXNet and investigate the shared library dependencies of the resulting library:
Among those, the system OpenBLAS stands out, as it is built with gcc and pulls in libgomp.
Thus I recompile OpenBLAS with clang. Then we can investigate the transitive dependencies while replacing the system OpenBLAS with the llvm-openmp based OpenBLAS:
and you find that libgomp is no longer among the dependencies. So let's see if the test case by @fierceX still crashes:
As the crash remains, we can conclude this is due to a bug in llvm / intel openmp (option a). As @fierceX's use-case is common and important among MXNet users, we can thus conclude that we must not default to llvm openmp until this issue is fixed. On a sidenote, forking in a multithreaded environment is, according to the POSIX standard, generally largely undefined (you're only allowed to call async-signal-safe functions in the child until exec). @cjolivier01 please let me know if you see any issue with this investigation. PS: To compile with clang, a small change to the build configuration is needed.
what is the source file and line number of that crash in libmxnet.so? What's the line of code crashing?
@leezu, not sure if the problem of LLVM OMP is the same one as described at the beginning of this issue. I simply took the original issue as a problem of iomp5, which has been removed from all the binary releases of MXNet. Hence the issue was closed.
The library can be found at: https://software.intel.com/en-us/parallel-studio-xe/choose-download |
Is the iomp source code based on llvm? What version of llvm omp would the 2019.0 update correspond to? Or is the source different? I'll try it.
which line? the stack trace you listed says libmxnet.so at stack level 0 rather than libomp.so, so it wouldn't be in the omp calls here, correct? Is the CHECK_GE failing?
The function referred to above is in libmxnet.so, specifically at line 65.
thanks, now the stack trace makes sense. maybe libomp isn't built with -O0 in debug mode (assuming this is a debug build).
this problem is gone with the upgrade?
If it's gone with the upgrade, then fine. However, if it's not, and since it also happened with the official dist of libiomp5 (and, if still happening, also the official llvm dist), then considering that llvm omp is in HUGE distribution globally, being part of clang and all, it seems pretty unlikely to me that it's a bug in the openmp causing this. Especially since I wrote most of this omp-related stuff in mxnet that is in that stack trace, and I definitely didn't test it specifically with forking; it wasn't a use case at the time.

In fact, at the time I wrote that, it was known that trying to use omp at all (with libgomp specifically) would hang if attempted in a forked process (there's an issue+PR in there somewhere fixing the issue by avoiding a kernel that used omp, I seem to recall; it was a long time ago and before llvm openmp was added, and it was noted that it didn't happen in the mkl build which used libiomp5 instead). Generally this wasn't a problem because OMP_NUM_THREADS gets set to 1 in the atfork call by the engine code. However, if mxnet is loaded after the fork, then that environment variable was never set, because the engine code never ran to hook the fork call before then.

I think it's possible there's a bug in the (my) mxnet omp code, since this wasn't a use-case considered. This would mean it would likely still occur with clang builds (assuming it's not intermittent and hard to reproduce).
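Purely as an illustration of that mechanism at the Python level, the sketch below mimics what the engine's atfork hook is described to do. This is my own hypothetical mitigation, not MXNet's engine code; os.register_at_fork requires Python 3.7+, whereas the reporter is on 3.6.

import os

def _single_omp_thread_in_child():
    # Mirror what the engine's atfork hook is described to do: force a
    # single OpenMP thread in the forked child.
    os.environ['OMP_NUM_THREADS'] = '1'

# Register the hook before any fork happens; children created later via
# fork() will run it right after forking.
os.register_at_fork(after_in_child=_single_omp_thread_in_child)

import mxnet as mx  # noqa: E402  -- imported after the hook is in place

Exporting OMP_NUM_THREADS=1 in the shell before launching the program achieves the same effect when the hook API is unavailable.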
regarding the libgomp hang I noted above, apparently it's a known issue with libgomp and forking that I am surprised to see still occurring today. The reporter there lists gcc 8.
Yes, this crash still happens after the upgrade of llvm openmp. It also happens both when compiling with gcc and when compiling with llvm. The only case where the crash does not happen is when compiling with gcc and using libgomp instead of libomp. The gcc hang you refer to above, is it https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035 ? I'm aware of that bug report and am therefore quite surprised that the crash doesn't occur with gcc libgomp, but rather in all other settings.
i will see if i can reproduce this week. They link to the gcc issue in that pytorch issue link. It's not the one you linked, which is a pull request of sorts (correction: I thought it was the PR one while looking on my phone, but now on my PC I see it isn't). Yes, it seems it's that issue as well. It's weird because I saw that behavior maybe three years ago using gcc 3.x-ish, I think, so I assumed that libgomp had been corrected to handle fork properly since then. I am surprised that the same behavior is being reproduced in such a new gcc version. I want to try to reproduce that issue this week as well.

llvm openmp, as you can see from the comments in the pytorch issue, is known to handle forking correctly. Another thing that pytorch issue mentions is cuda after a fork. While it's reasonable to assume it's illegal to use cuda, then fork, and also use cuda in the forked process, I wonder if it works ok if you fork before using cuda for the first time, as in this issue.
It should be a different code base.
This is loading the wrong omp library, different from the one that it was just built against. That library comes with (on Ubuntu) the libomp5 .deb package. The proper one would be in cmake-build-debug/3rdparty/openmp at build time. Why it is loading that other one I did not track down, because the problem went away when I linked to the proper library in libmxnet.so's dir. I also uninstalled the libomp5 package on my machine in the course of testing.

It might be getting pulled in because the cython compile uses a different "toolchain" (which may or may not map back to the same compiler, which on my machine is just blindly running x86_64-linux-gnu-gcc in the path). Even if this is not the cause, it should be looked at, because with more than one toolchain on a lot of dev boxes these days, this is a recipe for trouble. Since the cython library has libmxnet as a dependency, it is conceivable that in some use-cases it gets first stab at loading whatever shared object it wants, and so if not using the same toolchain, this could get pretty nasty (i.e. imagine libmxnet.so is forced, at load time, to link against libstdc++ from gcc 3.6 when mxnet was compiled with gcc 8). I know they have version tags in the symbols, but you get the idea, right? This should be looked into, imho.

btw this is why the location for the omp stack trace was ??? -- no debug info for "/usr/lib/x86_64-linux-gnu/libomp.so.5". At any rate, there are a number of ways to resolve this, just as one would resolve the wrong opencv library being loaded -- it's not rocket science :)

Summary: No evidence found suggesting that this is a libomp5/libomp bug (the "upgrade" wasn't actually necessary, but doesn't hurt anything, so good to leave it in).
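To confirm at runtime which OpenMP runtime actually got mapped into the process, a Linux-only check like the following sketch can help (my addition, not something posted in this thread):

import mxnet as mx  # noqa: F401  -- importing triggers loading of libmxnet.so

# /proc/self/maps lists every shared object mapped into this process,
# so filtering it shows whether libomp, libgomp or libiomp5 was picked up.
with open('/proc/self/maps') as maps:
    runtimes = sorted({line.split()[-1]
                       for line in maps
                       if 'libomp' in line or 'libgomp' in line or 'libiomp' in line})

for path in runtimes:
    print(path)

Seeing /usr/lib/x86_64-linux-gnu/libomp.so.5 here instead of the library built under cmake-build-debug/3rdparty/openmp would reproduce the mismatch described above.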
By the way, I don't care if it's used or not, but on the MXNet cython branch, I did some cython stuff that, in the cmake files, uses the mxnet toolchain to build the cython library. That's one approach; there are other approaches, with pros and cons for each.
@cjolivier01 thank you for looking into this. I notice that the crash also happens when using the system llvm openmp at compile time (i.e. deleting the bundled copy so the system one is used). BTW, the update of the
The assert issue is fixed in the referenced PR above. The cython setup script apparently uses forking, which is causing the problem during compilation.
You can see that in the newer version, they set __kmp_team_pool = NULL in the atfork handler:

__kmp_atfork_child()
{
    ...
    __kmp_team_pool = NULL;
    ...
}

This is why the assert goes away, but the assert remains harmless even in the old version.
Thanks for looking into it. Even when harmless, it's annoying when using a debug build, so it's good to make it go away.
When using system llvm openmp instead of
Fixed by #17039. Thanks Chris.
This seems fixed now. |
Hardware and version information:
----------Python Info----------
Version : 3.6.8
Compiler : GCC 7.3.0
Build : ('default', 'Dec 30 2018 01:22:34')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.1.1
Directory : /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.4.1
Directory : /home/bird/miniconda3/envs/test/lib/python3.6/site-packages/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.15.0-50-generic-x86_64-with-debian-buster-sid
system : Linux
node : ctmp
release : 4.15.0-50-generic
version : #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Stepping: 3
CPU MHz: 800.218
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 6816.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
Python package version
In a GPU package with MKL, if you create a new process in Python and use multiple worker processes to load data at the same time, you get an error.
If you change the MXNet version to mxnet-cu100-1.4.1, there are no errors. Similarly, mxnet-cu100mkl-1.5.0b20190516 fails while mxnet-cu100-1.5.0b20190516 does not.
In addition, nothing goes wrong in any of the following three cases:
Using the CPU
Removing the num_workers parameter
Not creating a new process
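The reporter's script itself is not shown here; a minimal sketch of the setup described above, with made-up data and shapes, would look roughly like this:

import multiprocessing as mp

import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

def train():
    dataset = ArrayDataset(mx.nd.random.uniform(shape=(256, 32)),
                           mx.nd.zeros((256,)))
    # num_workers > 0 makes the DataLoader fork additional worker processes
    loader = DataLoader(dataset, batch_size=32, num_workers=4)
    for data, label in loader:
        data.asnumpy()

if __name__ == '__main__':
    # A new process is created after mxnet has already been imported,
    # which is the combination reported to crash with the MKL builds.
    p = mp.Process(target=train)
    p.start()
    p.join()

Per the description, switching to the non-MKL build, dropping num_workers, or not creating the new process makes the error disappear.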