Segmentation fault: 11 #17043
Comments
@tranvanhoa533 I'm facing the same problem; did you solve it?
You can build MXNet from source with debug info enabled. Then the backtrace above will be more meaningful.
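For reference, a from-source debug build can be sketched roughly as follows. The exact CMake option names vary between MXNet versions, so treat the flags below as assumptions and check the install docs for your checkout:

```shell
# Sketch only: clone and build MXNet with debug symbols.
# Flag names are assumptions; consult the build docs for your version.
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j"$(nproc)"
```

With `-DCMAKE_BUILD_TYPE=Debug`, symbols and line information survive into the shared library, so a gdb backtrace points at source lines instead of raw addresses.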
Seeing another segfault here:
This doesn't seem to happen when I don't link with the LLVM OpenMP in 3rdparty. It happens approximately 50% of the time with this test.
Looks like it might be related to mixing OpenMP implementations, even though there were several recent fixes for related issues such as #14979.
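A quick way to check whether a process has actually loaded more than one OpenMP runtime (the mixing suspected above) is to scan its `/proc/<pid>/maps` for the well-known library names. This is a generic diagnostic sketch, not code from MXNet, and the sample maps text is made up for illustration:

```python
import re

def find_openmp_runtimes(maps_text):
    """Return the set of OpenMP runtime names found in a /proc/<pid>/maps
    listing: libgomp (GNU), libomp (LLVM), libiomp (Intel).
    More than one entry suggests mixed OpenMP runtimes in the process."""
    pattern = re.compile(r"(libgomp|libomp|libiomp)[^/\s]*\.so")
    return {m.group(1) for m in pattern.finditer(maps_text)}

# Made-up sample of two mapped libraries in one process.
sample = """
7f2a00000000-7f2a00100000 r-xp 00000000 08:01 123 /usr/lib/libgomp.so.1
7f2a00200000-7f2a00300000 r-xp 00000000 08:01 456 /opt/llvm/lib/libomp.so
"""
print(find_openmp_runtimes(sample))  # both GNU and LLVM runtimes present
```

On a live process you would read `open(f"/proc/{pid}/maps").read()` instead of the sample string; two different runtimes in the output is a red flag for exactly the kind of crash discussed here.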
@larroy the issue you report is due to an incompatibility between the jemalloc version used and LLVM OpenMP. Reproducer:
If you build with USE_JEMALLOC=OFF, it will work.
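Concretely, the workaround amounts to flipping one build option. The invocation below is a sketch; the CMake-style spelling of the flag is assumed from the comment above:

```shell
# Sketch: rebuild with jemalloc disabled to avoid the crash described above.
cmake -B build -DUSE_JEMALLOC=OFF
cmake --build build -j"$(nproc)"
```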
Thanks, should we add some kind of check to prevent this from happening? Do you have more info about this incompatibility?
For now we can't recommend compiling with jemalloc anyway. See the reasoning at #17324.
No. It's just an empirical observation.
Seeing the issue again in http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17751/4/pipeline. It's the same pipeline @szha reported as failing above. That pipeline runs the following build. The build log associated with the build used for the failing pipeline is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17751/4/pipeline/51, specifically http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-17751/runs/4/nodes/51/steps/294/log/?start=0. There are a couple of interesting points about this build and failure:
So I think we can conclude that the issue is not with jemalloc itself, but that there is an underlying MXNet bug, and building with jemalloc and OpenMP makes the bug much easier to reproduce.
Somebody needs a stack trace. This kind of thing always reminds me of the Pontiac that was allergic to vanilla ice cream.
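Short of a full native backtrace, one low-effort way to get at least a Python-level trace out of a segfaulting process is the standard library's `faulthandler` module. This is a general diagnostic sketch, not something from this thread:

```python
import faulthandler

# Once enabled, a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS)
# dumps the Python traceback of every thread to stderr before the
# process dies, which at least shows which call triggered the crash.
faulthandler.enable()

print(faulthandler.is_enabled())  # True
```

Equivalently, running the script with `python -X faulthandler train.py` (or `PYTHONFAULTHANDLER=1`) enables it without code changes. For the native frames, a gdb backtrace from a debug build is still needed.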
I solved this problem by using the mxnet-cuda90 version.
Description
I trained ArcFace with 8 GPUs and hit a segmentation fault after some iterations.
Error Message
To Reproduce
I used code from the insightface repo and ran train_parall.py with per-batch-size 50.
What have you tried to solve it?
I tried installing different MXNet versions (1.4.0, 1.4.1, 1.5.0, 1.5.1) via pip.
Environment