This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Segmentation fault: 11 #17043

Open
tranvanhoa533 opened this issue Dec 11, 2019 · 16 comments

@tranvanhoa533

Description

I trained ArcFace on 8 GPUs and hit a segmentation fault after some iterations.

Error Message

INFO:root:Iter[0] Batch [25300]	Speed: 521.90 samples/sec
INFO:root:Iter[25320] fc7_acc 0.0 	 fc7_ce 13.38002197265625
INFO:root:Iter[0] Batch [25320]	Speed: 515.28 samples/sec
INFO:root:Iter[25340] fc7_acc 0.0 	 fc7_ce 13.379957275390625
INFO:root:Iter[0] Batch [25340]	Speed: 514.40 samples/sec
INFO:root:Iter[25360] fc7_acc 0.004999999888241291 	 fc7_ce 13.37499755859375
INFO:root:Iter[0] Batch [25360]	Speed: 495.12 samples/sec
INFO:root:Iter[25380] fc7_acc 0.0 	 fc7_ce 13.380018310546875
INFO:root:Iter[0] Batch [25380]	Speed: 517.99 samples/sec
INFO:root:Iter[25400] fc7_acc 0.0 	 fc7_ce 13.3799951171875
INFO:root:Iter[0] Batch [25400]	Speed: 516.58 samples/sec
INFO:root:Iter[25420] fc7_acc 0.0024999999441206455 	 fc7_ce 13.377520751953124
INFO:root:Iter[0] Batch [25420]	Speed: 499.38 samples/sec
INFO:root:Iter[25440] fc7_acc 0.0024999999441206455 	 fc7_ce 13.37696044921875
INFO:root:Iter[0] Batch [25440]	Speed: 515.53 samples/sec
INFO:root:Iter[25460] fc7_acc 0.0 	 fc7_ce 13.3800244140625
INFO:root:Iter[0] Batch [25460]	Speed: 527.34 samples/sec
INFO:root:Iter[25480] fc7_acc 0.0 	 fc7_ce 13.38001953125
INFO:root:Iter[0] Batch [25480]	Speed: 504.62 samples/sec
INFO:root:Iter[25500] fc7_acc 0.0 	 fc7_ce 13.38001953125
INFO:root:Iter[0] Batch [25500]	Speed: 527.38 samples/sec
INFO:root:Iter[25520] fc7_acc 0.0 	 fc7_ce 13.37994873046875
INFO:root:Iter[0] Batch [25520]	Speed: 514.74 samples/sec

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4015ca) [0x7f7ca48725ca]
[bt] (1) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x341c826) [0x7f7ca788d826]
[bt] (2) /lib64/libc.so.6(+0x363b0) [0x7f7da0e303b0]
[bt] (3) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x309b98e) [0x7f7ca750c98e]
[bt] (4) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a03d5) [0x7f7ca75113d5]
[bt] (5) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a0f6f) [0x7f7ca7511f6f]
[bt] (6) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x2e8) [0x7f7ca71d0618]
[bt] (7) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb0b09) [0x7f7ca7121b09]
[bt] (8) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cba444) [0x7f7ca712b444]
[bt] (9) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cbe5d2) [0x7f7ca712f5d2]
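
Most frames above show only a module path and an offset such as `(+0x4015ca)`. As a minimal sketch, the helper below turns those frames into `addr2line` commands that can be run by hand. The function name `addr2line_commands` is hypothetical, and the output is only informative if the library was built with debug info; a stripped pip wheel will most likely resolve to `??`.

```python
import re
import shlex

# Matches unsymbolized frames such as:
#   [bt] (0) /path/to/libmxnet.so(+0x4015ca) [0x7f7ca48725ca]
# Frames that already carry a symbol name are skipped.
FRAME_RE = re.compile(
    r"\[bt\] \((\d+)\) (\S+?)\(\+(0x[0-9a-fA-F]+)\) \[0x[0-9a-fA-F]+\]"
)


def addr2line_commands(trace_text):
    """Return one addr2line command per unsymbolized frame in trace_text."""
    commands = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue
        _frame, module, offset = match.groups()
        # -f prints the function name, -C demangles C++ symbols,
        # -e selects the binary or shared library to inspect.
        commands.append(f"addr2line -f -C -e {shlex.quote(module)} {offset}")
    return commands
```

Pasting the stack trace into `addr2line_commands` and running the printed commands on the machine that produced the crash should map each offset to a function, assuming the same `libmxnet.so` is still installed there.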

To Reproduce

I used the code from the insightface repo and ran train_parall.py with per-batch-size 50.

What have you tried to solve it?

I tried installing different MXNet versions (1.4.0, 1.4.1, 1.5.0, 1.5.1) via pip.

Environment

  • Python 3.6
  • Centos 7.6
  • Cuda 10.0
@lucasxlu

lucasxlu commented Jan 7, 2020

@tranvanhoa533 I'm facing the same problem. Did you solve it?

@leezu
Contributor

leezu commented Jan 7, 2020

You can build MXNet from source with debug info enabled. Then the backtrace above will be more meaningful.
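
As a lighter complement to a debug build, Python's standard `faulthandler` module can record which Python call was active when the process crashed. This is only a sketch: the helper name `enable_crash_tracebacks` and the log path are hypothetical, and the dump contains Python frames only, not the native frames inside `libmxnet.so`.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Write Python tracebacks of all threads to log_path on a fatal signal."""
    # The file must stay open for as long as faulthandler is enabled,
    # so the handle is returned and should be kept alive by the caller.
    log_file = open(log_path, "w")
    # Installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this at the very top of the training script, before `import mxnet`, should be enough. Running the script as `python -X faulthandler train_parall.py` enables the same handler and writes to stderr instead.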

@larroy
Contributor

larroy commented Feb 2, 2020

Seeing another segfault here:

[2020-02-02T12:28:04.611Z] [12:28:04] cpp-package/example/test_score.cpp:155: Epoch: 4 245902 samples/sec Accuracy: 0.9072

[2020-02-02T12:28:04.864Z] [12:28:04] cpp-package/example/test_score.cpp:155: Epoch: 5 247934 samples/sec Accuracy: 0.9192

[2020-02-02T12:28:05.117Z] [12:28:04] cpp-package/example/test_score.cpp:155: Epoch: 6 248963 samples/sec Accuracy: 0.9259

[2020-02-02T12:28:05.370Z] [12:28:05] cpp-package/example/test_score.cpp:155: Epoch: 7 251046 samples/sec Accuracy: 0.9296

[2020-02-02T12:28:05.370Z] [12:28:05] cpp-package/include/mxnet-cpp/lr_scheduler.h:81: Update[5001]: Change learning rate to 0.01

[2020-02-02T12:28:05.624Z] [12:28:05] cpp-package/example/test_score.cpp:155: Epoch: 8 250000 samples/sec Accuracy: 0.9388

[2020-02-02T12:28:05.877Z] [12:28:05] cpp-package/example/test_score.cpp:155: Epoch: 9 246914 samples/sec Accuracy: 0.9396

[2020-02-02T12:28:13.938Z] [12:28:13] cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.

[2020-02-02T12:28:13.938Z] 

[2020-02-02T12:28:13.938Z] Segmentation fault: 11

[2020-02-02T12:28:13.938Z] 

[2020-02-02T12:28:13.938Z] *** Error in `./test_regress_label': double free or corruption (fasttop): 0x0000000000f2e630 ***

[2020-02-02T12:28:13.938Z] ======= Backtrace: =========

[2020-02-02T12:28:13.938Z] /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f64b98d37e5]

[2020-02-02T12:28:13.938Z] /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f64b98dc37a]

[2020-02-02T12:28:13.938Z] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f64b98e053c]

[2020-02-02T12:28:13.938Z] /usr/lib/x86_64-linux-gnu/libcublas.so.10(+0x5f33a8)[0x7f64b5c343a8]

[2020-02-02T12:28:13.938Z] /usr/lib/x86_64-linux-gnu/libcublas.so.10(+0x5f3650)[0x7f64b5c34650]

[2020-02-02T12:28:13.938Z] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0x9a)[0x7f64b989636a]

[2020-02-02T12:28:13.938Z] /usr/lib/x86_64-linux-gnu/libcublas.so.10(+0x280d6)[0x7f64b56690d6]

[2020-02-02T12:28:13.938Z] ======= Memory map: ========

[2020-02-02T12:28:13.938Z] 00400000-0041d000 r-xp 00000000 ca:01 7429019                            /work/mxnet/cpp-package/example/test_regress_label

[2020-02-02T12:28:13.938Z] 0061d000-0061e000 r--p 0001d000 ca:01 7429019                            /work/mxnet/cpp-package/example/test_regress_label

[2020-02-02T12:28:13.938Z] 0061e000-0061f000 rw-p 0001e000 ca:01 7429019                            /work/mxnet/cpp-package/example/test_regress_label

[2020-02-02T12:28:13.938Z] 00b85000-018e7000 rw-p 00000000 00:00 0                                  [heap]

[2020-02-02T12:28:13.938Z] 7f63f0000000-7f63f0021000 rw-p 00000000 00:00 0 

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15990/19/pipeline

@larroy
Contributor

larroy commented Feb 4, 2020

piotr@34-222-189-162:0: ~/mxnet [v1.6.x]> build/cpp-package/example/test_regress_label
[03:31:57] ../cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.
[03:31:57] ../src/executor/graph_executor.cc:2064: Subgraph backend MKLDNN is activated.

Segmentation fault: 11

Stack trace:
  [bt] (0) build/cpp-package/example/test_regress_label(+0x3cb7d9) [0x5573362a57d9]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7fd9f7d67f20]
  [bt] (2) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(__kmp_fork_call+0x58a) [0x7fd9f9a8784a]
  [bt] (3) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(+0x9bb5d) [0x7fd9f9adcb5d]
  [bt] (4) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(GOMP_parallel+0x9c) [0x7fd9f9ae1b1c]
  [bt] (5) build/cpp-package/example/test_regress_label(+0x19fd4d3) [0x5573378d74d3]
  [bt] (6) build/cpp-package/example/test_regress_label(+0x1a186e1) [0x5573378f26e1]
  [bt] (7) build/cpp-package/example/test_regress_label(+0x3c1ca7) [0x55733629bca7]
  [bt] (8) build/cpp-package/example/test_regress_label(+0x3c2427) [0x55733629c427]
piotr@34-222-189-162:255: ~/mxnet [v1.6.x]> build/cpp-package/example/test_regress_label
[03:32:01] ../cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.
[03:32:01] ../src/executor/graph_executor.cc:2064: Subgraph backend MKLDNN is activated.

Segmentation fault: 11

Stack trace:
  [bt] (0) build/cpp-package/example/test_regress_label(+0x3cb7d9) [0x56400eba47d9]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f7b94a0af20]
  [bt] (2) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(__kmp_fork_call+0x58a) [0x7f7b9672a84a]
  [bt] (3) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(+0x9bb5d) [0x7f7b9677fb5d]
  [bt] (4) /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so(GOMP_parallel+0x9c) [0x7f7b96784b1c]
  [bt] (5) build/cpp-package/example/test_regress_label(+0x19fd4d3) [0x5640101d64d3]
  [bt] (6) build/cpp-package/example/test_regress_label(+0x1a186e1) [0x5640101f16e1]
  [bt] (7) build/cpp-package/example/test_regress_label(+0x3c1ca7) [0x56400eb9aca7]
  [bt] (8) build/cpp-package/example/test_regress_label(+0x3c2427) [0x56400eb9b427]
piotr@34-222-189-162:0: ~/mxnet [v1.6.x]> build/cpp-package/example/test_regress_label
[03:32:01] ../cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.
[03:32:01] ../src/executor/graph_executor.cc:2064: Subgraph backend MKLDNN is activated.
piotr@34-222-189-162:0: ~/mxnet [v1.6.x]> build/cpp-package/example/test_regress_label
[03:32:02] ../cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.
[03:32:02] ../src/executor/graph_executor.cc:2064: Subgraph backend MKLDNN is activated.
piotr@34-222-189-162:0: ~/mxnet [v1.6.x]> build/cpp-package/example/test_regress_label

@larroy
Contributor

larroy commented Feb 4, 2020

Testing build/cpp-package/example/test_regress_label




Thread 58 "test_regress_la" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff6ebff700 (LWP 20624)]
0x00007ffff43b284a in __kmp_fork_call () from /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
(gdb) bt
#0  0x00007ffff43b284a in __kmp_fork_call () from /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
#1  0x00007ffff4407b5d in __kmp_GOMP_fork_call () from /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
#2  0x00007ffff440cb1c in __kmp_api_GOMP_parallel_40_alias () from /home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
#3  0x0000555556f514d3 in void mxnet::op::normal_op<mshadow::cpu, mxnet::op::SampleNormalParam>(nnvm::NodeAttrs const&, mxnet::OpContext const&, mxnet::OpReqType const&, mxnet::TBlob*) [clone .isra.685] ()
#4  0x0000555556f6c6e1 in void mxnet::op::Sample_<mshadow::cpu, mxnet::op::SampleNormalParam>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) ()
#5  0x0000555555915ca7 in mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const ()
#6  0x0000555555916427 in std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&) ()
#7  0x0000555555856675 in std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&) ()
#8  0x0000555555861ed6 in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) ()
#9  0x0000555555862547 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) ()
#10 0x0000555555860c5a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() ()
#11 0x00007ffff30b866f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007ffff41546db in start_thread (arg=0x7fff6ebff700) at pthread_create.c:463
#13 0x00007ffff277588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

@larroy
Contributor

larroy commented Feb 4, 2020

This doesn't seem to happen when I don't link with the LLVM OpenMP in 3rdparty.

It happens approx. 50% of the time with this test.

@larroy
Contributor

larroy commented Feb 4, 2020

Looks like it might be related to mixing openmp implementations even though there were several fixes recently for related issues like #14979
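
One rough way to check for mixed OpenMP implementations from Python is to look at which OpenMP shared objects the process has mapped. The sketch below is Linux-only, since it reads `/proc/self/maps`. The function names are hypothetical, and the list of runtime names (libgomp for GCC, libomp for LLVM, libiomp5 for Intel) is an assumption that may not cover every build.

```python
import re

# Shared-object names of common OpenMP runtimes, with an optional
# version suffix such as ".1.0.0".
OPENMP_RE = re.compile(r"/(lib(?:gomp|omp|iomp5)\.so[.\d]*)$")


def loaded_openmp_runtimes(maps_text):
    """Return the sorted OpenMP library names found in /proc/<pid>/maps text."""
    names = set()
    for line in maps_text.splitlines():
        match = OPENMP_RE.search(line.strip())
        if match:
            names.add(match.group(1))
    return sorted(names)


def current_process_openmp_runtimes():
    """Inspect the running process (Linux only)."""
    with open("/proc/self/maps") as maps:
        return loaded_openmp_runtimes(maps.read())
```

Calling `current_process_openmp_runtimes()` after `import mxnet` and seeing more than one name would be consistent with two OpenMP runtimes living in the same process. It does not by itself prove that this is the cause of the crash.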

@leezu
Contributor

leezu commented Feb 4, 2020

@larroy the issue you report is due to an incompatibility between the jemalloc version used and llvm openmp.

Reproducer:

  git clone --recursive https://github.com/apache/incubator-mxnet/ mxnet
  cd mxnet
  git checkout a726c406964b9cd17efa826738a662e09d973972  # workaround https://github.com/apache/incubator-mxnet/issues/17514
  mkdir build; cd build;
  cmake -DUSE_CPP_PACKAGE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja -DUSE_CUDA=OFF -DUSE_JEMALLOC=ON ..
  ninja
  ./cpp-package/example/test_regress_label  # run 2-3 times to reproduce

If you change this to USE_JEMALLOC=OFF, it works.
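
Because the crash is intermittent, it can help to measure how often a given build fails rather than rerunning by hand. The sketch below runs a command repeatedly and records the exit codes. The function name `failure_rate` is hypothetical. It counts any nonzero exit as a failure instead of checking for a signal, because the shell prompt in the earlier transcript appears to show exit status 255 after a crashed run.

```python
import subprocess


def failure_rate(command, runs=10):
    """Run command `runs` times; return (number of failures, exit codes)."""
    codes = []
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # A negative code on POSIX means the process was killed by a signal;
        # a positive code is a normal but unsuccessful exit.
        codes.append(result.returncode)
    failures = sum(1 for code in codes if code != 0)
    return failures, codes
```

For example, `failure_rate(["./cpp-package/example/test_regress_label"], runs=20)` run from the build directory, once with `USE_JEMALLOC=ON` and once with `OFF`, would give a rough comparison of the two configurations.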

@larroy
Contributor

larroy commented Feb 5, 2020

Thanks. Should we add a check of some sort to prevent this from happening? Do you have more info about this incompatibility?

@leezu
Contributor

leezu commented Feb 5, 2020

For now we can't recommend compiling with jemalloc anyway. See the reasoning at #17324.

> Do you have more info about this incompatibility?

No. It's just an empirical observation.

@leezu
Contributor

leezu commented Mar 4, 2020

Seeing the issue again in http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17751/4/pipeline

It's the same pipeline @szha reported as failing above.

That pipeline runs the following build

https://github.com/apache/incubator-mxnet/blob/5cffa744859658d8192041eafcdcfcf176d27482/ci/docker/runtime_functions.sh#L762-L779

The build log associated with the build used for above failing pipeline is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17751/4/pipeline/51, specifically http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-17751/runs/4/nodes/51/steps/294/log/?start=0

There are a couple of interesting points about this build and failure:

  1. The build is unrelated to LLVM OpenMP, because our Makefile build does not support LLVM OpenMP.
  2. The build does not use jemalloc.

So I think we can conclude that the issue is not with jemalloc. Instead, there is an underlying MXNet bug, and building with jemalloc and OpenMP makes the bug much easier to reproduce.

@larroy
Contributor

larroy commented Mar 4, 2020

Somebody needs a stack trace. This kind of thing always reminds me of the Pontiac allergic to vanilla ice cream.

@licaoyuan123

I solved this problem by using the mxnet-cuda90 version.

@szha
Member

szha commented Apr 29, 2020

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/centos-gpu/branches/PR-18146/runs/26/nodes/78/steps/127/log/?start=0

[2020-04-28T23:53:28.611Z] [gw0] [  1%] PASSED tests/python/gpu/test_gluon_gpu.py::test_req 
[2020-04-28T23:53:28.869Z] Fatal Python error: Segmentation fault
[2020-04-28T23:53:28.869Z] 
[2020-04-28T23:53:28.869Z] Thread 0x00007f00a7843700 (most recent call first):
...
[2020-04-28T23:53:29.127Z] tests/python/gpu/test_gluon_gpu.py::test_hybrid_multi_context 
[2020-04-28T23:53:29.127Z] [gw1] node down: Not properly terminated
[2020-04-28T23:53:29.127Z] [gw1] [  1%] FAILED tests/python/gpu/test_gluon_gpu.py::test_symbol_block 

@szha
Member

szha commented Apr 29, 2020

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-18146/runs/26/nodes/414/steps/462/log/?start=0

[2020-04-29T00:02:13.773Z] [gw1] [  2%] PASSED tests/python/gpu/test_gluon_gpu.py::test_sequential 
[2020-04-29T00:02:14.333Z] Fatal Python error: Segmentation fault
[2020-04-29T00:02:14.333Z] 
[2020-04-29T00:02:14.333Z] Thread 0x00007fd7b6986700 (most recent call first):
...
[2020-04-29T00:02:14.588Z] tests/python/gpu/test_gluon_gpu.py::test_export 
[2020-04-29T00:02:14.588Z] [gw1] node down: Not properly terminated
[2020-04-29T00:02:14.588Z] [gw1] [  2%] FAILED tests/python/gpu/test_gluon_gpu.py::test_export 


6 participants