InferenceSession initialization hangs #10166

Open
marcus1487 opened this issue Jan 3, 2022 · 26 comments

@marcus1487

marcus1487 commented Jan 3, 2022

Describe the bug
I use ONNX to release/distribute production models for modified base detection from Oxford Nanopore sequencing data in the Remora repository. A user has reported an issue where onnxruntime hangs indefinitely when initializing an inference session from one of these released models.

Urgency
As soon as possible, as these models are currently in production.

System information
@mattlooss may be able to provide more information here.

To Reproduce
See the details of the issue in this thread (nanoporetech/bonito#216), but the issue can be reproduced with the following snippet (after downloading this model):

import onnxruntime as ort
ort.set_default_logger_severity(0)
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'])

Upon running the above snippet, the reporting user sees the following messages, after which the code stalls without completing.

2022-01-03 19:53:44.686663926 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2022-01-03 19:53:44.686776610 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true

Expected behavior
The model loads and code execution continues.

Screenshots
Not applicable.

Additional context
Not applicable.

@pranavsharma
Contributor

Tried on Ubuntu 20.04 with ORT 1.10 and couldn't repro it. Please specify the OS and the ORT version.


@mattloose

Hi - Apologies - my tag is @mattloose so I didn't see this.

I'll report the OS and ORT as soon as I can access the server :-)

@mattloose

OS - CentOS Linux 7 (Core)

onnx 1.10.2
onnxruntime 1.10.0

@mattloose

Hi,
Is there any update on this? Happy to test some other conditions if it helps. @marcus1487 do you have any further suggestions?
Thanks,
Matt

@marcus1487
Author

I can't reproduce this error. It seems to me like an onnxruntime bug specific to your system. Not sure I can be of much help in tracking down this bug.

@mattloose

Hi - just enquiring if there is any update here please?

@iiSeymour
Contributor

@mattloose the only thing I can think of that might help debug why it's stalling is running it under strace?

@mattloose

So running strace on the python snippet using:

strace python -m trace --trace test_10166.py

gives:

clock_gettime(CLOCK_REALTIME, {tv_sec=1642620667, tv_nsec=950101133}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1642620667, tv_nsec=950143547}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1642620667, tv_nsec=950183218}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1642620667, tv_nsec=950340172}) = 0
write(2, "2022-01-19 19:31:07.950382586 [I"..., 1272022-01-19 19:31:07.950382586 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
) = 127
futex(0x2ad859519b64, FUTEX_WAKE_PRIVATE, 2147483647) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1642620667, tv_nsec=950502349}) = 0
write(2, "2022-01-19 19:31:07.950544763 [I"..., 1732022-01-19 19:31:07.950544763 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
) = 173
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
read(3, "0-39\n", 8192) = 5
close(3) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1642620667, tv_nsec=950756204}) = 0
mmap(NULL, 2101248, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2ad85d3f5000
mprotect(0x2ad85d3f5000, 4096, PROT_NONE) = 0
clone(child_stack=0x2ad85d5f4fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2ad85d5f59d0, tls=0x2ad85d5f5700, child_tidptr=0x2ad85d5f59d0) = 92261
sched_getaffinity(92261, 128, [0]) = 128
sched_setaffinity(92261, 128, [0]) = 0
mmap(NULL, 2101248, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2ad85d5f6000
mprotect(0x2ad85d5f6000, 4096, PROT_NONE) = 0
clone(child_stack=0x2ad85d7f5fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2ad85d7f69d0, tls=0x2ad85d7f6700, child_tidptr=0x2ad85d7f69d0) = 92262
sched_setaffinity(92262, 128, [1]) = -1 EINVAL (Invalid argument)
futex(0x2ad8596dc1d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x2ad85d5f59d0, FUTEX_WAIT, 92261, NULL

Again it hangs here without ever completing. Installing on an Ubuntu 20 system succeeds (but the required GPU is not available to us on that system).

@iiSeymour
Contributor

Thanks @mattloose

My initial thought was, is NFS involved anywhere in this process? This is the type of hang you see when trying to read from a hard-mounted NFS share that is no longer accessible. If possible can you repeat with all the code/venv/model on a locally mounted drive (tmp/scratch)?
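
For example, a minimal sketch (the /tmp location is an assumption) that copies the model onto local disk before creating the session, to rule out a stalled NFS read:

import os
import shutil
import tempfile

import onnxruntime as ort

# Copy the model off any network share to local scratch first.
local_dir = tempfile.mkdtemp(dir="/tmp")
local_model = os.path.join(local_dir, "modbase_model.onnx")
shutil.copy("modbase_model.onnx", local_model)

model = ort.InferenceSession(local_model, providers=["CPUExecutionProvider"])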

However, looking at the strace output, the last syscall suggests we are waiting for a thread that never returns. Before that, sched_setaffinity fails, which could be related. The man page describes EINVAL as:

       EINVAL The affinity bit mask mask contains no processors that are
              currently physically on the system and permitted to the
              thread according to any restrictions that may be imposed
              by cpuset cgroups or the "cpuset" mechanism described in
              cpuset(7).

       EINVAL (sched_getaffinity() and, in kernels before 2.6.9,
              sched_setaffinity()) cpusetsize is smaller than the size
              of the affinity mask used by the kernel.

Does your cluster set any cgroups cpu limits? Can you export OMP_NUM_THREADS=1 and see if that makes a difference?
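
As a quick diagnostic (a sketch, assuming Linux), you could compare the CPUs this process is actually allowed to run on with the machine total; a mismatch points at the kind of cpuset/cgroup restriction the EINVAL above describes:

import os

# If a cgroup/cpuset restricts the process, the first number will be smaller
# than the second.
print("CPUs this process may use:", len(os.sched_getaffinity(0)))
print("CPUs on the machine:", os.cpu_count())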

The last output from the runtime is -

Creating and using per session threadpools since use_per_session_threads_ is true

@pranavsharma how do you turn this off?

The session options on my (working) Ubuntu node report 0 for both inter_op_num_threads and intra_op_num_threads (regardless of OMP_NUM_THREADS):

import onnxruntime as ort
ort.set_default_logger_severity(0)

so = ort.SessionOptions()
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)

model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'])

@mattloose what values do you see? If they are greater than 0 maybe try setting them to 0 and running again.

import onnxruntime as ort
ort.set_default_logger_severity(0)

so = ort.SessionOptions()
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)

so.inter_op_num_threads = 0
so.intra_op_num_threads = 0

model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'], sess_options=so)

If none of this helps, @mattloose can you also provide the kernel version and CPU you are running on (uname -r, lscpu)?

@mattloose

Thanks - I will try this and let you know!

In the meantime:

uname -r

3.10.0-1160.31.1.el7.x86_64

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 1
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
Stepping: 4
CPU MHz: 1000.000
CPU max MHz: 2000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4000.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0-19
NUMA node1 CPU(s): 20-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d

@mattloose

OK - setting

OMP_NUM_THREADS=1

Doesn't change the behaviour.

In addition, both print statements:
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)

return 0

So - no joy!

@benbfly

benbfly commented Jan 24, 2022

I seem to be having the same problem - I get the same output and hang with the test script. I am also starting from the same Oxford Nanopore repository https://github.com/nanoporetech/remora which uses onnxruntime.

Debian v.10 (buster)

uname -r
5.10.79-aufs-1

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 1556.735
CPU max MHz: 3414.5500
CPU min MHz: 1500.0000
BogoMIPS: 4491.41
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

@benbfly

benbfly commented Jan 25, 2022

Sometimes mine hangs (at the same place as @mattloose), but sometimes I get an error and a segfault instead. Does that give any clue? I also get 0 for both inter_op_num_threads and intra_op_num_threads.

2022-01-25 11:43:29.453451230 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2022-01-25 11:43:29.453502306 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
Traceback (most recent call last):
File "snippet.py", line 11, in
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'], sess_options=so)
File "/usr/local/hurcs/megalodon/2.4.1/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in init
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/usr/local/hurcs/megalodon/2.4.1/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 368, in _create_inference_session
sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:183 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int ()(int, Eigen::ThreadPoolInterface), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:
Segmentation fault


@iiSeymour
Contributor

@benbfly this is very helpful thanks - it points to onnxruntime/core/platform/posix/env.cc#L180 as the source of the bad sched_setaffinity syscall, and this looks to be the same issue as #8313.

@mattloose @benbfly #8313 suggests setting inter_op_num_threads and intra_op_num_threads to 1. Can you try:

import onnxruntime as ort
ort.set_default_logger_severity(0)

so = ort.SessionOptions()

so.inter_op_num_threads = 1
so.intra_op_num_threads = 1

print(so.inter_op_num_threads)
print(so.intra_op_num_threads)

model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'], sess_options=so)

@pranavsharma is there any extra information we can provide to help with this?

@mattloose

@iiSeymour running that code snippet gives me an error - specifically:

onnxruntime.capi.onnxruntime_pybind11_state.NoSuchFile: [ONNXRuntimeError] : 3 : NO_SUCHFILE : Load model from modbase_model.onnx failed:Load model modbase_model.onnx failed. File doesn't exist

I'm probably running from an incorrect location or something? Any suggestions?

@mattloose

In fairness - setting inter and intra op threads to 1 does change the behaviour - it no longer hangs as in the first post in this thread!

Setting those values to 0 (or commenting them out) reintroduces the hang.

@iiSeymour
Contributor

@mattloose great, to fix the NO_SUCHFILE error you just need the model file in the same directory as the code.
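
If running from the model's directory isn't convenient, one option (a sketch; the layout is an assumption) is to resolve an absolute path next to the script:

import os

import onnxruntime as ort

# Assumes modbase_model.onnx sits next to this script.
model_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "modbase_model.onnx")
model = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])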

I'm assuming setting both to 1 will tank the performance (we will check).

-x: [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means the test will auto-select a default. Must >=0.

-y: [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means the test will auto-select a default. Must >=0.

Can you check whether it's specifically 0, or just values other than 1 (i.e. multithreading), with the following 3 conditions?

so.inter_op_num_threads = 2
so.intra_op_num_threads = 2

so.inter_op_num_threads = 2
so.intra_op_num_threads = 1

so.inter_op_num_threads = 1
so.intra_op_num_threads = 2

If it's only the default value of 0 that is causing this, we can set these parameters in remora based on the number of CPU cores. If not, hopefully somebody on the @microsoft team can help 🙏🏻
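
For convenience, a small sketch that tries the three conditions in one go (assuming the model is in the working directory):

import onnxruntime as ort

# Each combination should print "loaded" if the session no longer hangs.
for inter, intra in [(2, 2), (2, 1), (1, 2)]:
    so = ort.SessionOptions()
    so.inter_op_num_threads = inter
    so.intra_op_num_threads = intra
    print(f"trying inter={inter}, intra={intra}")
    ort.InferenceSession('modbase_model.onnx',
                         providers=['CPUExecutionProvider'],
                         sess_options=so)
    print("loaded")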

@mattloose

I can confirm that all of those combinations work (and the model loads).

So to confirm:

so.inter_op_num_threads = 0
so.intra_op_num_threads = 0

Does not work.

All other permutations DO work.

Looking forward to trying this out soon.

@benbfly

benbfly commented Jan 25, 2022

I get the same result. When I change it to anything but 0/0, it completes normally (I tried 1/1, 2/1, and 1/2). When I use 0/0, it either hangs or gives that "pthread_setaffinity_np failed" error and segfaults.

@marcus1487 hopefully this can be implemented in Remora without reducing the efficiency. In the #8313 issue, they said it got slower when they set this equal to the number of CPU cores (but my understanding of this issue is quite limited): #8313 (comment)


@marcus1487
Author

I've run a couple of quick tests and it seems that setting the intr[ea]_op_num_threads values to 1 actually increases the speed of remora in most settings tried thus far. When running megalodon with a single process, 1 thread is marginally slower than 0, but otherwise (several bonito settings and other megalodon settings) the 1 thread setting is either no different or faster than the 0 threads setting. I will get this into remora and try to get a new release pushed shortly.
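
For illustration, the workaround in remora might look something like this (a hypothetical sketch, not the actual remora code):

import onnxruntime as ort

def load_session(model_path, num_threads=1):
    # Explicit thread counts sidestep the hang seen with the default of 0 on
    # affinity-restricted systems.
    so = ort.SessionOptions()
    so.inter_op_num_threads = num_threads
    so.intra_op_num_threads = num_threads
    return ort.InferenceSession(model_path,
                                providers=['CPUExecutionProvider'],
                                sess_options=so)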

I think it makes sense to leave this issue open since the underlying problem remains, but when the new remora code is pushed I will close the bonito/remora issue.

@mattloose

I shall look forward to the new remora code.... :-)

@dawnmy

dawnmy commented Mar 2, 2022

I had a similar issue: #10736

@lippman1125

OS - CentOS Linux 7 (Core)

onnx 1.10.2 onnxruntime 1.10.0

Same OS, same onnx and onnxruntime versions, same problem.

@RongchunYao

Perhaps your cgroup grants fewer cores than onnxruntime requires by default.
Maybe set so.inter_op_num_threads and so.intra_op_num_threads to no more than the number of cores your cgroup can use?
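
Something along these lines (a sketch, assuming Linux, with the cap taken from the process's allowed CPU set):

import os

import onnxruntime as ort

# Cap the intra-op pool at the number of CPUs the cgroup actually grants us.
allowed_cpus = len(os.sched_getaffinity(0))
so = ort.SessionOptions()
so.inter_op_num_threads = 1
so.intra_op_num_threads = max(1, allowed_cpus)
model = ort.InferenceSession('modbase_model.onnx',
                             providers=['CPUExecutionProvider'],
                             sess_options=so)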

@ltcs11

ltcs11 commented Apr 20, 2022

Using Docker (Ubuntu 16) on Kubernetes may cause this problem, but the strange thing is that different node machines behave differently: some get stuck and some don't.

Using the following settings solves the problem:
so.inter_op_num_threads = 1
so.intra_op_num_threads = 1

@TingsongYu

TingsongYu commented Aug 27, 2022

import onnxruntime as ort
ort.set_default_logger_severity(0)

so = ort.SessionOptions()

so.inter_op_num_threads = 1
so.intra_op_num_threads = 1

print(so.inter_op_num_threads)
print(so.intra_op_num_threads)

model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'], sess_options=so)

This is very helpful for the error below:

`2022-08-27 08:55:05.951373963 [I:onnxruntime:, inference_session.cc:262 operator()] Flush-to-zero and denormal-as-zero are off
2022-08-27 08:55:05.951412706 [I:onnxruntime:, inference_session.cc:270 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-08-27 08:55:05.951433770 [I:onnxruntime:, inference_session.cc:291 ConstructorCommon] Dynamic block base set to 0
Traceback (most recent call last):
File "/home/seg_det_proj/yolov5-trash/inference_onnx.py", line 273, in
model = YOLOV5ONNX(onnx_path=path_onnx, classes=classes)
File "/home/seg_det_proj/yolov5-trash/inference_onnx.py", line 50, in init
self.onnx_session = ort.InferenceSession(onnx_path, providers=["CUDAExecutionProvider"])
File "/root/miniconda3/envs/myconda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in init
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/root/miniconda3/envs/myconda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in _create_inference_session
sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:183 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int ()(int, Eigen::ThreadPoolInterface), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 22 error msg: Invalid argument

corrupted double-linked list
Aborted (core dumped)`
