InferenceSession initialization hangs #10166
Comments
Hi - apologies - my tag is @mattloose so I didn't see this. I'll report the OS and ORT as soon as I can access the server :-)
OS - CentOS Linux 7 (Core), onnx 1.10.2
Hi,
I can't reproduce this error. It seems to me like an onnxruntime bug specific to your system. Not sure I can be of much help in tracking down this bug.
Hi - just enquiring if there is any update here, please?
@mattloose the only thing I can think of that might help debug why it's stalling is running it under strace.
So running strace on the python snippet using:

```
strace python -m trace --trace test_10166.py
```

gives:

```
clock_gettime(CLOCK_REALTIME, {tv_sec=1642620667, tv_nsec=950101133}) = 0
```

Again it hangs here without ever completing. Installing on an Ubuntu 20 system succeeds (but the required GPU is not available to us on that system).
Thanks @mattloose. My initial thought was: is NFS involved anywhere in this process? This is the type of hang you see when trying to read from a hard-mounted NFS share that is no longer accessible. If possible, can you repeat with all the code/venv/model on a locally mounted drive (tmp/scratch)? However, looking at the
Does your cluster set any cgroups CPU limits? Can you export OMP_NUM_THREADS=1 and try again? The last output from the runtime is -
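The cgroup question can be checked from Python itself: a process restricted by cgroups or taskset may see fewer usable cores than `os.cpu_count()` reports, and that mismatch is one way a thread pool sized from the total core count can misbehave. A minimal stdlib-only sketch (not part of onnxruntime; the helper name is made up for illustration):

```python
import os

def available_cores() -> int:
    """Return the number of cores this process may actually run on.

    os.cpu_count() reports every core on the machine, while
    os.sched_getaffinity(0) (Linux only) reflects any cgroup/taskset
    restriction; a thread pool sized from cpu_count() can try to pin
    threads to cores outside this set.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

print("total cores:", os.cpu_count())
print("usable cores:", available_cores())
```

If the two numbers differ, the process is running under an affinity restriction.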
@pranavsharma how do you turn this off? The session options on my (working) Ubuntu node report 0 for both:

```python
import onnxruntime as ort

ort.set_default_logger_severity(0)
so = ort.SessionOptions()
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'])
```

@mattloose what values do you see? If they are greater than 0, maybe try setting them to 0 and running again.

```python
import onnxruntime as ort

ort.set_default_logger_severity(0)
so = ort.SessionOptions()
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)
so.inter_op_num_threads = 0
so.intra_op_num_threads = 0
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'], sess_options=so)
```

If none of this helps @mattloose can you also provide which kernel version and CPU you are running on (`uname -r` and `lscpu`)?
Thanks - I will try this and let you know! In the meantime:

```
uname -r
3.10.0-1160.31.1.el7.x86_64
lscpu
```
OK - setting OMP_NUM_THREADS=1 doesn't change the behaviour. In addition, both print statements return 0. So - no joy!
I seem to be having the same problem - I get the same output and hanging with the test script. I am also starting from the same Oxford Nanopore repository (https://github.com/nanoporetech/remora), which uses onnxruntime. Debian 10 (buster). uname -r lscpu
Sometimes mine hangs (at the same place as @mattloose), but I also sometimes get an error and a segfault. Does that give any clue? I also get 0 for both inter_op_num_threads and intra_op_num_threads.

```
2022-01-25 11:43:29.453451230 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
```
@benbfly this is very helpful, thanks - it points to onnxruntime/core/platform/posix/env.cc#L180 as the source of the bad pthread_setaffinity_np call. @mattloose @benbfly #8313 suggests setting the thread counts explicitly:

```python
import onnxruntime as ort

ort.set_default_logger_severity(0)
so = ort.SessionOptions()
so.inter_op_num_threads = 1
so.intra_op_num_threads = 1
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'], sess_options=so)
```

@pranavsharma is there any extra information we can provide to help with this?
@iiSeymour running that code snippet gives me an error - specifically:
I'm probably running from an incorrect location or something? Any suggestions?
In fairness, setting the inter and intra op threads to 1 does change the behaviour - it no longer hangs as in the first post in this thread! Setting those values to 0 (or commenting them out) reintroduces the hang.
@mattloose great. I'm assuming setting both to 1 is what worked.
Can you check if it's specifically inter_op_num_threads or intra_op_num_threads?
If it's only the default value of 0 that triggers the hang, that narrows things down.
I can confirm that any of those parameter combinations work (and the model loads). So to confirm:

```python
so.inter_op_num_threads = 0
so.intra_op_num_threads = 0
```

does not work. All other permutations DO work. Looking forward to trying this out soon.
I get the same result. When I change it to anything but 0/0, it completes normally (I tried 1/1, 2/1, and 1/2). When I use 0/0, it either hangs or gives that "pthread_setaffinity_np failed" error and segfaults. @marcus1487 hopefully this can be implemented in Remora without reducing efficiency. In issue #8313 they said it got slower when they set this equal to the number of CPU cores (but my understanding of this issue is quite limited): #8313 (comment)
I've run a couple of quick tests and it seems that setting the thread counts works. I think it makes sense to leave this issue open since the underlying problem remains, but when the new remora code is pushed I will close the bonito/remora issue.
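For Remora-style workloads, one way to keep throughput while pinning each ORT session to a single thread is to move the parallelism up a level: several workers, each with its own single-threaded session. A rough sketch with a stand-in `load_model` in place of a real onnxruntime session (the names here are hypothetical, not Remora's API):

```python
from concurrent.futures import ThreadPoolExecutor

def load_model():
    # Stand-in for ort.InferenceSession(..., sess_options=so) created
    # with inter_op_num_threads = intra_op_num_threads = 1.
    return lambda x: x * 2

def run_batch(batch):
    model = load_model()  # one single-threaded session per worker
    return [model(x) for x in batch]

batches = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_batch, batches))
print(results)  # [[2, 4], [6, 8], [10, 12]]
```

With real sessions, a ProcessPoolExecutor sidesteps the GIL for CPU-bound inference; the structure is the same.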
I shall look forward to the new remora code.... :-)
I had a similar issue.
Same OS, same onnx version and onnxruntime version, same problem.
Perhaps your cgroup has fewer cores than onnxruntime requires by default.
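The cgroup limit hinted at here can be inspected directly; under cgroup v2 the CPU quota lives in `/sys/fs/cgroup/cpu.max` as "<quota> <period>". A small parser as a sketch (the path and availability vary by distro and container runtime, and the function name is made up for illustration):

```python
from typing import Optional

def parse_cpu_max(line: str) -> Optional[float]:
    """Parse a cgroup v2 cpu.max line into an effective core count.

    The file contains "<quota> <period>" in microseconds, or
    "max <period>" when no limit is set (returns None).
    """
    quota, period = line.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

# Inside a container this would typically be:
#   with open("/sys/fs/cgroup/cpu.max") as f:
#       limit = parse_cpu_max(f.read())
print(parse_cpu_max("200000 100000"))  # 2.0 -> effectively 2 cores
print(parse_cpu_max("max 100000"))     # None -> no quota
```

If the effective core count is below what the thread pool assumes, that is consistent with the affinity failures reported above.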
Using docker (ubuntu16) on kubernetes may cause this problem. Using the following settings solves it:

```python
import onnxruntime as ort

ort.set_default_logger_severity(0)
so = ort.SessionOptions()
so.inter_op_num_threads = 1
so.intra_op_num_threads = 1
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'], sess_options=so)
```

This is very helpful for the error below:

```
2022-08-27 08:55:05.951373963 [I:onnxruntime:, inference_session.cc:262 operator()] Flush-to-zero and denormal-as-zero are off
corrupted double-linked list
```
Describe the bug
I use ONNX to release/distribute production models for modified base detection from Oxford Nanopore sequencing data in the Remora repository. A user has reported an issue where onnxruntime hangs indefinitely when initializing an inference session from one of these released models.
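While debugging this class of hang, it can help to wrap session creation in a watchdog so automated jobs fail fast instead of stalling forever. A generic sketch (the timeout value and `factory` callable are illustrative; with onnxruntime, `factory` would be something like `lambda: ort.InferenceSession(path)`):

```python
import threading

def load_with_timeout(factory, timeout_s=30.0):
    """Run a blocking loader in a worker thread and fail if it stalls.

    Note: a stuck worker thread cannot be killed; it is left as a
    daemon thread so the interpreter can still exit.
    """
    result = {}

    def worker():
        result["value"] = factory()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        raise TimeoutError(f"loader still running after {timeout_s}s")
    return result["value"]

print(load_with_timeout(lambda: "model", timeout_s=1.0))
```

This turns an indefinite hang into a diagnosable error, which would have made the reports in this thread easier to collect.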
Urgency
As soon as possible, as these models are currently in production.
System information
@mattloose may be able to provide more information here.
To Reproduce
See the details of the issue in this thread (nanoporetech/bonito#216), but the issue can be reproduced with the following snippet (after downloading this model:
Upon running the above snippet, the reporting user sees the following message, followed by the code stalling without completion.
Expected behavior
The model loads and the code continues execution.
Screenshots
Not applicable.
Additional context
Not applicable.