CPU inference freezes on server with SLURM task manager #10736
Comments
When invoked with `python -m trace --trace`:
Seems to be solved by setting:
and SLURM
Hi, I ran into the same issue with a different program, but I don't have access to that program's source code. Is there any workaround for this issue as a user, without modifying the source code to add those two parameters? I'm also curious why ONNX Runtime hasn't been fixed internally to handle SLURM correctly.
The issue is caused by the CPU affinity set for newly created threads: the default assigned CPU core may not be available from the job scheduler when cgroups are enabled. One solution is to override the function pthread_setaffinity_np. The C code is available from https://raw.githubusercontent.com/wangsl/pthread-setaffinity/main/pthread-setaffinity.c. Compile it with:

```
gcc -fPIC -shared -Wl,-soname,libpthread-setaffinity.so -ldl -o libpthread-setaffinity.so pthread-setaffinity.c
```

then:

```
export LD_PRELOAD=libpthread-setaffinity.so
```

Now it should work.
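The interposition library linked above can be sketched as follows. This is not the code at that URL, just a minimal illustration of the technique: intercept pthread_setaffinity_np via LD_PRELOAD, intersect the requested mask with the CPUs the process is actually allowed to use, and forward the result to the real function. The filtering policy here is an assumption, not necessarily what the linked file does.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <dlfcn.h>
#include <stddef.h>

/* LD_PRELOAD shim (illustrative sketch, not the linked code):
 * restrict any requested affinity mask to the CPUs the process may
 * use, so pinning a thread to a core outside the cgroup cpuset
 * cannot fail or block. */
int pthread_setaffinity_np(pthread_t thread, size_t cpusetsize,
                           const cpu_set_t *cpuset)
{
    static int (*real)(pthread_t, size_t, const cpu_set_t *);
    if (!real)
        real = (int (*)(pthread_t, size_t, const cpu_set_t *))
                   dlsym(RTLD_NEXT, "pthread_setaffinity_np");

    cpu_set_t allowed, wanted;
    /* The process mask reflects the cgroup cpuset SLURM assigned. */
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return real(thread, cpusetsize, cpuset);

    CPU_ZERO(&wanted);
    CPU_AND(&wanted, &allowed, (cpu_set_t *)cpuset);
    if (CPU_COUNT(&wanted) == 0)
        wanted = allowed;   /* no overlap: fall back to the allowed set */

    return real(thread, sizeof wanted, &wanted);
}
```

Built with the gcc command above and loaded via LD_PRELOAD, this definition shadows glibc's, so ONNX Runtime's internal affinity calls go through it first.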
Describe the bug
I tried to do inference with Python multiprocessing on a server with the SLURM task manager. The program just freezes (onnxruntime.InferenceSession blocks) and cannot be terminated with Ctrl+C. It works fine on a server without SLURM. The relevant code lines: https://github.com/hzi-bifo/RiboDetector/blob/ae40ae4a49ceb63a39297c3ae7b6d92581c6ab7b/ribodetector/detect_cpu.py#L73-L79
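A quick way to see why SLURM behaves differently (per the affinity explanation elsewhere in this thread) is to compare the machine's core count with the process's affinity mask; under a cgroup cpuset they diverge. A stdlib-only sketch (Linux-specific):

```python
import os

# Total cores the machine reports; on Linux this is not cgroup-aware.
total = os.cpu_count()

# Cores this process is actually allowed to run on. Under SLURM with
# cgroup cpusets this is usually a strict subset of the machine's cores.
allowed = os.sched_getaffinity(0)

print(f"machine cores: {total}, cores allowed for this process: {len(allowed)}")

# A library that pins new threads to cores outside `allowed`
# (via pthread_setaffinity_np) can fail or block under SLURM.
```

Outside a restricted cgroup the two numbers match; inside a SLURM allocation, `len(allowed)` is typically the job's `--cpus-per-task`.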
I also tried to set:
and
`OMP_NUM_THREADS=1`. Then the program starts to run, but the CPU load of each process is very low: the combined load of all processes stays around 200% no matter how many processes I start with multiprocessing. On a server without SLURM the CPU load of each process is normal, i.e. ~100%.

Urgency
I use onnxruntime in the software I developed: https://github.com/hzi-bifo/RiboDetector. It has recently gained many users, and all of those running under SLURM hit this issue.
System information
To Reproduce
Run the following code lines under SLURM.
The model file is: https://github.com/hzi-bifo/RiboDetector/blob/pip/ribodetector/data/ribodetector_600k_variable_len70_101_epoch47.onnx
Expected behavior
The InferenceSession can be created and inference runs at ~100% CPU load for each process.