
CPU inference freezes on server with SLURM task manager #10736

Closed
dawnmy opened this issue Mar 2, 2022 · 4 comments
Labels
core runtime issues related to core runtime

Comments


dawnmy commented Mar 2, 2022

Describe the bug

I tried to run inference with Python multiprocessing on a server managed by the SLURM task manager. The program freezes (onnxruntime.InferenceSession blocks) and cannot be terminated with Ctrl+C, but the same code works fine on a server without SLURM. The relevant code lines: https://github.com/hzi-bifo/RiboDetector/blob/ae40ae4a49ceb63a39297c3ae7b6d92581c6ab7b/ribodetector/detect_cpu.py#L73-L79

I also tried setting:

so.intra_op_num_threads = 1
so.inter_op_num_threads = 1

and OMP_NUM_THREADS=1. The program then starts running, but the CPU load of each process is very low: the combined load of all processes stays at about 200% no matter how many processes I launch with multiprocessing. On a server without SLURM, each process runs at the normal ~100% CPU load.

Urgency
I use onnxruntime in software I developed: https://github.com/hzi-bifo/RiboDetector. It has recently gained many users, and every user running it under SLURM has hit this issue.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7, Ubuntu 18.04, Fedora release 29
  • ONNX Runtime installed from (source or binary): binary with pip3
  • ONNX Runtime version: 1.10.0, 1.7.0
  • Python version: Python3.8, Python3.9

To Reproduce
Run the following lines under SLURM:

import onnxruntime

so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
model = onnxruntime.InferenceSession(model_file, so)

The model file is: https://github.com/hzi-bifo/RiboDetector/blob/pip/ribodetector/data/ribodetector_600k_variable_len70_101_epoch47.onnx
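
For completeness, a minimal self-contained sketch of the multiprocessing setup that triggers the hang (the model path and worker count below are placeholders, not the exact RiboDetector code):

import multiprocessing as mp
import onnxruntime

MODEL_FILE = "ribodetector_600k_variable_len70_101_epoch47.onnx"  # placeholder path

def worker(idx):
    so = onnxruntime.SessionOptions()
    so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    # On a SLURM-managed node this call blocks and never returns
    sess = onnxruntime.InferenceSession(MODEL_FILE, so)
    print(f"worker {idx}: session created")

if __name__ == "__main__":
    workers = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()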

Expected behavior
The InferenceSession is created successfully and inference runs at ~100% CPU load per process.



dawnmy commented Mar 2, 2022

When invoked with python -m trace --trace:

detect_cpu.py(71):             cd, self.config['state_file'][model_file_ext]).replace('.pth', '.onnx')
detect_cpu.py(70):         self.model_file = os.path.join(
detect_cpu.py(74):         so = onnxruntime.SessionOptions()
detect_cpu.py(77):         so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
detect_cpu.py(79):         self.model = onnxruntime.InferenceSession(self.model_file, so)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(315):         Session.__init__(self)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(104):         self._sess = None
onnxruntime_inference_collection.py(105):         self._enable_fallback = True
onnxruntime_inference_collection.py(317):         if isinstance(path_or_bytes, str):
onnxruntime_inference_collection.py(318):             self._model_path = path_or_bytes
onnxruntime_inference_collection.py(319):             self._model_bytes = None
onnxruntime_inference_collection.py(326):         self._sess_options = sess_options
onnxruntime_inference_collection.py(327):         self._sess_options_initial = sess_options
onnxruntime_inference_collection.py(328):         self._enable_fallback = True
onnxruntime_inference_collection.py(329):         self._read_config_from_model = os.environ.get('ORT_LOAD_CONFIG_FROM_MODEL') == '1'
 --- modulename: _collections_abc, funcname: get
_collections_abc.py(659):         try:
_collections_abc.py(660):             return self[key]
 --- modulename: os, funcname: __getitem__
os.py(671):         try:
os.py(672):             value = self._data[self.encodekey(key)]
 --- modulename: os, funcname: encode
os.py(749):             if not isinstance(value, str):
os.py(751):             return value.encode(encoding, 'surrogateescape')
os.py(673):         except KeyError:
os.py(675):             raise KeyError(key) from None
_collections_abc.py(661):         except KeyError:
_collections_abc.py(662):             return default
onnxruntime_inference_collection.py(332):         disabled_optimizers = kwargs['disabled_optimizers'] if 'disabled_optimizers' in kwargs else None
onnxruntime_inference_collection.py(334):         try:
onnxruntime_inference_collection.py(335):             self._create_inference_session(providers, provider_options, disabled_optimizers)
 --- modulename: onnxruntime_inference_collection, funcname: _create_inference_session
onnxruntime_inference_collection.py(347):         available_providers = C.get_available_providers()
onnxruntime_inference_collection.py(350):         if 'TensorrtExecutionProvider' in available_providers:
onnxruntime_inference_collection.py(353):             self._fallback_providers = ['CPUExecutionProvider']
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
onnxruntime_inference_collection.py(357):                                                                         provider_options,
onnxruntime_inference_collection.py(358):                                                                         available_providers)
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
 --- modulename: onnxruntime_inference_collection, funcname: check_and_normalize_provider_args
onnxruntime_inference_collection.py(48):     if providers is None:
onnxruntime_inference_collection.py(49):         return [], []
onnxruntime_inference_collection.py(359):         if providers == [] and len(available_providers) > 1:
onnxruntime_inference_collection.py(366):         session_options = self._sess_options if self._sess_options else C.get_default_session_options()
onnxruntime_inference_collection.py(367):         if self._model_path:
onnxruntime_inference_collection.py(368):             sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)


dawnmy commented Mar 7, 2022

This seems to be solved by setting

so.intra_op_num_threads = 1
so.inter_op_num_threads = 1

and running the SLURM job with --cpus-per-task <num_cpus> --threads-per-core 1.
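
Put together, a minimal sketch of the combination that works for me (the srun flags and model path are illustrative, not the exact RiboDetector code):

# Launched with e.g.: srun --cpus-per-task <num_cpus> --threads-per-core 1 python detect.py
import onnxruntime

so = onnxruntime.SessionOptions()
so.intra_op_num_threads = 1   # limit intra-op parallelism to a single thread
so.inter_op_num_threads = 1   # likewise for inter-op parallelism
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
model = onnxruntime.InferenceSession("model.onnx", so)  # "model.onnx" stands in for the real model file

With these settings, parallelism comes only from the multiprocessing worker processes rather than from ONNX Runtime's internal thread pools.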


gevro commented May 8, 2024

Hi, I ran into the same issue with a different program, but I don't have access to its source code. Is there any workaround for a user who cannot modify the source code to add those two parameters?

I'm also curious why ONNX Runtime hasn't been fixed internally to handle SLURM correctly.


wangsl commented May 11, 2024

The issue is caused by the CPU affinity set for newly created threads: when cgroups are enabled, the CPU core assigned by default may not be available to the job under the scheduler. One workaround is to override the function pthread_setaffinity_np. The C code is available at

https://raw.githubusercontent.com/wangsl/pthread-setaffinity/main/pthread-setaffinity.c

To compile the code:

gcc -fPIC -shared -Wl,-soname,libpthread-setaffinity.so -ldl -o libpthread-setaffinity.so pthread-setaffinity.c

then set

export LD_PRELOAD=libpthread-setaffinity.so

Now it should work.
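
As a quick sanity check (not part of the fix above), the cgroup-restricted CPU set that a SLURM job is actually allowed to use can be inspected from Python with os.sched_getaffinity; cores outside this set are the ones the default thread pinning can collide with, per the explanation above:

import os

# CPU cores this process is allowed to run on under the SLURM cgroup (Linux only)
allowed = os.sched_getaffinity(0)
print(f"{len(allowed)} usable cores: {sorted(allowed)}")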
