[RFC] MXNet Multithreaded Inference Interface #16431

anirudh2290 · 2019-10-10T22:29:44Z

Thanks to @nswamy for his inputs and design discussions related to this project and @frankfliu for explaining the requirements and the use case from customer perspective.

Problem Statement

One of the big un-catered for use cases in MXNet is loading a model and being able to run parallel inference on the model from multiple threads while sharing the parameters. There are multiple user requests for the same [1]. There also has been a lot of confusion around the current state of MXNet with respect to thread safety.

This doc attempts to address three things :

Tries to clarify the current state of MXNet with respect to thread safety.
Tries to give an idea of the benefits to expect from adding this feature.
Attempts to solve the problem of parallel inference by providing a multi-threaded inference API ( C APIs and frontend APIs in CPP and Python),

Current State of MXNet Thread Safety

MXNet Dependency Engine Thread Safety

Examining MXNet dependency engine code, it looks like it was designed to be thread safe. Tried to push Convolution op from multiple threads into MXNet Engine, to see if there are any issues with thread safety. Used CPP Package for the same. The script is provided here : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_mxnet_op.cpp

./build/cpp-package/example/multithreading_engine_push_mxnet_op 2

The script pushes Convolution op to the engine from multiple threads. You can verify the correctness of the op with this script :
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/test_cached_op_ts_check.py

python3 test_cached_op_ts_check.py

MXNet Graph Executor Thread Safety

Removed NaiveEngine only restriction for C Predict API and tried to run multi threaded inference with C Predict API using ThreadedEngine by commenting the check : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/src/c_api/c_predict_api.cc

When running this example the program core dumps with memory leaks in Graph Executor Bind. This shows that graph executor is not thread safe.

Cached Op (Gluon Backend) Thread Safety

Try to create cached op in the main thread and spawn multiple threads to invoke the same cached op inside each of the threads. Here is the script which does the same : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_cached_op.cpp

# Usage
./build/cpp-package/example/multithreading_engine_push_cached_op <num_threads> <context> <thread_safe>

# Example
./build/cpp-package/example/multithreading_engine_push_cached_op 20 cpu 0 // uses cached op available in master

Multiple failures seen when I run this: one is in the dmlc ThreadLocalStore [2], other is in MXPlanMemory, retrieving forward_ref_count attribute. These errors are because of race condition w.r.t reading and writing of shared states in CachedOp.

Proposed Solution

Additions (Prioritized for 1.6)

Proposing to add a minimal thread safe cached op for inference which will be the following :

Similar to cached op, except it supports only inference use cases.
Doesn’t support inlining, dynamic shapes, bulking, static alloc.
Use static thread_local variables for GraphInfo which maintains the fwd_graph state, buff which maintains all ndarray states and for op_states. [ There is scope for additional optimization here w.r.t separation of buffers for inputs and params]
The above addition means that we can instantiate only one thread safe cached op per process. The frontend API for SymbolBlockThreadSafe needs to be a singleton because of this limitation.

C API Changes (Prioritized for 1.6)

Adding a new thread_safe flag for MXCreateCachedOpEx. When set to true this should create a thread_safe cached op instead of a cached op.

  /*!
   * \brief create cached operator
   */
  MXNET_DLL int MXCreateCachedOpEx(SymbolHandle handle,
                                   int num_flags,
                                   const char** keys,
                                   const char** vals,
                                   CachedOpHandle *out,
                                   bool thread_safe = false);

Add similar thread_safe flag flags to Invoke and Free to invoke thread safe cached op versions instead of the default versions.

  /*!
   * \brief invoke a cached op
   * \param handle the handle to the cached op
   * \param num_inputs number of input NDArrays
   * \param inputs input NDArrays
   * \param num_outputs number of output NDArrays
   * \param outputs output NDArrays
   * \param out_stypes output ndarrays' stypes
   * \param thread_safe whether to invoke thread safe version of cached op.
   * \return 0 when success, -1 when failure happens
   */
    
      
  MXNET_DLL int MXInvokeCachedOpEx(CachedOpHandle handle,
                                   int num_inputs,
                                   NDArrayHandle *inputs,
                                   int *num_outputs,
                                   NDArrayHandle **outputs,
                                   const int** out_stypes,
                                   bool thread_safe = false);

    
                                   
  /*!
   * \brief free cached operator
   */
  MXNET_DLL int MXFreeCachedOp(CachedOpHandle handle, bool thread_safe = false);

Please see the PoC here for details:

Thread Safe Cached Op Code : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/src/imperative/cached_op_threadsafe.h , https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/src/imperative/cached_op_threadsafe.cc
Example Code for invoking Cached Op inference from multiple threads : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_cached_op.cpp

# Usage
./build/cpp-package/example/multithreading_engine_push_cached_op <num_threads> <context> <thread_safe>

# Example
./build/cpp-package/example/multithreading_engine_push_cached_op 20 cpu 1

Use Cases Tested:

Create cached op with a single op (Convolution) from main thread. Spawn additional threads and invoke cached op from each thread.
Create cached op with a full model (resnet-18) from main thread. Spawn additional threads and invoke cached op from each thread.

CPP Frontend Changes (Priority for 1.6)

Add a singleton SymbolBlock (ThreadSafe version) with an imports API like in python, targeted for Inference.
Params will be loaded using ndarray module.
Initially only one context is supported but this can be extended to multi context.
Forward call will invoke CachedOp passing the input ndarrays and param ndarrays.
Initially sparse storage types won’t be supported and casting won’t be supported.
Will be added to the contrib API.

@access2rohit will be helping me with the CPP API changes.

Python Frontend Changes (Lower Priority, Post 1.6)

Add a SymbolBlock (threadsafe version, singleton) inheriting the SymbolBlock with imports and forward API.
Here is a PoC: https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/python/mxnet/gluon/contrib/block.py and an example of how to call it : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/test_symbolblock_cached_op_ts.py
The PoC is currently not functioning and hangs randomly. This could be because of WaitForVar and WaitForAll thread safety issues and/or the cross device copy thread safety issues and/or issues with usage of python thread local. This requires some more investigation.

Existing Issues

dmlc-core ThreadLocalStore issue[2] . Reverting back to MX_THREAD_LOCAL fixes the issue but need to explore additional downsides of reverting back. (HIGH PRIORITY FOR 1.6) : Addressed in (Build dmlc-core with old thread_local implementation for MXNet #16526)
WaitForVar and WaitForAll are not thread safe. (HIGH PRIORITY FOR 1.6). [3]
Python API Issues mentioned above. (LOWER PRIORITY, POST 1.6).

Expected Benefits

One big benefit is being able to run inference on the same model with shared params from multiple threads. Current approach is to use multiprocessing library and import mxnet in each process. This saves a lot of memory footprint and improves the throughput for inference on a single machine. To obtain some numbers I wrote a multiprocessing script in python to load model and run inference from multiple processes.

Please see here for the python script : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/test_symbolblock_cached_op_ts.py
This runs out of memory with 12 parallel inferences.

When running the same model inference on CPP, please see example here : https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_cached_op_full_model.cpp

# Usage 
./build/cpp-package/example/multithreading_engine_push_cached_op_full_model <num_threads> <context>

# Example
./build/cpp-package/example/multithreading_engine_push_cached_op_full_model 20 cpu

This is able to run more than 960 parallel inferences though there is an increased latency with higher number of parallel inferences.

Model Coverage

Models Tested	MKLDNN	CUDNN	NO-CUDNN
resnet-18	Yes	Yes	Yes

This is a work in progress list and more models will be added to this list.

What will not be supported for 1.6 ?

Since, this is a new interface where many things can go wrong, we are starting small here and will incrementally add support. Lot of these features may just work but requires some effort with verification and won't be feasible for 1.6.

Only operators tested with the existing model coverage are supported. Other operators (stateful operators, custom operators) not supported.
Only dense storage types supported currently.
Multi GPU inference not supported currently.
Instantiating multiple instances of SymbolBlockThreadsafe is not supported. Can run parallel inference only on one model per process.
dynamic shapes not supported.
static_alloc and static_shape not supported.
Bulking of ops is not supported.
This is only for inference use cases, backward pass/training use cases not supported.
graph rewrites with subgraph api currently not supported.
Python Frontend Changes

References

The text was updated successfully, but these errors were encountered:

mxnet-label-bot · 2019-10-10T22:29:48Z

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Feature

marcoabreu · 2019-10-10T22:42:05Z

Great proposal!

Few questions from my end:

Will the new C-API functions be threadsafe in general? Speak, I can invoke them at any point in time from any thread without the need of a lock, sticky-thread or a thread hierarchy? (I'm thinking of the thread-safety being done on the backend level)
Will this also support the GPU use-case? Speak, the parameters are only copied into GPU memory once in the same fashion as you're describing for the CPU?
Do you think there's a path forward to make all inference-related C-APIs threadsafe instead of splitting off another execution branch?

anirudh2290 · 2019-10-10T23:09:53Z

Thanks @marcoabreu !

Will the new C-API functions be threadsafe in general? Speak, I can invoke them at any point in time from any thread without the need of a lock, sticky-thread or a thread hierarchy? (I'm thinking of the thread-safety being done on the backend level)

The issue I found with C API thread safety especially with the cached op use case was the ThreadLocalStore. If we fix this issue then C APIs related to CreateCachedOp and InvokeCachedOp should be threadsafe.

Will this also support the GPU use-case? Speak, the parameters are only copied into GPU memory once in the same fashion as you're describing for the CPU?

This should still support the single GPU use-case for 1.6. Multi GPU inference use case requires more verification at the cached op level .

Do you think there's a path forward to make all inference-related C-APIs threadsafe instead of splitting off another execution branch?

I don't think we have such a strict split between inference and training APIs at the C API level. For example for gluon cached op we call InvokeCachedOp for both training and Inference.

But if I rephrase your question to:
Will I be able to do multi threaded inference from every frontend API which I can use to do inference today ?
Right now, I am targeting only gluon since most users have been directed towards gluon. The other ways are using module, symbolic and using C Predict API. To support these two frontend APIs requires the graph executor to be thread safe. This would definitely be a great add for MXNet since it would ensure that they can do multi-threaded inference from any of these APIs in MXNet, but not something I have planned for currently.

ddavydenko · 2019-10-14T18:47:08Z

@mxnet-label-bot , add [Feature]

ptrendx · 2019-10-23T23:42:11Z

Hi @anirudh2290, what is the status of this proposal? When do you think changes will be ready?

anirudh2290 · 2019-10-24T04:44:05Z

@ptrendx I am trying to open a PR by Friday. On the status : the two prereqs issues dmlc/dmlc-core#573 and #16434 have been better understood and fixed/worked around. I have made C API and backend changes and currently still testing it.

Because of time and resource constraints I won't be able to add the CPP frontend changes (which has been mentioned in this PR as targeted for 1.6) in this proposal but only C API changes, backend changes and tests/verification.

arcadiaphy · 2019-12-05T09:45:39Z

@anirudh2290 Just see this RFC. Let me share what I've done in multithreaded infererce, I think it's the only viable way now in mxnet.

I've deployed many models with scala API, and run them in multiple threads. The whole system has run smoothly in production environment for more than 2 months.

The backend of inference is graph executor, which is created for each thread with shared model parameters. The executors can be dynamically reshaped in each thread independently according to the shape of the data input.

Like what's mentioned above, the dependency engine is not thread safe, so if you run it in threaded engine, dead lock and core dump will happen. Therefore, naive engine is the only option left. Without the dependency scheduling, any write dependency on model parameters is likely to be executed simultaneously and mess the internal data. If mkldnn is used to accelerate inference, you will get non-deterministic results per inference because mxnet stealthily reorder the data in ndarray (write dependency involved) for mkldnn operators. I've used a temporary method to address this issue which is not suitable for an official PR.

Multithreaded inference should be used with caution. Sharing model parameters can reduce the memory footprint in your program, but a lot of memory usage is consumed by global resources (temporary workspace, random number generator, ...) or op cache for mkldnn which are stored in static thread_local variables. So thread number is the most important factor for memory footprint, any thread involving mxnet operation, be it any trivial imperative invoking of operators, will incur memory overhead by creating its own set of thread_local variables. I've spent so much time tracking down memory leak and the best solution is to limit thread number.

A new method to do multithreaded inference by threaded engine is much welcomed here. It will solve the above issues automatically and ensure result correctness by enforcing dependency checking.

anirudh2290 · 2019-12-05T21:52:22Z

Thanks for the thoughtful and valuable comments @arcadiaphy.

I've deployed many models with scala API, and run them in multiple threads. The whole system has run smoothly in production environment for more than 2 months.

The backend of inference is graph executor, which is created for each thread with shared model parameters. The executors can be dynamically reshaped in each thread independently according to the shape of the data input.

Yes, if I am not mistaken this is very similar to how the C Predict API supports multi threaded inference today.

Like what's mentioned above, the dependency engine is not thread safe, so if you run it in threaded engine, dead lock and core dump will happen. Therefore, naive engine is the only option left. Without the dependency scheduling, any write dependency on model parameters is likely to be executed simultaneously and mess the internal data. If mkldnn is used to accelerate inference, you will get non-deterministic results per inference because mxnet stealthily reorder the data in ndarray (write dependency involved) for mkldnn operators. I've used a temporary method to address this issue which is not suitable for an official PR.

This is a very useful point. In my proposal, I was concentrating mostly on ThreadedEngine and not NaiveEngine. Though, recently I added tests for NaiveEngine in my PR and everything seemed to be working fine. Till now I have not been able to reproduce the correctness issue that you mention with MKLDNN (hidden write) and NaiveEngine, but it could be because the Reorder doesnt happen in the spawned thread. Here is my test: https://github.com/apache/incubator-mxnet/pull/16654/files#diff-1335fbaf3930b1438d9be18edb07a1a6R1384 . Not sure, if something changed with MKLDNN 1.0 or my test doesnt catch that use case, will dig more into this.

Multithreaded inference should be used with caution. Sharing model parameters can reduce the memory footprint in your program, but a lot of memory usage is consumed by global resources (temporary workspace, random number generator, ...) or op cache for mkldnn which are stored in static thread_local variables. So thread number is the most important factor for memory footprint, any thread involving mxnet operation, be it any trivial imperative invoking of operators, will incur memory overhead by creating its own set of thread_local variables. I've spent so much time tracking down memory leak and the best solution is to limit thread number.

A new method to do multithreaded inference by threaded engine is much welcomed here. It will solve the above issues automatically and ensure result correctness by enforcing dependency checking.

Yes, the earlier approach which has one graph executor per thread, may have a lot of memory consumption for global resources. Sharing the cached op will alleviate the pain. As you know, we still have a lot of customers using graph executor as the backend. Would be a great add, if you are interested to contribute towards making graph executor also thread safe for inference use cases.

HeshanMountteen · 2021-05-25T08:58:38Z

@anirudh2290 hi, I use mutithreading C++ API and find that the change between NDArrayHandle* to NDArray cost too much time(about 50ms). My prvious pred time about the model just cost 20-30ms. Is it normal?

HeshanMountteen · 2021-05-25T09:01:20Z

though there exsists async, it really cost too much that against my will to use multithread

mseth10 · 2021-05-26T10:37:59Z

cc @josephevans @Zha0q1

HeshanMountteen · 2021-05-27T09:52:51Z

I think is there any method can be replaced with data output, like vector<mx_float>.

HeshanMountteen · 2021-06-03T01:38:21Z

ok, I have found float* format output width method MxPred... works.

anirudh2290 mentioned this issue Oct 11, 2019

WaitForVar in dependency engine not thread safe #16434

Open

zachgk added Feature request RFC Post requesting for comments Thread Safety labels Oct 14, 2019

samskalicky mentioned this issue Nov 27, 2019

Anirudh Subramanian's changes for multithreaded inference #16927

Closed

7 tasks

anirudh2290 mentioned this issue Dec 4, 2019

Multithreaded Inference Support #16654

Merged

4 tasks

kevmal mentioned this issue Jan 29, 2020

DQN example with MKL build error kevmal/MXNetSharp#55

Open

mrodozov mentioned this issue Jun 26, 2020

Problem with mxnet on CC8 cms-sw/cmssw#30432

Closed

szha mentioned this issue Aug 9, 2020

用C接口初始化一个模型，然后多线程去调用这个模型出bug #18476

Open

HeshanMountteen mentioned this issue May 13, 2021

About muti-threads deepinsight/insightface#1511

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] MXNet Multithreaded Inference Interface #16431

[RFC] MXNet Multithreaded Inference Interface #16431

anirudh2290 commented Oct 10, 2019 •

edited

Loading

mxnet-label-bot commented Oct 10, 2019

marcoabreu commented Oct 10, 2019

anirudh2290 commented Oct 10, 2019

ddavydenko commented Oct 14, 2019

ptrendx commented Oct 23, 2019

anirudh2290 commented Oct 24, 2019

arcadiaphy commented Dec 5, 2019 •

edited

Loading

anirudh2290 commented Dec 5, 2019

HeshanMountteen commented May 25, 2021

HeshanMountteen commented May 25, 2021

mseth10 commented May 26, 2021

HeshanMountteen commented May 27, 2021

HeshanMountteen commented Jun 3, 2021

[RFC] MXNet Multithreaded Inference Interface #16431

[RFC] MXNet Multithreaded Inference Interface #16431

Comments

anirudh2290 commented Oct 10, 2019 • edited Loading

Problem Statement

Current State of MXNet Thread Safety

MXNet Dependency Engine Thread Safety

MXNet Graph Executor Thread Safety

Cached Op (Gluon Backend) Thread Safety

Proposed Solution

Additions (Prioritized for 1.6)

C API Changes (Prioritized for 1.6)

Please see the PoC here for details:

Use Cases Tested:

CPP Frontend Changes (Priority for 1.6)

Python Frontend Changes (Lower Priority, Post 1.6)

Existing Issues

Expected Benefits

Model Coverage

What will not be supported for 1.6 ?

References

mxnet-label-bot commented Oct 10, 2019

marcoabreu commented Oct 10, 2019

anirudh2290 commented Oct 10, 2019

ddavydenko commented Oct 14, 2019

ptrendx commented Oct 23, 2019

anirudh2290 commented Oct 24, 2019

arcadiaphy commented Dec 5, 2019 • edited Loading

anirudh2290 commented Dec 5, 2019

HeshanMountteen commented May 25, 2021

HeshanMountteen commented May 25, 2021

mseth10 commented May 26, 2021

HeshanMountteen commented May 27, 2021

HeshanMountteen commented Jun 3, 2021

anirudh2290 commented Oct 10, 2019 •

edited

Loading

arcadiaphy commented Dec 5, 2019 •

edited

Loading