-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust the threading logic in ThreadPool::ParallelFor #3178
Conversation
Until we've settled down on whether to use the main thread or not, do you think it'll be easier to make it an option that can be configured at runtime for faster experimentation? When we become happy with a given setting we can remove it from the options. |
Is using the main thread detrimental if total is less than the number of threads in the threadpool? |
It gets better performance to use main thread. Why don't we create a threadpool with Intra_num_thread -1 threads and use main thread as one of them? |
Take main thread as one thread of thread pool is also in accord with our current logic:
|
Because I have another change to enable adaptive scheduling.
In that setting, it may schedule more tasks than the number of threads in ParallelFor. To make it work, the main thread can't be used for computation. |
Indeed, this PR will make it better and clear. In the latest master branch, if you have two CPUs, then you can only use one of them. |
Sorry I don't have the data. |
For the threadpool, it's easy, But we can't do it for mlas. Mlas doesn't have a SessionOption. |
Description:
This is the second try of #1839
Motivation and Context
First, in our current code, the default setting of session thread pool is far from optimal. For example, to make
resnet models get the best perf, you should set thread count to 3, not 4. By default is 4. The default one is 30%-70% slower than the optimal one.
For example, in the master branch, on my local host, the mlperf resnet50 model's latency is:
4 threads: 23.8403 ms
3 threads: 14.4022 ms
Second, the ThreadPool::ParallelFor implementation doesn't work well when the value of 'total' is larger than number of threads in the thread pool. It's not necessary to fix but if I only revert the change in mlasi.h without changing the code in ThreadPool::ParallelFor, the perf will drop dramatically. I don't quite understand why it happens. I guess it's related to the spinlock in Eigen threadpool.
If I change both of them, the perf impact is mostly ok.
(The old one is based on thread count=3, not 4. Otherwise all the models get much better.)
It still hurts performance on some of the models, but the others are neutral.
See more information at #1646 #1873
Regarding to #1873, I'll make it as an option in SessionOptions.