Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust the threading logic in ThreadPool::ParallelFor #3178

Merged
merged 1 commit into from
Mar 12, 2020
Merged

Conversation

snnn
Copy link
Member

@snnn snnn commented Mar 11, 2020

Description:

  1. Do not reuse the main thread.
  2. Do not plus one when mlas calculate the number of tasks to schedule. (It was me put the plus one there)

This is the second try of #1839

Motivation and Context

  • Why is this change required? What problem does it solve?
    First, in our current code, the default setting of session thread pool is far from optimal. For example, to make
    resnet models get the best perf, you should set thread count to 3, not 4. By default is 4. The default one is 30%-70% slower than the optimal one.
    For example, in the master branch, on my local host, the mlperf resnet50 model's latency is:
    4 threads: 23.8403 ms
    3 threads: 14.4022 ms

Second, the ThreadPool::ParallelFor implementation doesn't work well when the value of 'total' is larger than number of threads in the thread pool. It's not necessary to fix but if I only revert the change in mlasi.h without changing the code in ThreadPool::ParallelFor, the perf will drop dramatically. I don't quite understand why it happens. I guess it's related to the spinlock in Eigen threadpool.

If I change both of them, the perf impact is mostly ok.

model NEW_CHANGE OLD % DIFF
inception_v2 0.010315 0.009523 8
mlperf_mobilenet 0.003113 0.00281 10
squeezenet 0.002565 0.002291 11

(The old one is based on thread count=3, not 4. Otherwise all the models get much better.)

It still hurts performance on some of the models, but the others are neutral.

See more information at #1646 #1873

Regarding to #1873, I'll make it as an option in SessionOptions.

  • If it fixes an open issue, please link to the issue here.

@snnn snnn requested a review from a team as a code owner March 11, 2020 19:23
@pranavsharma
Copy link
Contributor

Until we've settled down on whether to use the main thread or not, do you think it'll be easier to make it an option that can be configured at runtime for faster experimentation? When we become happy with a given setting we can remove it from the options.

@skottmckay
Copy link
Contributor

Is using the main thread detrimental if total is less than the number of threads in the threadpool?

@yufenglee
Copy link
Member

It gets better performance to use main thread. Why don't we create a threadpool with Intra_num_thread -1 threads and use main thread as one of them?

@yufenglee
Copy link
Member

Take main thread as one thread of thread pool is also in accord with our current logic:
return thread_pool_size == 1 ? nullptr : onnxruntime::make_uniqueconcurrency::ThreadPool(name, thread_pool_size);

return thread_pool_size == 1 ? nullptr : onnxruntime::make_unique<concurrency::ThreadPool>(name, thread_pool_size);

@snnn
Copy link
Member Author

snnn commented Mar 11, 2020

It gets better performance to use main thread. Why don't we create a threadpool with Intra_num_thread -1 threads and use main thread as one of them?

Because I have another change to enable adaptive scheduling.

// The Adaptive scheduling strategy adaptively chooses the shard sizes based

In that setting, it may schedule more tasks than the number of threads in ParallelFor. To make it work, the main thread can't be used for computation.

@snnn
Copy link
Member Author

snnn commented Mar 11, 2020

Take main thread as one thread of thread pool is also in accord with our current logic:

Indeed, this PR will make it better and clear.
If the machine has 1 cpu, then don't create an thread pool, because it's useless.
If it has two, then create a pool with 2 threads.

In the latest master branch, if you have two CPUs, then you can only use one of them.

@snnn
Copy link
Member Author

snnn commented Mar 12, 2020

Is using the main thread detrimental if total is less than the number of threads in the threadpool?

Sorry I don't have the data.
Basically, the ultimate goal of this PR is to minimize thread migration on CPUs. Because I've observed the more often it happens, the higher cycles per instruction is. If we let the main thread do little work, put all the computations in the thread pool, and have a 1:1 mapping from the thread pool threads and cpu cores, then it should work good. Because in such setting, OS has less motivation to migrate the threads.

@snnn
Copy link
Member Author

snnn commented Mar 12, 2020

do you think it'll be easier to make it an option that can be configured at runtime for faster experimentation?

For the threadpool, it's easy, But we can't do it for mlas. Mlas doesn't have a SessionOption.

@snnn snnn merged commit a02638e into master Mar 12, 2020
@snnn snnn deleted the snnn/tp_changes1 branch March 12, 2020 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants