Don't multiply YieldProcessor count by proc count #13556
Conversation
Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388

- Multiplying the YieldProcessor count by the proc count can cause excessive delays that are not fruitful on machines with a large number of procs. Even on a 12-proc machine (6-core), the heuristics without the multiply seem to perform much better.
- The issue above also mentions that the PAUSE instruction on Intel Skylake+ processors has a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that both tests begin crawling at low thread counts.
- I did most of the testing on ManualResetEventSlim, and since Task uses the same spin heuristics, I applied the same change there as well.
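For context, here is a minimal sketch of the kind of spin loop this change affects. The method name, constants, and structure are illustrative only and are not the actual coreclr source; the point is that the pre-change heuristic scaled each pause by processor count:

```csharp
using System;
using System.Threading;

static class SpinHeuristicSketch
{
    // Illustrative only: a spin-then-check loop of the style used by
    // ManualResetEventSlim/Task before falling back to a real wait.
    static bool SpinThenCheck(Func<bool> condition, int spinCount)
    {
        for (int i = 0; i < spinCount; i++)
        {
            if (condition())
                return true;

            // Before this change (illustrative): the per-iteration pause grew with
            // both the iteration number and the processor count, which on many-core
            // or Skylake+ machines (PAUSE ~140 cycles) produces very long delays.
            //   Thread.SpinWait(Environment.ProcessorCount * (4 << i));

            // After this change (illustrative): the delay no longer multiplies by
            // the processor count.
            Thread.SpinWait(4 << i);
        }
        return condition();
    }
}
```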
Interesting; I wonder... I have some threading tests that nose dive as proc counts increase; I always assumed it was straight contention rather than multiplicative contention. Might give them a retest.
SpinWait also keeps increasing its
coreclr/src/mscorlib/src/System/Threading/SpinLock.cs Lines 354 to 365 in 6aed5a1
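The comment above is cut off, but it appears to refer to SpinWait's spin counts growing on each iteration. As a rough illustration only (not the linked SpinLock.cs source), the classic exponential-backoff pattern looks like this; the threshold constant is an assumption:

```csharp
using System.Threading;

static class SpinWaitBackoffSketch
{
    // Illustrative only: the per-iteration pause doubles until the spinner
    // switches to yielding the thread instead of burning cycles.
    static void SpinOnce(ref int count)
    {
        const int YieldThreshold = 10;   // assumed threshold, not the real constant
        if (count >= YieldThreshold)
            Thread.Yield();              // give up the time slice once spinning gets long
        else
            Thread.SpinWait(4 << count); // exponentially growing busy-wait
        count++;
    }
}
```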
I have some tests with contention in the CLR plus unmanaged contention that nose dive at high core counts (testing on 24-core/48-thread), causing a big slowdown in throughput; I wonder if there is an interaction here.
Also coreclr/src/vm/amd64/JitHelpers_InlineGetThread.asm Lines 488 to 500 in 6aed5a1
Lines 143 to 155 in 2fd8b30
coreclr/src/vm/amd64/JitHelpers_Slow.asm Lines 990 to 1003 in 6aed5a1
Not sure what Lines 145 to 153 in 46ab1d1
Which would now be 48 * 140 * 20000 = 134,400,000 cycles on a 48-core Skylake with these new numbers?
It only seems to use the proc count for the max duration:
coreclr/src/inc/clrconfigvalues.h Lines 673 to 681 in aed0665
Yeah, Monitor will need to be fixed as well; even the starting spin count is very large and multiplies very quickly up to a max that is based on proc count. Once I have a scaling equivalent of YieldProcessor, all the spins that use it will need to be retuned once. Probably best to do them one by one.
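As a rough sketch of the backoff pattern being described (the constants below are placeholders, not the actual CLR config defaults from clrconfigvalues.h), monitor-style spinning grows its spin duration multiplicatively up to a cap derived from the processor count:

```csharp
using System;
using System.Threading;

static class MonitorSpinSketch
{
    // Placeholder constants; the real values come from CLR configuration and
    // are not reproduced here. PerProcLimit uses the 20000 figure quoted above.
    const int InitialDuration = 50;
    const int BackoffFactor = 3;
    const int PerProcLimit = 20_000;

    // Illustrative only: spin duration multiplies on each contention round,
    // capped at a maximum that scales with the processor count.
    static bool TryAcquireWithBackoff(Func<bool> tryAcquire)
    {
        int maxDuration = PerProcLimit * Environment.ProcessorCount;
        int duration = InitialDuration;
        while (duration <= maxDuration)
        {
            if (tryAcquire())
                return true;
            Thread.SpinWait(duration);   // each SpinWait iteration issues PAUSE
            duration *= BackoffFactor;   // grows very quickly toward the cap
        }
        return tryAcquire();             // caller falls back to blocking after this
    }
}
```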
Any idea when this is going to be available for testing in a nightly build?
Here are the perf numbers; left is baseline and right is with this change.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

The slight regression at 64 * proc count threads is minor and will be fixed in my next PR.

Core i7-6700 (Skylake, 4-core, 8-thread):

The spin loop is currently optimized for older processors; that will mostly be fixed in my next PR as well. @stephentoub, could you please take a look?
@kouvel From what I see in our data, it looks like there are even more possibilities for improvement at lower levels, but not having access to the code at the kernel level, this is just a guess. The same effects may be added on top of the kernel's own spinning. I don't know anyone on the kernel team to ask; maybe you do.
Thanks, @kouvel. I think it's fine. The testing you've done looks good, and these two paths are only used when blocking; I'll be more interested in, and want more scrutiny on, any changes we make around spinning in places where it's not about blocking, e.g. SpinWait, which is used by various concurrent data structures, by the thread pool, etc.
@redknightlois, I'm not seeing any time spent in the kernel on the Skylake processor (in the worst-case scenario above). There might be some context switching happening in your scenario. You could try to profile with PerfView, include "Thread Time", and add the public symbol server to the symbol paths to get function names.
Thanks @stephentoub. My next change will be retuning SpinWait and normalizing Thread.SpinWait; I'm currently doing more testing on that.
@kouvel Maybe it is a side effect. Let's see if it repeats after your changes hit nightly.
@kouvel This PR was marked as servicing (2.0.x), but it was done against the master branch (i.e. 2.1).
Will do. @redknightlois, would you be able to confirm whether this helps with the issue you ran into?
@kouvel I was waiting for it to show up on the nightly build of the 2.0 branch. :(
Oh sorry, I'll put up a PR for that |
Port of dotnet#13556 to release/2.0.0

Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388
Fixes https://github.com/dotnet/coreclr/issues/13630

- Multiplying the YieldProcessor count by the proc count can cause excessive delays that are not fruitful on machines with a large number of procs. Even on a 12-proc machine (6-core), the heuristics without the multiply seem to perform much better.
- The issue above also mentions that the PAUSE instruction on Intel Skylake+ processors has a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that both tests begin crawling at low thread counts.
- I did most of the testing on ManualResetEventSlim, and since Task uses the same spin heuristics, I applied the same change there as well.
We have been avoiding spin-based synchronization primitives for a while now, because we've seen that they perform poorly on servers with a large number of cores (32 cores in our case). Especially …
@redknightlois, it looks like the 2.0.0 branch is currently closed for 2.0.1. We'll have to wait until it reopens for the next release. |
@kouvel Let me know as soon as it is included. I could probably test this on 2.1, since I have to test a few other optimizations that were done in the JIT.
Don't multiply YieldProcessor count by proc count (ManualResetEventSlim, Task) (#13556)
Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388
I used two perf tests to verify the change. The first test is essentially a latency test: one thread constantly sets and resets the event in a loop (with slight modifications to add different types of delays in certain places, which are not in the code below), and a bunch of waiting threads constantly wait on the event and increment a counter on each success. The test measures how frequently the signal is seen by other threads before the event is reset, and so favors scenarios that do a light amount of spinning with little delay between iterations. Code:
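(The original test code is not included in this excerpt; the following is a hedged reconstruction from the description above. The waiter count, wait timeout, and run duration are arbitrary assumptions, not the original values.)

```csharp
using System;
using System.Threading;

class MresLatencyTest
{
    static void Main()
    {
        var ev = new ManualResetEventSlim(false);
        int waiterCount = Environment.ProcessorCount;  // assumed waiter count
        long successCount = 0;
        bool stop = false;

        var waiters = new Thread[waiterCount];
        for (int i = 0; i < waiterCount; i++)
        {
            waiters[i] = new Thread(() =>
            {
                while (!Volatile.Read(ref stop))
                {
                    // Count a success each time this thread sees the signal
                    // before the signaling thread resets the event.
                    if (ev.Wait(1))
                        Interlocked.Increment(ref successCount);
                }
            }) { IsBackground = true };
            waiters[i].Start();
        }

        var signaler = new Thread(() =>
        {
            while (!Volatile.Read(ref stop))
            {
                ev.Set();    // let waiters observe the signal
                ev.Reset();  // immediately take it away again
            }
        }) { IsBackground = true };
        signaler.Start();

        Thread.Sleep(4000);  // measure for a fixed interval
        Volatile.Write(ref stop, true);
        signaler.Join();
        foreach (var t in waiters) t.Join();

        Console.WriteLine($"Successful waits: {Interlocked.Read(ref successCount)}");
    }
}
```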
The second test measures how long it takes for all waiting threads to see the signal. One thread signals the event, and all threads wait until every waiting thread has seen the signal; then the former thread signals the event again, and so on. This test favors longer-duration spin loops that don't enter a wait state before succeeding. Waiting too early (for instance, with spinning disabled) causes excessive context switching that makes this test crawl; that's mainly what I was looking for. Code:
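(As above, the original code isn't reproduced here; the following is a hedged reconstruction from the description. The round-counter bookkeeping is my own addition to keep the rounds in lockstep, and the waiter and round counts are arbitrary.)

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class MresWakeAllTest
{
    static void Main()
    {
        var ev = new ManualResetEventSlim(false);
        int waiterCount = Math.Max(1, Environment.ProcessorCount - 1);  // assumed
        int rounds = 10_000;                                            // assumed
        int round = 0, observed = 0;
        bool stop = false;

        var waiters = new Thread[waiterCount];
        for (int i = 0; i < waiterCount; i++)
        {
            waiters[i] = new Thread(() =>
            {
                int lastRound = 0;
                while (!Volatile.Read(ref stop))
                {
                    ev.Wait();  // the code path under test
                    int r = Volatile.Read(ref round);
                    if (r != lastRound)
                    {
                        lastRound = r;
                        Interlocked.Increment(ref observed);  // this thread saw the signal
                    }
                    // Spin until the event is reset (or a new round starts) so the
                    // next Wait() blocks again.
                    while (ev.IsSet && Volatile.Read(ref round) == lastRound && !Volatile.Read(ref stop))
                        Thread.Yield();
                }
            }) { IsBackground = true };
            waiters[i].Start();
        }

        var sw = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
        {
            Volatile.Write(ref observed, 0);
            Interlocked.Increment(ref round);
            ev.Set();
            while (Volatile.Read(ref observed) < waiterCount)
                Thread.Yield();  // wait until every waiter has seen this signal
            ev.Reset();
        }
        sw.Stop();

        Volatile.Write(ref stop, true);
        ev.Set();  // release any thread still blocked in Wait()
        foreach (var t in waiters) t.Join();

        Console.WriteLine($"{rounds} rounds took {sw.ElapsedMilliseconds} ms");
    }
}
```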
I saw that simulating a longer PAUSE by multiplying the YieldProcessor count by 14 (on top of the multiply by proc count) caused the second test in particular to crawl even at proc-count threads. In both tests, removing the multiply significantly improved the score.