
Don't multiply YieldProcessor count by proc count #13556

Merged
merged 1 commit into dotnet:master on Aug 25, 2017

Conversation

@kouvel (Member) commented Aug 23, 2017

Related to the issue mentioned in https://github.com/dotnet/coreclr/issues/13388

  • Multiplying the YieldProcessor count by the processor count can cause excessive delays that are not fruitful on machines with a large number of procs. Even on a 12-proc machine (6-core), the heuristics without the multiply seem to perform much better. (A minimal sketch of the change follows this list.)
  • The issue above also mentions that the PAUSE instruction on Intel Skylake+ processors has a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that both tests begin crawling at low thread counts.
  • I did most of the testing on ManualResetEventSlim, and since Task uses the same spin heuristics, I applied the same change there as well.
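
As a rough sketch of what the change amounts to (illustrative only; the constants and loop shape below are hypothetical and do not match the actual coreclr sources):

    // Requires: using System; using System.Threading;
    // Illustrative only: hypothetical spin loop showing the shape of the change,
    // not the actual coreclr code (the real constants and loop structure differ).
    static bool SpinThenCheck(Func<bool> condition, int spinCount)
    {
        for (int i = 0; i < spinCount; ++i)
        {
            if (condition())
                return true;

            // Old heuristic: the per-iteration delay was stretched by the proc count
            // Thread.SpinWait(Environment.ProcessorCount * (4 << Math.Min(i, 10)));

            // New heuristic: the delay grows with the iteration index only
            Thread.SpinWait(4 << Math.Min(i, 10));
        }
        return condition(); // the caller falls back to a real wait after spinning
    }

On a machine with many procs, the commented-out line makes each spin iteration dramatically longer without making the condition any more likely to be observed quickly, which is what removing the multiply addresses.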

I used two perf tests to verify the change. The first test is essentially a latency test: one thread constantly sets and resets the event in a loop (with slight modifications to add different types of delays in certain places, which are not in the code below), and a number of wait threads constantly wait on the event and increment a counter on each success. The test measures how frequently the signal is seen by other threads before the event is reset, and so favors scenarios that do a light amount of spinning with little delay between iterations. Code:

    // Requires: using System; using System.Diagnostics; using System.Threading;
    private static void Main(string[] args)
    {
        int processorCount = Environment.ProcessorCount;
        int threadCount = args.Length == 0 ? processorCount : (int)uint.Parse(args[0]);

        var sw = new Stopwatch();
        var startTest = new ManualResetEvent(false);
        int waiterWaitCount = 0;
        var e = new ManualResetEventSlim(false);

        ThreadStart waitThreadStart = () =>
        {
            startTest.WaitOne();
            while (true)
            {
                e.Wait();
                Interlocked.Increment(ref waiterWaitCount);
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            startTest.WaitOne();
            while (true)
            {
                e.Set();
                e.Reset();
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        Interlocked.Exchange(ref waiterWaitCount, 0);

        // Measure

        sw.Restart();
        Thread.Sleep(500);
        sw.Stop();

        int waitCount = waiterWaitCount;

        long elapsedMs = sw.ElapsedMilliseconds;
        double score = (double)waitCount / elapsedMs;

        Console.WriteLine("Score: {0:0.000000}", score);
    }

The second test measures how long it takes for all waiting threads to see the signal. One thread signals the event, and all threads wait until every waiting thread has seen the signal; then the former thread signals the event again, and so on. This test favors longer-duration spin loops that succeed without entering a wait state: waiting too early (for instance, by disabling spinning) causes excessive context switching, which makes this test crawl, and that is mainly what I was looking for. Code:

    // Requires: using System; using System.Diagnostics; using System.Threading;
    private static void Main(string[] args)
    {
        int processorCount = Environment.ProcessorCount;
        int threadCount = args.Length == 0 ? processorCount : (int)uint.Parse(args[0]);

        var sw = new Stopwatch();
        var startTest = new ManualResetEvent(false);
        var allWaitersWoken0 = new ManualResetEvent(false);
        var allWaitersWoken1 = new ManualResetEvent(false);
        int waiterWaitCount = 0;
        int waiterWokenCount = 0;
        var e = new ManualResetEventSlim(false);

        ThreadStart waitThreadStart = () =>
        {
            startTest.WaitOne();
            var allWaitersWoken = allWaitersWoken0;
            while (true)
            {
                e.Wait();
                if (Interlocked.Increment(ref waiterWokenCount) == threadCount)
                {
                    waiterWaitCount += threadCount;
                    waiterWokenCount = 0;
                    e.Reset();
                    (allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0).Reset();
                    allWaitersWoken.Set();
                }
                else
                    allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            startTest.WaitOne();
            var allWaitersWoken = allWaitersWoken0;
            while (true)
            {
                e.Set();
                allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        Interlocked.Exchange(ref waiterWaitCount, 0);

        // Measure

        sw.Restart();
        Thread.Sleep(500);
        sw.Stop();

        int waitCount = waiterWaitCount;

        long elapsedMs = sw.ElapsedMilliseconds;
        double score = (double)waitCount / elapsedMs;

        Console.WriteLine("Score: {0:0.000000}", score);
    }

I saw that simulating a longer PAUSE by multiplying the YieldProcessor count by 14 (including the multiply by proc count) caused the second test in particular to crawl, even at proc-count threads. In both tests, removing the multiply significantly improved the score.

@kouvel kouvel added this to the 2.0.1 milestone Aug 23, 2017
@kouvel kouvel self-assigned this Aug 23, 2017
@kouvel kouvel requested a review from stephentoub August 23, 2017 23:00
@benaadams (Member)

Interesting; I wonder... I have some threading tests that nose dive as proc counts increase; I always assumed it was straight contention rather than multiplicative contention. Might give them a retest.

@benaadams (Member)

SpinWait also keeps increasing its processFactor multiplier until it hits CPU count

SPINNING_FACTOR == 100

Thread.SpinWait((turn + i) * SPINNING_FACTOR * processFactor)

//lock acquired failed and waiters updated
int processorCount = PlatformHelper.ProcessorCount;
if (turn < processorCount)
{
    int processFactor = 1;
    for (int i = 1; i <= turn * SPINNING_FACTOR; i++)
    {
        Thread.SpinWait((turn + i) * SPINNING_FACTOR * processFactor);
        if (processFactor < processorCount)
            processFactor++;
        observedOwner = m_owner;
        if ((observedOwner & LOCK_ANONYMOUS_OWNED) == LOCK_UNOWNED)
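
To put a number on the multiplicative growth in the snippet above (the 48-proc machine and the worst-case reading are assumptions for illustration, not measurements): on the last spinning turn the inner loop index reaches turn * SPINNING_FACTOR while processFactor has climbed to the proc count, so a single Thread.SpinWait call can be handed tens of millions of iterations.

    // Rough worst case implied by the snippet above, on a hypothetical 48-proc machine.
    const int SPINNING_FACTOR = 100;
    int processorCount = 48;              // assumption for illustration
    int turn = processorCount - 1;        // largest turn that still spins (turn < processorCount)
    int lastI = turn * SPINNING_FACTOR;   // upper bound of the inner loop index
    int processFactor = processorCount;   // processFactor has climbed to the proc count by then
    int worstSpinArg = (turn + lastI) * SPINNING_FACTOR * processFactor;
    Console.WriteLine(worstSpinArg);      // 22,785,600 iterations for a single Thread.SpinWait call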

@Drawaes commented Aug 23, 2017

I have some tests with contention in the CLR plus unmanaged contention that nose dive at high core counts (testing on 24 cores / 48 threads), causing a big slowdown in throughput. I wonder if there is an interaction here.

@benaadams (Member)

Also, lock/Monitor has a multiply on the pause count, which you might want to check?

; If we are on an MP system, we try spinning for a certain number of iterations
cmp dword ptr [g_SystemInfo + OFFSETOF__g_SystemInfo__dwNumberOfProcessors], 1
jle FramedLockHelper
; Exponential backoff; delay by approximately 2*r10 clock cycles
mov eax, r10d
delayLoopThinLock:
pause ; indicate to the CPU that we are spin waiting
sub eax, 1
jnz delayLoopThinLock
; Next time, wait a factor longer
imul r10d, dword ptr [g_SpinConstants + OFFSETOF__g_SpinConstants__dwBackoffFactor]

coreclr/src/vm/syncblk.inl

Lines 143 to 155 in 2fd8b30

for (DWORD i = 0; i < spincount; i++)
{
    YieldProcessor();
}
AwareLock::EnterHelperResult result = EnterObjMonitorHelper(pCurThread);
if (result != AwareLock::EnterHelperResult_Contention)
{
    return result;
}
spincount *= g_SpinConstants.dwBackoffFactor;
if (spincount > g_SpinConstants.dwMaximumDuration)


Not sure what dwBackoffFactor is, but the max pause loop dwMaximumDuration might also get quite big if it's (20000 * max(2, numProc)).

coreclr/src/vm/vars.cpp

Lines 145 to 153 in 46ab1d1

// Configurable constants used across our spin locks
// Initialization here is necessary so that we have meaningful values before the runtime is started
// These initial values were selected to match the defaults, but anything reasonable is close enough
SpinConstants g_SpinConstants = {
    50,    // dwInitialDuration
    40000, // dwMaximumDuration - ideally (20000 * max(2, numProc))
    3,     // dwBackoffFactor
    10     // dwRepetitions
};

Which would now be 48 * 140 * 20000 = 134,400,000 cycles on a 48-core Skylake with these new numbers?
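
For a sense of scale, converting that figure to time (the 3.5 GHz clock below is an assumed value, purely for illustration):

    // 48 procs * 140 cycles per PAUSE on Skylake * 20000 (the SpinLimitProcFactor default)
    long cycles = 48L * 140 * 20000;               // 134,400,000 cycles
    double clockGHz = 3.5;                         // assumed clock speed for illustration
    double ms = cycles / (clockGHz * 1e9) * 1000;  // ≈ 38 ms spent spinning in the final attempt
    Console.WriteLine($"{cycles:N0} cycles ≈ {ms:0.#} ms");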

@benaadams (Member)

Only seems to use proc count for max duration

Hex value specifying the multiplier on NumProcs to use when calculating the maximum spin duration

// Spinning heuristics
// Note that these only take effect once the runtime has been started; prior to that the values hardcoded in g_SpinConstants (vars.cpp) are used
//
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinInitialDuration, W("SpinInitialDuration"), 0x32, "Hex value specifying the first spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"), 0x3, "Hex value specifying the growth of each successive spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcFactor, W("SpinLimitProcFactor"), 0x4E20, "Hex value specifying the multiplier on NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitConstant, W("SpinLimitConstant"), 0x0, "Hex value specifying the constant to add when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0xA, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)
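
Reading those descriptions together, the maximum spin duration presumably combines the knobs roughly like this (the formula below is inferred from the descriptions, not copied from the runtime sources):

    // Inferred from the config descriptions above; not taken from the runtime sources.
    static uint MaxSpinDuration(uint numProcs, uint procCap, uint procFactor, uint constant)
    {
        // Defaults above: procCap = 0xFFFFFFFF (effectively uncapped),
        // procFactor = 0x4E20 (20000), constant = 0.
        uint cappedProcs = Math.Min(numProcs, procCap);
        return procFactor * cappedProcs + constant;
    }

With the defaults that works out to roughly 20000 * numProcs, in line with the "ideally (20000 * max(2, numProc))" comment quoted from vars.cpp above.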

@kouvel (Member, Author) commented Aug 24, 2017

Yeah, Monitor will need to be fixed as well; even the starting spin count is very large and multiplies very quickly up to a max that is based on proc count. Once I have a scaling equivalent of YieldProcessor, all the spins that use it will need to be retuned once. Probably best to do them one by one.

@redknightlois

Any idea when this is going to be available for testing on a nightly build?

@karelz karelz modified the milestones: 2.0.x, 2.0.1 Aug 25, 2017
@kouvel (Member, Author) commented Aug 25, 2017

Here are the perf numbers, left is baseline and right is with this change.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score       Right score      ∆ Score %
------------------------------------------  ---------------  ---------------  ---------
MresWaitDrainRate 00.5Pc                      422.83 ±0.12%    415.17 ±0.94%     -1.81%
MresWaitDrainRate 00.5Pc Delay1usBeforeSet    370.83 ±0.18%    346.04 ±0.17%     -6.69%
MresWaitDrainRate 01.0Pc                      360.16 ±0.20%    556.28 ±0.37%     54.45%
MresWaitDrainRate 01.0Pc Delay1usBeforeSet    349.43 ±0.18%    513.77 ±0.25%     47.03%
MresWaitDrainRate 02.0Pc                      510.61 ±0.12%    693.86 ±0.16%     35.89%
MresWaitDrainRate 02.0Pc Delay1usBeforeSet    476.63 ±0.24%    654.41 ±0.59%     37.30%
MresWaitDrainRate 04.0Pc                      568.13 ±0.24%    735.37 ±0.68%     29.44%
MresWaitDrainRate 04.0Pc Delay1usBeforeSet    548.97 ±0.23%    712.59 ±0.37%     29.80%
MresWaitDrainRate 16.0Pc                      431.19 ±1.36%    983.79 ±0.41%    128.16%
MresWaitDrainRate 16.0Pc Delay1usBeforeSet    432.95 ±1.83%    963.60 ±0.90%    122.57%
MresWaitDrainRate 64.0Pc                      692.40 ±0.19%    602.29 ±0.23%    -13.02%
MresWaitDrainRate 64.0Pc Delay1usBeforeSet    695.47 ±0.23%    602.50 ±0.24%    -13.37%
MresWaitLatency 0.5Pc                        6339.00 ±0.52%  21863.24 ±0.38%    244.90%
MresWaitLatency 0.5Pc Delay1usBeforeSet      3081.19 ±1.14%  12209.38 ±0.36%    296.26%
MresWaitLatency 1.0Pc                       17364.42 ±8.99%  30075.79 ±0.95%     73.20%
MresWaitLatency 1.0Pc Delay1usBeforeSet     33023.26 ±0.67%  35936.94 ±2.12%      8.82%
------------------------------------------  ---------------  ---------------  ---------
Total                                        1026.01 ±1.05%   1536.96 ±0.57%     49.80%

The slight regression at 64 * proc count threads is minor and will be fixed in my next PR.

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score        Right score      ∆ Score %
------------------------------------------  ----------------  ---------------  ---------
MresWaitDrainRate 00.5Pc                       389.50 ±0.25%    399.08 ±0.54%      2.46%
MresWaitDrainRate 00.5Pc Delay1usBeforeSet     395.61 ±0.21%    401.28 ±0.46%      1.43%
MresWaitDrainRate 01.0Pc                        56.17 ±0.50%    329.86 ±0.44%    487.28%
MresWaitDrainRate 01.0Pc Delay1usBeforeSet      56.13 ±0.29%    328.83 ±0.45%    485.79%
MresWaitDrainRate 02.0Pc                        97.80 ±1.37%    632.89 ±0.14%    547.10%
MresWaitDrainRate 02.0Pc Delay1usBeforeSet     104.50 ±2.49%    636.15 ±0.14%    508.77%
MresWaitDrainRate 04.0Pc                       120.89 ±0.03%    713.66 ±0.15%    490.36%
MresWaitDrainRate 04.0Pc Delay1usBeforeSet     119.04 ±0.14%    710.15 ±0.04%    496.56%
MresWaitDrainRate 16.0Pc                       173.55 ±0.25%    887.21 ±0.19%    411.22%
MresWaitDrainRate 16.0Pc Delay1usBeforeSet     173.79 ±0.23%    894.01 ±0.24%    414.44%
MresWaitDrainRate 64.0Pc                       185.82 ±0.19%    978.36 ±0.12%    426.50%
MresWaitDrainRate 64.0Pc Delay1usBeforeSet     186.14 ±0.02%    974.15 ±0.24%    423.35%
MresWaitLatency 0.5Pc                         1310.54 ±1.16%   8461.05 ±0.71%    545.62%
MresWaitLatency 0.5Pc Delay1usBeforeSet       1257.14 ±1.03%   8400.90 ±0.65%    568.25%
MresWaitLatency 1.0Pc                       11191.01 ±18.41%  29964.58 ±4.80%    167.76%
MresWaitLatency 1.0Pc Delay1usBeforeSet     10193.91 ±20.00%  25211.04 ±4.05%    147.31%
------------------------------------------  ----------------  ---------------  ---------
Total                                          322.96 ±3.13%   1364.19 ±0.85%    322.40%

The spin loop is currently optimized for older processors; that will mostly be fixed in my next PR as well.

@stephentoub, could you please take a look?

@redknightlois

@kouvel, from what I see in our data, it looks like there are even more opportunities for improvement at lower levels, but not having access to the code at the kernel level, this is just a guess. The same effects may be stacking on top of the kernel's own spinning.

[image]

I don't know anyone on the kernel team to ask; maybe you do.

@stephentoub (Member) left a review comment

Thanks, @kouvel. I think it's fine. The testing you've done looks good, and these two paths are only used when blocking; I'll be more interested in, and want more scrutiny on, any changes we make around spinning in places where it's not about blocking, e.g. SpinWait, which is used by various concurrent data structures, by the thread pool, etc.

@kouvel (Member, Author) commented Aug 25, 2017

@redknightlois, I'm not seeing any time spent in the kernel on the Skylake processor (in the worst-case scenario above). There might be some context switching happening in your scenario. You could try profiling with PerfView: include "Thread Time" and add the public symbol server to the symbol paths to get function names.

@kouvel (Member, Author) commented Aug 25, 2017

Thanks @stephentoub, my next change will retune SpinWait and normalize Thread.SpinWait; I'm currently doing more testing on that.

@kouvel kouvel merged commit 29d740e into dotnet:master Aug 25, 2017
@kouvel kouvel deleted the MresSpinFix branch August 25, 2017 20:15
@redknightlois

@kouvel Maybe it is a side effect. Let's see if it repeats after your changes hit nightly.

@karelz (Member) commented Aug 28, 2017

@kouvel this PR was marked as servicing (2.0.x), but it was done against master branch (i.e. 2.1).
If you want to bring it into servicing, please file a tracking issue and check with .NET Core shiproom. Thanks!

@karelz karelz modified the milestones: 2.1.0, 2.0.x Aug 28, 2017
@kouvel (Member, Author) commented Aug 28, 2017

Will do. @redknightlois, would you be able to confirm whether this helps with the issue you ran into?

@redknightlois

@kouvel I was waiting for it to show up on the nightly build of the 2.0 branch. :(

@kouvel (Member, Author) commented Aug 28, 2017

Oh sorry, I'll put up a PR for that

kouvel added a commit to kouvel/coreclr that referenced this pull request Aug 28, 2017
Port of dotnet#13556 to release/2.0.0
Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388
Fixes https://github.com/dotnet/coreclr/issues/13630
@kevingosse

We have been avoiding spin-based synchronization primitives for a while now because we've seen that they perform poorly on servers with a large number of cores (32 cores in our case). Task.Wait in particular spins like crazy and eats unreasonable amounts of CPU. I'm eager to see this PR merged :)

@kouvel (Member, Author) commented Aug 29, 2017

@redknightlois, it looks like the 2.0.0 branch is currently closed for 2.0.1. We'll have to wait until it reopens for the next release.

@redknightlois

@kouvel Let me know as soon as it is included. I can probably test this on 2.1, since I have to test a few other optimizations that were done in the JIT.

weshaggard added a commit that referenced this pull request Sep 8, 2017
Don't multiply YieldProcessor count by proc count (ManualResetEventSlim, Task) (#13556)