
Don't multiply YieldProcessor count by proc count #13556

Merged
merged 1 commit into dotnet:master on Aug 25, 2017

Conversation

@kouvel (Member) commented Aug 23, 2017

Related to the issue mentioned in https://github.com/dotnet/coreclr/issues/13388

  • Multiplying the YieldProcessor count by the processor count can cause excessive delays that are not fruitful on machines with a large number of procs. Even on a 12-proc machine (6-core), the heuristics without the multiply seem to perform much better. (A minimal sketch of the change follows this list.)
  • The issue above also mentions that the PAUSE instruction on Intel Skylake+ processors has a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that both tests begin crawling at low thread counts.
  • I did most of the testing on ManualResetEventSlim, and since Task uses the same spin heuristics, I applied the same change there as well.
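
As a rough sketch of what the change amounts to (illustrative only; the constants and loop shape below are hypothetical and do not match the actual coreclr sources):

    // Requires: using System; using System.Threading;
    // Illustrative only: hypothetical spin loop showing the shape of the change,
    // not the actual coreclr code (the real constants and loop structure differ).
    static bool SpinThenCheck(Func<bool> condition, int spinCount)
    {
        for (int i = 0; i < spinCount; ++i)
        {
            if (condition())
                return true;

            // Old heuristic: the per-iteration delay was stretched by the proc count
            // Thread.SpinWait(Environment.ProcessorCount * (4 << Math.Min(i, 10)));

            // New heuristic: the delay grows with the iteration index only
            Thread.SpinWait(4 << Math.Min(i, 10));
        }
        return condition(); // the caller falls back to a real wait after spinning
    }

On a machine with many procs, the commented-out line makes each spin iteration dramatically longer without making the condition any more likely to be observed quickly, which is what removing the multiply addresses.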

I used two perf tests to verify the change. The first test is essentially a latency test: one thread constantly sets and resets the event in a loop (with slight modifications to add different types of delays in certain places, which are not in the code below), and a number of wait threads constantly wait on the event and increment a counter on each success. The test measures how frequently the signal is seen by other threads before the event is reset, and so favors scenarios that do a light amount of spinning with little delay between iterations. Code:

    // Requires: using System; using System.Diagnostics; using System.Threading;
    private static void Main(string[] args)
    {
        int processorCount = Environment.ProcessorCount;
        int threadCount = args.Length == 0 ? processorCount : (int)uint.Parse(args[0]);

        var sw = new Stopwatch();
        var startTest = new ManualResetEvent(false);
        int waiterWaitCount = 0;
        var e = new ManualResetEventSlim(false);

        ThreadStart waitThreadStart = () =>
        {
            startTest.WaitOne();
            while (true)
            {
                e.Wait();
                Interlocked.Increment(ref waiterWaitCount);
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            startTest.WaitOne();
            while (true)
            {
                e.Set();
                e.Reset();
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        Interlocked.Exchange(ref waiterWaitCount, 0);

        // Measure

        sw.Restart();
        Thread.Sleep(500);
        sw.Stop();

        int waitCount = waiterWaitCount;

        long elapsedMs = sw.ElapsedMilliseconds;
        double score = (double)waitCount / elapsedMs;

        Console.WriteLine("Score: {0:0.000000}", score);
    }

The second test measures how long it takes for all waiting threads to see the signal. One thread signals the event, and all threads wait until every waiting thread has seen the signal; then the former thread signals the event again, and so on. This test favors longer-duration spin loops that succeed without entering a wait state: waiting too early (for instance, by disabling spinning) causes excessive context switching, which makes this test crawl, and that is mainly what I was looking for. Code:

    // Requires: using System; using System.Diagnostics; using System.Threading;
    private static void Main(string[] args)
    {
        int processorCount = Environment.ProcessorCount;
        int threadCount = args.Length == 0 ? processorCount : (int)uint.Parse(args[0]);

        var sw = new Stopwatch();
        var startTest = new ManualResetEvent(false);
        var allWaitersWoken0 = new ManualResetEvent(false);
        var allWaitersWoken1 = new ManualResetEvent(false);
        int waiterWaitCount = 0;
        int waiterWokenCount = 0;
        var e = new ManualResetEventSlim(false);

        ThreadStart waitThreadStart = () =>
        {
            startTest.WaitOne();
            var allWaitersWoken = allWaitersWoken0;
            while (true)
            {
                e.Wait();
                if (Interlocked.Increment(ref waiterWokenCount) == threadCount)
                {
                    waiterWaitCount += threadCount;
                    waiterWokenCount = 0;
                    e.Reset();
                    (allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0).Reset();
                    allWaitersWoken.Set();
                }
                else
                    allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            startTest.WaitOne();
            var allWaitersWoken = allWaitersWoken0;
            while (true)
            {
                e.Set();
                allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        Interlocked.Exchange(ref waiterWaitCount, 0);

        // Measure

        sw.Restart();
        Thread.Sleep(500);
        sw.Stop();

        int waitCount = waiterWaitCount;

        long elapsedMs = sw.ElapsedMilliseconds;
        double score = (double)waitCount / elapsedMs;

        Console.WriteLine("Score: {0:0.000000}", score);
    }

I saw that simulating a longer PAUSE by multiplying the YieldProcessor count by 14 (including the multiply by proc count) caused the second test in particular to crawl, even at proc-count threads. In both tests, removing the multiply significantly improved the score.

@kouvel kouvel added this to the 2.0.1 milestone Aug 23, 2017
@kouvel kouvel self-assigned this Aug 23, 2017
@kouvel kouvel requested a review from stephentoub August 23, 2017 23:00
@benaadams (Member)

Interesting; I wonder... I have some threading tests that nose dive as proc counts increase; I always assumed it was straight contention rather than multiplicative contention. Might give them a retest.

@benaadams (Member)

SpinWait also keeps increasing its processFactor multiplier until it hits CPU count

SPINNING_FACTOR == 100

Thread.SpinWait((turn + i) * SPINNING_FACTOR * processFactor)

//lock acquired failed and waiters updated
int processorCount = PlatformHelper.ProcessorCount;
if (turn < processorCount)
{
    int processFactor = 1;
    for (int i = 1; i <= turn * SPINNING_FACTOR; i++)
    {
        Thread.SpinWait((turn + i) * SPINNING_FACTOR * processFactor);
        if (processFactor < processorCount)
            processFactor++;
        observedOwner = m_owner;
        if ((observedOwner & LOCK_ANONYMOUS_OWNED) == LOCK_UNOWNED)
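
To put a number on the multiplicative growth in the snippet above (the 48-proc machine and the worst-case reading are assumptions for illustration, not measurements): on the last spinning turn the inner loop index reaches turn * SPINNING_FACTOR while processFactor has climbed to the proc count, so a single Thread.SpinWait call can be handed tens of millions of iterations.

    // Rough worst case implied by the snippet above, on a hypothetical 48-proc machine.
    const int SPINNING_FACTOR = 100;
    int processorCount = 48;              // assumption for illustration
    int turn = processorCount - 1;        // largest turn that still spins (turn < processorCount)
    int lastI = turn * SPINNING_FACTOR;   // upper bound of the inner loop index
    int processFactor = processorCount;   // processFactor has climbed to the proc count by then
    int worstSpinArg = (turn + lastI) * SPINNING_FACTOR * processFactor;
    Console.WriteLine(worstSpinArg);      // 22,785,600 iterations for a single Thread.SpinWait call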

@Drawaes commented Aug 23, 2017

I have some tests with contention in the CLR plus unmanaged contention that nose dive at high core counts (testing on 24 cores / 48 threads), causing a big slowdown in throughput. I wonder if there is an interaction here.

@benaadams (Member)

Also, lock/Monitor has a multiply on the pause count, which you might want to check?

; If we are on an MP system, we try spinning for a certain number of iterations
cmp dword ptr [g_SystemInfo + OFFSETOF__g_SystemInfo__dwNumberOfProcessors], 1
jle FramedLockHelper
; Exponential backoff; delay by approximately 2*r10 clock cycles
mov eax, r10d
delayLoopThinLock:
pause ; indicate to the CPU that we are spin waiting
sub eax, 1
jnz delayLoopThinLock
; Next time, wait a factor longer
imul r10d, dword ptr [g_SpinConstants + OFFSETOF__g_SpinConstants__dwBackoffFactor]

coreclr/src/vm/syncblk.inl

Lines 143 to 155 in 2fd8b30

for (DWORD i = 0; i < spincount; i++)
{
    YieldProcessor();
}
AwareLock::EnterHelperResult result = EnterObjMonitorHelper(pCurThread);
if (result != AwareLock::EnterHelperResult_Contention)
{
    return result;
}
spincount *= g_SpinConstants.dwBackoffFactor;
if (spincount > g_SpinConstants.dwMaximumDuration)


Not sure what dwBackoffFactor is, but the max pause loop dwMaximumDuration might also get quite big if it's (20000 * max(2, numProc)).

coreclr/src/vm/vars.cpp

Lines 145 to 153 in 46ab1d1

// Configurable constants used across our spin locks
// Initialization here is necessary so that we have meaningful values before the runtime is started
// These initial values were selected to match the defaults, but anything reasonable is close enough
SpinConstants g_SpinConstants = {
    50,    // dwInitialDuration
    40000, // dwMaximumDuration - ideally (20000 * max(2, numProc))
    3,     // dwBackoffFactor
    10     // dwRepetitions
};

Which would now be 48 * 140 * 20000 = 134,400,000 cycles on a 48-core Skylake with these new numbers?
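
For a sense of scale, converting that figure to time (the 3.5 GHz clock below is an assumed value, purely for illustration):

    // 48 procs * 140 cycles per PAUSE on Skylake * 20000 (the SpinLimitProcFactor default)
    long cycles = 48L * 140 * 20000;               // 134,400,000 cycles
    double clockGHz = 3.5;                         // assumed clock speed for illustration
    double ms = cycles / (clockGHz * 1e9) * 1000;  // ≈ 38 ms spent spinning in the final attempt
    Console.WriteLine($"{cycles:N0} cycles ≈ {ms:0.#} ms");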

@benaadams (Member)

Only seems to use proc count for max duration

Hex value specifying the multiplier on NumProcs to use when calculating the maximum spin duration

// Spinning heuristics
// Note that these only take effect once the runtime has been started; prior to that the values hardcoded in g_SpinConstants (vars.cpp) are used
//
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinInitialDuration, W("SpinInitialDuration"), 0x32, "Hex value specifying the first spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"), 0x3, "Hex value specifying the growth of each successive spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcFactor, W("SpinLimitProcFactor"), 0x4E20, "Hex value specifying the multiplier on NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitConstant, W("SpinLimitConstant"), 0x0, "Hex value specifying the constant to add when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0xA, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)
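
Reading those descriptions together, the maximum spin duration presumably combines the knobs roughly like this (the formula below is inferred from the descriptions, not copied from the runtime sources):

    // Inferred from the config descriptions above; not taken from the runtime sources.
    static uint MaxSpinDuration(uint numProcs, uint procCap, uint procFactor, uint constant)
    {
        // Defaults above: procCap = 0xFFFFFFFF (effectively uncapped),
        // procFactor = 0x4E20 (20000), constant = 0.
        uint cappedProcs = Math.Min(numProcs, procCap);
        return procFactor * cappedProcs + constant;
    }

With the defaults that works out to roughly 20000 * numProcs, in line with the "ideally (20000 * max(2, numProc))" comment quoted from vars.cpp above.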

@kouvel (Member, Author) commented Aug 24, 2017

Yeah, Monitor will need to be fixed as well; even the starting spin count is very large and multiplies very quickly up to a max that is based on proc count. Once I have a scaling equivalent of YieldProcessor, all the spins that use it will need to be retuned once. Probably best to do them one by one.

@redknightlois

Any idea when this is going to be available for testing on a nightly build?

@karelz karelz modified the milestones: 2.0.x, 2.0.1 Aug 25, 2017
@kouvel (Member, Author) commented Aug 25, 2017

Here are the perf numbers, left is baseline and right is with this change.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score       Right score      ∆ Score %
------------------------------------------  ---------------  ---------------  ---------
MresWaitDrainRate 00.5Pc                      422.83 ±0.12%    415.17 ±0.94%     -1.81%
MresWaitDrainRate 00.5Pc Delay1usBeforeSet    370.83 ±0.18%    346.04 ±0.17%     -6.69%
MresWaitDrainRate 01.0Pc                      360.16 ±0.20%    556.28 ±0.37%     54.45%
MresWaitDrainRate 01.0Pc Delay1usBeforeSet    349.43 ±0.18%    513.77 ±0.25%     47.03%
MresWaitDrainRate 02.0Pc                      510.61 ±0.12%    693.86 ±0.16%     35.89%
MresWaitDrainRate 02.0Pc Delay1usBeforeSet    476.63 ±0.24%    654.41 ±0.59%     37.30%
MresWaitDrainRate 04.0Pc                      568.13 ±0.24%    735.37 ±0.68%     29.44%
MresWaitDrainRate 04.0Pc Delay1usBeforeSet    548.97 ±0.23%    712.59 ±0.37%     29.80%
MresWaitDrainRate 16.0Pc                      431.19 ±1.36%    983.79 ±0.41%    128.16%
MresWaitDrainRate 16.0Pc Delay1usBeforeSet    432.95 ±1.83%    963.60 ±0.90%    122.57%
MresWaitDrainRate 64.0Pc                      692.40 ±0.19%    602.29 ±0.23%    -13.02%
MresWaitDrainRate 64.0Pc Delay1usBeforeSet    695.47 ±0.23%    602.50 ±0.24%    -13.37%
MresWaitLatency 0.5Pc                        6339.00 ±0.52%  21863.24 ±0.38%    244.90%
MresWaitLatency 0.5Pc Delay1usBeforeSet      3081.19 ±1.14%  12209.38 ±0.36%    296.26%
MresWaitLatency 1.0Pc                       17364.42 ±8.99%  30075.79 ±0.95%     73.20%
MresWaitLatency 1.0Pc Delay1usBeforeSet     33023.26 ±0.67%  35936.94 ±2.12%      8.82%
------------------------------------------  ---------------  ---------------  ---------
Total                                        1026.01 ±1.05%   1536.96 ±0.57%     49.80%

The slight regression at 64 * proc count threads is minor and will be fixed in my next PR.

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score        Right score      ∆ Score %
------------------------------------------  ----------------  ---------------  ---------
MresWaitDrainRate 00.5Pc                       389.50 ±0.25%    399.08 ±0.54%      2.46%
MresWaitDrainRate 00.5Pc Delay1usBeforeSet     395.61 ±0.21%    401.28 ±0.46%      1.43%
MresWaitDrainRate 01.0Pc                        56.17 ±0.50%    329.86 ±0.44%    487.28%
MresWaitDrainRate 01.0Pc Delay1usBeforeSet      56.13 ±0.29%    328.83 ±0.45%    485.79%
MresWaitDrainRate 02.0Pc                        97.80 ±1.37%    632.89 ±0.14%    547.10%
MresWaitDrainRate 02.0Pc Delay1usBeforeSet     104.50 ±2.49%    636.15 ±0.14%    508.77%
MresWaitDrainRate 04.0Pc                       120.89 ±0.03%    713.66 ±0.15%    490.36%
MresWaitDrainRate 04.0Pc Delay1usBeforeSet     119.04 ±0.14%    710.15 ±0.04%    496.56%
MresWaitDrainRate 16.0Pc                       173.55 ±0.25%    887.21 ±0.19%    411.22%
MresWaitDrainRate 16.0Pc Delay1usBeforeSet     173.79 ±0.23%    894.01 ±0.24%    414.44%
MresWaitDrainRate 64.0Pc                       185.82 ±0.19%    978.36 ±0.12%    426.50%
MresWaitDrainRate 64.0Pc Delay1usBeforeSet     186.14 ±0.02%    974.15 ±0.24%    423.35%
MresWaitLatency 0.5Pc                         1310.54 ±1.16%   8461.05 ±0.71%    545.62%
MresWaitLatency 0.5Pc Delay1usBeforeSet       1257.14 ±1.03%   8400.90 ±0.65%    568.25%
MresWaitLatency 1.0Pc                       11191.01 ±18.41%  29964.58 ±4.80%    167.76%
MresWaitLatency 1.0Pc Delay1usBeforeSet     10193.91 ±20.00%  25211.04 ±4.05%    147.31%
------------------------------------------  ----------------  ---------------  ---------
Total                                          322.96 ±3.13%   1364.19 ±0.85%    322.40%

The spin loop is currently optimized for older processors; that will mostly be fixed in my next PR as well.

@stephentoub, could you please take a look?

@redknightlois

@kouvel, from what I see in our data, it looks like there are even more opportunities for improvement at lower levels, but not having access to the code at the kernel level, this is just a guess. The same effects may be stacking on top of the kernel's own spinning.

[image]

I don't know anyone on the kernel team to ask; maybe you do.

@stephentoub (Member) left a review comment

Thanks, @kouvel. I think it's fine. The testing you've done looks good, and these two paths are only used when blocking; I'll be more interested in, and want more scrutiny on, any changes we make around spinning in places where it's not about blocking, e.g. SpinWait, which is used by various concurrent data structures, by the thread pool, etc.

@kouvel (Member, Author) commented Aug 25, 2017

@redknightlois, I'm not seeing any time spent in the kernel on the Skylake processor (in the worst-case scenario above). There might be some context switching happening in your scenario. You could try profiling with PerfView: include "Thread Time" and add the public symbol server to the symbol paths to get function names.

@kouvel (Member, Author) commented Aug 25, 2017

Thanks @stephentoub, my next change will retune SpinWait and normalize Thread.SpinWait; I'm currently doing more testing on that.

@kouvel kouvel merged commit 29d740e into dotnet:master Aug 25, 2017
@kouvel kouvel deleted the MresSpinFix branch August 25, 2017 20:15
@redknightlois

@kouvel Maybe it is a side effect. Let's see if it repeats after your changes hit nightly.

@karelz (Member) commented Aug 28, 2017

@kouvel this PR was marked as servicing (2.0.x), but it was done against master branch (i.e. 2.1).
If you want to bring it into servicing, please file a tracking issue and check with .NET Core shiproom. Thanks!

@karelz karelz modified the milestones: 2.1.0, 2.0.x Aug 28, 2017
@kouvel (Member, Author) commented Aug 28, 2017

Will do. @redknightlois, would you be able to confirm whether this helps with the issue you ran into?

@redknightlois

@kouvel I was waiting for it to show up on the nightly build of the 2.0 branch. :(

@kouvel (Member, Author) commented Aug 28, 2017

Oh sorry, I'll put up a PR for that

kouvel added a commit to kouvel/coreclr that referenced this pull request Aug 28, 2017
Port of dotnet#13556 to release/2.0.0
Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388
Fixes https://github.com/dotnet/coreclr/issues/13630
@kevingosse

We have been avoiding spin-based synchronization primitives for a while now because we've seen that they perform poorly on servers with a large number of cores (32 cores in our case). Task.Wait in particular spins like crazy and eats unreasonable amounts of CPU. I'm eager to see this PR merged :)

@kouvel (Member, Author) commented Aug 29, 2017

@redknightlois, it looks like the 2.0.0 branch is currently closed for 2.0.1. We'll have to wait until it reopens for the next release.

@redknightlois

@kouvel Let me know as soon as it is included. I can probably test this on 2.1, since I have to test a few other optimizations that were done in the JIT.

weshaggard added a commit that referenced this pull request Sep 8, 2017
Don't multiply YieldProcessor count by proc count (ManualResetEventSlim, Task) (#13556)