This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve Monitor scaling #14216

Merged: 10 commits merged into dotnet:master from the MonitorScaleFix branch on Nov 27, 2017

Conversation

@kouvel (Member) commented Sep 27, 2017

Fixes https://github.com/dotnet/coreclr/issues/13978

  • Refactored AwareLock::m_MonitorHeld into a class LockState with operations to mutate the state (a rough illustrative sketch of such a state word follows this list)
  • Allowed the lock to be taken by a non-waiter when there is a waiter to prevent creating lock convoys
  • Added a bit to LockState to indicate that a waiter is signaled to wake, to avoid waking more than one waiter at a time. A waiter that wakes by observing the signal unsets this bit. See AwareLock::EnterEpilogHelper().
  • Added a spinner count to LockState. Spinners now register and unregister themselves and lock releasers don't wake a waiter when there is a registered spinner (the spinner guarantees to take the lock if it's available when unregistering itself)
    • This was necessary mostly on Windows to reduce CPU usage to the expected level in contended cases with several threads. I believe it's the priority boost Windows gives to signaled threads, which seems to cause waiters to much more frequently succeed in acquiring the lock. This causes a CPU usage problem because once the woken waiter releases the lock, on the next lock attempt it will become a spinner. This keeps repeating, converting several waiters into spinners unnecessarily. Before registering spinners, I saw typically 4-6 spinners under contention (with delays inside and outside the lock) when I expected to have only 1-2 spinners at most.
    • This costs an interlocked operation before and after the spin loop. It doesn't seem too significant, since spinning is a relatively slow path anyway, and the reduction in CPU usage in turn reduces contention on the lock and lets more useful work get done
  • Updated waiters to spin a bit before going back to waiting, reasons are explained in AwareLock::EnterEpilogHelper()
  • Removed AwareLock::Contention() and any references (this removes the 10 repeats of the entire spin loop in that function). With the lock convoy issue gone, this appears to no longer be necessary.
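
For illustration, here is a minimal sketch of the kind of packed lock word the list above describes. The field layout, widths, and names are assumptions made for this example; they are not the actual CoreCLR LockState.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: a packed lock word with an is-locked bit, a
// waiter-signaled-to-wake bit, a small spinner count, and a waiter count.
class IllustrativeLockState
{
    static const uint32_t IsLockedMask       = 1u << 0;
    static const uint32_t WaiterSignaledMask = 1u << 1;
    static const uint32_t SpinnerCountShift  = 2;
    static const uint32_t SpinnerCountMask   = 0x7u << SpinnerCountShift; // small count
    static const uint32_t WaiterCountShift   = 5;
    static const uint32_t WaiterCountMask    = ~0u << WaiterCountShift;

    std::atomic<uint32_t> m_state{0};

public:
    // Any thread may take the lock when the is-locked bit is clear, even if
    // waiters exist -- allowing this is what avoids lock convoys.
    bool TryAcquire()
    {
        uint32_t state = m_state.load(std::memory_order_relaxed);
        while ((state & IsLockedMask) == 0)
        {
            if (m_state.compare_exchange_weak(state, state | IsLockedMask,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
            {
                return true;
            }
            // On failure, state holds the latest value; the loop re-checks it.
        }
        return false;
    }

    // A releaser wakes a waiter only when there is no registered spinner and
    // no waiter already signaled; a registered spinner promises to take the
    // lock if it is available when it unregisters.
    bool ShouldWakeWaiterOnRelease() const
    {
        uint32_t state = m_state.load(std::memory_order_relaxed);
        return (state & WaiterCountMask) != 0 &&
               (state & SpinnerCountMask) == 0 &&
               (state & WaiterSignaledMask) == 0;
    }
};
```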

Perf

  • On Windows, throughput has increased significantly starting at slightly lower than proc count threads. On Linux, latency and throughput have increased more significantly at similar proc counts.
  • Most of the larger regressions are in the unlocked fast paths. The code there hasn't changed and is almost identical (minor layout differences); I'm just considering this noise until we figure out how to get consistently faster code generated.
  • The smaller regressions are within noise range

Processor for numbers below: Core i7 6700 4-core 8-thread

Monitor spin tests

Spin (Windows x64)                                      Left score        Right score       ∆ Score %
------------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 02T                                665.73 ±0.46%     660.39 ±0.54%     -0.80%
MonitorEnterExitLatency 04T                               1499.47 ±0.52%    1502.81 ±0.29%      0.22%
MonitorEnterExitLatency 08T                               1731.28 ±0.10%    1743.89 ±0.14%      0.73%
MonitorEnterExitLatency 16T                               1707.25 ±0.31%    1747.36 ±0.24%      2.35%
MonitorEnterExitThroughput Delay 01T                      5138.81 ±0.09%    5120.83 ±0.10%     -0.35%
MonitorEnterExitThroughput Delay 02T                      4959.48 ±0.15%    4981.27 ±0.10%      0.44%
MonitorEnterExitThroughput Delay 04T                      4462.47 ±0.69%    4760.78 ±0.05%      6.68%
MonitorEnterExitThroughput Delay 08T                      3745.69 ±0.03%    4698.01 ±0.26%     25.42%
MonitorEnterExitThroughput Delay 16T                      3711.20 ±0.34%    4725.44 ±0.38%     27.33%
MonitorEnterExitThroughput_AwareLock 1T                  61200.72 ±0.03%   58933.89 ±0.06%     -3.70%
MonitorEnterExitThroughput_ThinLock 1T                   59430.78 ±0.05%   59396.10 ±0.03%     -0.06%
MonitorReliableEnterExitLatency 02T                        706.79 ±0.24%     705.74 ±0.41%     -0.15%
MonitorReliableEnterExitLatency 04T                       1491.37 ±0.26%    1525.77 ±0.17%      2.31%
MonitorReliableEnterExitLatency 08T                       1722.46 ±0.06%    1703.50 ±0.08%     -1.10%
MonitorReliableEnterExitLatency 16T                       1679.43 ±0.29%    1710.93 ±0.19%      1.88%
MonitorReliableEnterExitThroughput Delay 01T              5064.57 ±0.14%    5083.21 ±0.16%      0.37%
MonitorReliableEnterExitThroughput Delay 02T              4917.33 ±0.11%    4962.94 ±0.09%      0.93%
MonitorReliableEnterExitThroughput Delay 04T              4710.53 ±0.22%    4728.52 ±0.12%      0.38%
MonitorReliableEnterExitThroughput Delay 08T              3733.62 ±0.04%    4745.75 ±0.12%     27.11%
MonitorReliableEnterExitThroughput Delay 16T              3648.78 ±0.37%    4718.97 ±0.31%     29.33%
MonitorReliableEnterExitThroughput_AwareLock 1T          58397.83 ±0.06%   58527.30 ±0.11%      0.22%
MonitorReliableEnterExitThroughput_ThinLock 1T           58441.90 ±0.03%   56825.30 ±0.03%     -2.77%
MonitorTryEnterExitWhenUnlockedThroughput_AwareLock 1T   57540.11 ±0.05%   58440.14 ±0.05%      1.56%
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock 1T    57684.81 ±0.04%   57747.39 ±0.06%      0.11%
MonitorTryEnterWhenLockedThroughput_AwareLock 1T        261834.12 ±0.12%  244767.50 ±0.07%     -6.52%
MonitorTryEnterWhenLockedThroughput_ThinLock 1T         241360.92 ±0.15%  261689.44 ±0.04%      8.42%
------------------------------------------------------  ----------------  ----------------  ---------
Total                                                     7513.73 ±0.19%    7828.46 ±0.16%      4.19%
Spin (Linux x64)                                        Left score        Right score       ∆ Score %
------------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 02T                               3561.29 ±0.64%    3606.19 ±0.31%      1.26%
MonitorEnterExitLatency 04T                               3440.76 ±0.12%    3464.48 ±0.12%      0.69%
MonitorEnterExitLatency 08T                               2292.54 ±0.50%    3429.38 ±0.46%     49.59%
MonitorEnterExitLatency 16T                               2095.67 ±0.67%    3433.30 ±0.30%     63.83%
MonitorEnterExitThroughput Delay 01T                      5043.59 ±0.31%    5008.86 ±0.26%     -0.69%
MonitorEnterExitThroughput Delay 02T                      4972.94 ±0.04%    4955.36 ±0.03%     -0.35%
MonitorEnterExitThroughput Delay 04T                      4707.27 ±0.08%    4650.77 ±0.06%     -1.20%
MonitorEnterExitThroughput Delay 08T                      2637.27 ±0.20%    4601.87 ±0.22%     74.49%
MonitorEnterExitThroughput Delay 16T                      2312.48 ±0.81%    4650.57 ±0.11%    101.11%
MonitorEnterExitThroughput_AwareLock 1T                  58822.27 ±0.03%   57910.71 ±0.10%     -1.55%
MonitorEnterExitThroughput_ThinLock 1T                   57274.55 ±0.15%   56441.84 ±0.22%     -1.45%
MonitorReliableEnterExitLatency 02T                       3558.43 ±0.27%    3553.18 ±0.61%     -0.15%
MonitorReliableEnterExitLatency 04T                       2920.09 ±0.20%    3440.91 ±0.14%     17.84%
MonitorReliableEnterExitLatency 08T                       2269.82 ±0.05%    3052.08 ±6.08%     34.46%
MonitorReliableEnterExitLatency 16T                       2086.47 ±0.75%    3275.11 ±2.67%     56.97%
MonitorReliableEnterExitThroughput Delay 01T              5169.16 ±0.64%    5189.40 ±0.14%      0.39%
MonitorReliableEnterExitThroughput Delay 02T              5075.91 ±0.26%    5074.15 ±0.05%     -0.03%
MonitorReliableEnterExitThroughput Delay 04T              4602.26 ±1.49%    4767.14 ±0.05%      3.58%
MonitorReliableEnterExitThroughput Delay 08T              2773.58 ±0.07%    4744.31 ±0.28%     71.05%
MonitorReliableEnterExitThroughput Delay 16T              2438.12 ±0.66%    4776.94 ±0.13%     95.93%
MonitorReliableEnterExitThroughput_AwareLock 1T          56198.53 ±0.11%   55248.53 ±0.07%     -1.69%
MonitorReliableEnterExitThroughput_ThinLock 1T           56322.95 ±0.11%   55207.00 ±0.24%     -1.98%
MonitorTryEnterExitWhenUnlockedThroughput_AwareLock 1T   54299.42 ±0.10%   54580.95 ±0.07%      0.52%
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock 1T    55292.39 ±0.20%   53787.62 ±0.26%     -2.72%
MonitorTryEnterWhenLockedThroughput_AwareLock 1T        216569.65 ±0.26%  213135.11 ±0.32%     -1.59%
MonitorTryEnterWhenLockedThroughput_ThinLock 1T         225565.05 ±0.68%  230333.25 ±0.22%      2.11%
------------------------------------------------------  ----------------  ----------------  ---------
Total                                                     8698.50 ±0.36%   10232.00 ±0.53%     17.63%

Raw numbers

Code is in https://github.com/dotnet/coreclr/issues/13978

Default spin heuristics

Tc = Thread count
WB = Windows baseline
WC = Windows changed
LB = Linux baseline
LC = Linux changed
L/ms = Locks taken per millisecond
Cpu = Number of threads worth of full CPU usage
Tc WB L/ms Cpu WC L/ms Cpu LB L/ms Cpu LC L/ms Cpu
1 5344.76 1 5262.77 1 5148.87 1 5044.62 1
1 5391.14 1 5433.34 1 5368.91 1 5262.98 1
1 5348.51 1 5296.45 1 5111.95 1 4988.24 1
2 5168.75 2 5110.51 2 4938.35 2 4882.07 2
2 5130.30 2 5130.79 2 4989.40 2 4857.45 2
2 5156.88 2 5117.37 2 4958.15 2 4932.81 2
4 4671.16 4 4885.46 4 4732.73 4 4655.51 3
4 4846.14 4 4847.92 4 4390.75 4 4597.65 3
4 9572.59 4 4839.56 4 4731.39 4 4652.74 3
8 3967.98 8 4832.56 4 2827.30 8 4621.80 3
8 4379.97 8 4901.81 4 2824.45 8 4574.13 3
8 4413.90 8 4826.43 4 2819.81 8 4642.79 3
16 3867.15 8 4826.82 4 2413.83 8 4634.02 3
16 4846.04 8 4815.79 4 2524.47 8 4604.29 3
16 5695.38 8 4853.34 4 2363.39 8 4633.61 3
32 3454.42 8 4615.44 4 1150.34 8 4665.71 3
32 6477.07 8 4508.67 4 1607.75 8 4665.56 3
32 7059.96 8 4479.78 4 1777.20 8 4659.60 3
64 3282.26 8 4611.93 4 0.70 8 4573.43 3
64 4991.26 8 4883.61 4 0.72 8 4672.35 3
64 7521.77 8 4907.73 4 0.56 8 4654.06 3
128 3149.11 8 4836.03 4 0.62 8 4621.83 3
128 8962.21 8 4872.29 4 0.61 8 4565.08 3
128 7503.35 8 4858.68 4 0.72 8 4667.60 3
256 3310.44 8 4755.44 4 82.47 8 4648.06 3
256 6567.14 8 4852.55 4 0.82 8 4626.99 3
256 5420.92 8 4884.91 4 0.79 8 4657.88 3
512 2081.56 8 4900.86 4 0.61 8 4648.02 3
512 4125.43 8 4863.88 4 0.71 8 4578.18 3
512 3685.25 8 4898.62 4 0.77 8 4601.07 3
1024 1436.46 8 4842.57 4 0.61 8 4593.89 3
1024 2900.22 8 4880.87 4 0.91 8 4627.16 3
1024 3808.18 8 4917.50 4 0.81 8 4615.44 3

Lower spin counts

set COMPlus_SpinInitialDuration=0x1
set COMPlus_SpinLimitProcFactor=0x80
set COMPlus_SpinBackoffFactor=0x2
set COMPlus_SpinRetryCount=0x0
Tc WB L/ms Cpu WC L/ms Cpu LB L/ms Cpu LC L/ms Cpu
1 5335.34 1 5264.16 1 4493.90 1 5019.77 1
1 5448.57 1 5439.20 1 4866.98 1 4872.09 1
1 5325.00 1 5297.86 1 5078.65 1 4541.78 1
2 4923.72 2 4859.70 2 4494.79 2 4565.30 2
2 4929.58 2 4834.64 2 4052.52 2 4582.16 2
2 4888.14 2 4865.00 2 3864.39 2 4572.25 2
4 4546.01 4 4357.90 3 90.96 2 3448.15 3
4 4531.11 4 4367.87 3 100.28 2 3449.68 3
4 4541.45 4 4357.42 3 125.34 2 3542.46 3
8 3752.67 8 4317.60 3 86.85 2 4700.69 3
8 3765.35 8 4367.94 3 83.59 2 4592.01 3
8 3716.64 8 4289.56 3 81.16 2 4554.52 3
16 97.13 8 4241.93 3 74.24 2 4550.29 2
16 96.85 8 4242.78 3 84.72 2 4416.01 2
16 97.21 8 4239.82 3 78.49 2 4560.80 2
32 97.11 8 4382.21 3 89.36 2 4405.91 2
32 96.83 8 4371.04 3 85.16 2 4255.65 2
32 97.18 8 4330.79 3 86.11 2 4158.98 2
64 96.86 8 4385.51 3 71.96 2 3839.43 2
64 97.08 8 4375.39 3 83.92 2 4145.21 2
64 96.84 8 4369.57 3 77.99 2 4261.37 2

Spinning disabled

set COMPlus_SpinInitialDuration=0x1
set COMPlus_SpinLimitProcFactor=0x0
set COMPlus_SpinBackoffFactor=0x2
set COMPlus_SpinRetryCount=0x0
Tc WB L/ms Cpu WC L/ms Cpu LB L/ms Cpu LC L/ms Cpu
1 5268.41 1.0 5343.09 1 5030.34 1.0 5034.13 1
1 5464.13 1.0 5348.70 1 5361.13 1.0 5028.44 1
1 5255.47 1.0 5317.01 1 5055.16 1.0 4867.83 1
2 157.00 1.5 2930.07 2 92.38 1.5 2475.49 2
2 272.23 1.5 2964.10 2 93.58 1.5 2418.42 2
2 269.83 1.5 2939.90 2 89.18 1.5 2374.65 2
4 215.94 1.5 2835.81 2 101.05 1.5 2125.89 2
4 213.10 1.5 2713.06 2 99.78 1.5 2002.67 2
4 214.34 1.5 2725.35 2 119.61 1.5 2089.86 2
8 215.03 1.5 2908.95 2 102.79 1.5 4650.46 2
8 219.86 1.5 2853.62 2 86.06 1.5 4718.31 2
8 209.58 1.5 2882.41 2 110.83 1.5 4561.14 2
16 217.18 1.5 2886.65 2 96.64 1.5 4770.49 2
16 213.08 1.5 2884.71 2 94.27 1.5 4687.61 2
16 220.32 1.5 2870.19 2 96.02 1.5 4735.26 2

@kouvel added the area-System.Threading and tenet-performance (Performance related issue) labels on Sep 27, 2017
@kouvel added this to the 2.1.0 milestone on Sep 27, 2017
@kouvel self-assigned this on Sep 27, 2017
@kouvel force-pushed the MonitorScaleFix branch 7 times, most recently from 4d560ef to b2055e7 on September 27, 2017 at 15:56
@kouvel (Member Author) commented Oct 3, 2017

@vancem, would you be able to validate the design? If the design is ok then maybe @tarekgh or @stephentoub could review the code.

@vancem commented Oct 3, 2017

I will take a look later today.

@vancem commented Oct 5, 2017

Sorry for the delay. I started to look at this on Tuesday but got preempted. I will continue today.

@vancem commented Oct 5, 2017

@dotnet-bot test Ubuntu x64 Formatting
@dotnet-bot test Ubuntu x64 Checked Build and Test

@@ -3726,7 +3726,7 @@ ClrDataAccess::GetSyncBlockData(unsigned int SBNumber, struct DacpSyncBlockData
}
#endif // FEATURE_COMINTEROP

pSyncBlockData->MonitorHeld = pBlock->m_Monitor.m_MonitorHeld;
pSyncBlockData->MonitorHeld = pBlock->m_Monitor.m_lockState.VolatileLoad().GetMonitorHeldState();
Review comment:

I would like 'ProbablyMonitorHeld' to be a property and then define that to be m_lockState.VolatileLoad().GetMonitorHeldState(). In general I would like m_lockState to be private to AwareLock.

Member Author (kouvel):

Yea I should have done that, will do

@@ -675,7 +675,7 @@ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"),
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcFactor, W("SpinLimitProcFactor"), 0x4E20, "Hex value specifying the multiplier on NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitConstant, W("SpinLimitConstant"), 0x0, "Hex value specifying the constant to add when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0xA, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0x0, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)
Review comment:

This is now only used in UTSemReadWrite. Is that your intent? It seems like we have not benchmarked the effect of that.

Member Author (kouvel):

No, I should revert this change (it has no impact on Monitor anymore). I had filed an issue to deal with other spin loops that use the same constants, but I forgot about the impact of this change on them; thanks for pointing it out.

@kouvel (Member Author) commented Oct 6, 2017

@dotnet-bot test CentOS7.1 x64 Release Priority 1 Build and Test

@@ -675,7 +675,7 @@ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"),
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)
Review comment:

I realize that your change is not about tuning the spin counts, and this comment should certainly not block merge, but of all the numbers that are 'out of whack', I think SpinLimitProcCap is one that causes grief (on machines with > 8 procs). The fact that you have more processors is only weakly correlated with how valuable additional spinning might be (it requires all the processors to be beating on the same lock, which frankly is never going to work well, so we should not be tuning for it). Simply changing the cap to something like 8 (I would prefer 4) would make me significantly happier. Consider experimenting with this when you have the time.

Member Author (kouvel):

I'm currently working on reducing the spin counts. I was planning on removing the proc count multiplier altogether and just making it a simple configurable max number of iterations, since I found that the multiplier didn't work very well in other cases. It seems like more procs is not by itself a good reason to spin more.

Review comment:

I have always been suspicious that the proc-count multiplier was causing WAY too much spinning (at least for large CPU count machines). Theory says you should spin for the expected time of the delay. Now, more CPUs contending on the same lock (that is otherwise held for a very short time) would argue for more spinning, but probably not by a FACTOR of the MAXIMUM number of CPUs possible (if you have more than 4 CPUs constantly contending for a single lock you already have a scalability problem to fix, so increasing the spinning by more than that seems pointless).

For what it is worth, I still have not given up on the idea that the amount of spinning should be based on HOW LONG IT TOOK THE LAST TIME. Unlike the reader-writer lock (where we didn't want the complexity of doing Stopwatch/QPC timing), you CAN do that easily for Monitor. Thus snap a timestamp when contention starts and stop when the waiter wakes up from any spinning/waits, and you now have a PRECISE estimate of a good amount of time to spin. Keep a running average and you should have a VERY good amount for THAT lock, which seems like something close to optimal. If this time is over some threshold (somewhere between 1 and 30 usec), bail and don't bother spinning at all. This is what I would suggest.

You would only do this for AwareLocks, but frankly any lock that you DO wait on become and awarelock (indeed I would argue that if the first Interlock does not succeed, and the aware lock exists (that is you waited in the past), then you should do this history-based amount of spinning (since you have place to put the timestamps and other logic).
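
For illustration, a rough sketch of the history-based idea being suggested here; the names, the bail-out threshold, and the averaging scheme are assumptions, not code from this PR or CoreCLR.

```cpp
#include <chrono>
#include <cstdint>

// Sketch of history-based spinning: keep a per-lock running average of how
// long contended acquires took, spin for about that long next time, and skip
// spinning entirely if the average says the lock is held too long to bother.
struct AdaptiveSpinEstimate
{
    static const int64_t BailThresholdNs = 30 * 1000;   // ~30 usec: don't spin, just wait

    int64_t averageWaitNs = 0;

    bool ShouldSpin() const
    {
        return averageWaitNs < BailThresholdNs;
    }

    int64_t SpinBudgetNs() const
    {
        return averageWaitNs;   // spin roughly as long as contention lasted before
    }

    void RecordContendedAcquire(std::chrono::steady_clock::time_point contentionStart)
    {
        using namespace std::chrono;
        int64_t observedNs =
            duration_cast<nanoseconds>(steady_clock::now() - contentionStart).count();
        // Exponential moving average keeps the estimate cheap to maintain.
        averageWaitNs = averageWaitNs == 0
                            ? observedNs
                            : (averageWaitNs * 7 + observedNs) / 8;
    }
};
```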

Member Author (kouvel):

Now the more CPUs contending on the same lock (that is otherwise held for a very short time), would argue for more spinning, but probably not by the FACTOR the MAXIMUM number of CPUs possible

It may be better to spin a little more, or it may be better to spin a bit less and have more of those threads wait. Since only one thread can hold the lock at a time, it may also be better to cap the total number of spinners to some low number. I'll do some experiments around these in the future if I have some time.

For what it is worth, I still have not given up on the idea that the amount of spinning should be based on HOW LONG IT TOOK THE LAST TIME.

Agreed, this could be the best thing to do for monitor. I plan on looking into that idea as part of issue https://github.com/dotnet/coreclr/issues/13981 once we have a working low spin count that the dynamic tuning can work with.

@sharwell (Member) commented Oct 6, 2017

@kouvel Is there a way to provide a measure of "balance" (fairness) in the testing?

@tannergooding Your machine is 16C/32T right? I'm interested in seeing hard numbers for bigger machines, if someone is able to do that.

@tannergooding (Member) commented:

Yeah, I have an 8c/16t and a 16c/32t. I can try to test this weekend.

@kouvel (Member Author) commented Oct 6, 2017

@sharwell:

Is there a way to provide a measure of "balance" (fairness) in the testing?

The latency tests give a measure of how quickly another spinning/waiting thread responds to the lock being released. To measure fairness I can try to add tests that measure the total wait duration including spinning and waiting portions. I expect fairness would decrease with this change, at least it should decrease since that's the idea.

@kouvel (Member Author) commented Oct 6, 2017

@tannergooding thanks, that would be great. I can send you PGO'ed binaries (profiled only on some of the tests above) and my perf harness for the tests too if that would be useful.

@kouvel force-pushed the MonitorScaleFix branch 2 times, most recently from 8bed8e9 to 34eebbf on October 28, 2017 at 22:53
@kouvel (Member Author) commented Oct 29, 2017

@dotnet-bot test Ubuntu armlb Cross Debug Innerloop Build

Part 1: Improve Monitor scaling

Fixes https://github.com/dotnet/coreclr/issues/13978

Part 2: Reduce Monitor spinning

Fixes https://github.com/dotnet/coreclr/issues/13980
- Added a new config value, Monitor_SpinCount; Monitor now spins for that many iterations (default is 30, 0x1e). This seems to give a somewhat decent balance between latency, fairness, and throughput. Lower spin counts improve latency and fairness significantly and regress throughput slightly, while higher spin counts improve throughput slightly and regress latency and fairness significantly.
  - The other constants can still be used to disable spinning, but otherwise they are no longer used by Monitor
- Decreased the number of bits used for tracking the spinner count to 3. This seems to be more than enough since only one thread can take a lock at a time, and it prevents spikes of unnecessary CPU usage (a rough sketch of the capped registration follows this list).
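
A rough sketch of what a capped spinner count can look like; the names, the cap, and keeping the count in a standalone atomic (rather than packed into the lock word) are simplifications for illustration only.

```cpp
#include <atomic>
#include <cstdint>

// Sketch: a 3-bit spinner count allows at most 7 registered spinners; once the
// cap is reached, additional contending threads skip spinning and go wait.
const uint32_t MaxSpinners = 7;               // what fits in 3 bits
const uint32_t DefaultMonitorSpinCount = 30;  // Monitor_SpinCount default (0x1e)

bool TryRegisterSpinner(std::atomic<uint32_t>& spinnerCount)
{
    uint32_t count = spinnerCount.load(std::memory_order_relaxed);
    while (count < MaxSpinners)
    {
        if (spinnerCount.compare_exchange_weak(count, count + 1,
                                               std::memory_order_relaxed))
        {
            return true;   // registered; caller may spin up to the spin count
        }
        // count was refreshed by the failed compare-exchange; re-check the cap.
    }
    return false;          // too many spinners already; caller should wait instead
}
```

Since only one thread can hold the lock at a time, a low cap mostly avoids extra threads burning CPU without improving the chance that the lock is picked up quickly.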

Tried some things that didn't pan out:
- Sleep(0) doesn't seem to add anything to the spin loop, so I left it out. Instead of Sleep(0) it can just proceed to waiting. Waiting is more expensive than Sleep(0), but I didn't see that benefit in the tests. Omitting Sleep(0) also keeps the spin loop very short (a few microseconds max).
- Increasing the average YieldProcessor() duration per spin iteration improved throughput slightly but regressed latency and fairness very quickly. Given that fairness is generally worse with part 1 of this change above, it felt like a better compromise to take a small reduction in throughput for larger improvements in latency and fairness.
- Tried adding a very small % of lock releases that randomly wake a waiter despite there being spinners, to improve fairness. This improved fairness noticeably, but not as much as decreasing the spin count slightly, and it made latency and throughput worse more quickly. After reducing the % to a point where I was hardly seeing fairness improvements, there were still noticeable latency and throughput regressions.

Miscellaneous
- Moved the YieldProcessorNormalized code into separate files so that it can be included earlier and where needed
- Added a max for the "optimal max normalized yields per spin iteration" since it has the potential to be very large on machines where YieldProcessor may be implemented as a no-op, in which case it's probably not worth spinning for the full duration (a rough sketch of this normalization follows this list)
- Refactored duplicate code in the portable versions of MonEnterWorker, MonEnter, and MonReliableEnter. MonTryEnter has a slightly different structure, so I did not refactor that.
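
The "normalized yields" point above can be pictured roughly as follows; the constants and names are assumptions for illustration, not the values used by the change.

```cpp
#include <algorithm>
#include <cstdint>

// Rough sketch of normalizing spin iterations by measured pause latency.
// Processors differ widely in how long a pause/yield instruction takes, so a
// fixed iteration count spins for very different wall-clock times unless it
// is normalized, and a near-no-op yield needs a cap to avoid a huge loop.
uint32_t NormalizedYieldsPerSpinIteration(uint32_t measuredNsPerYield)
{
    const uint32_t targetNsPerSpinIteration = 100;  // assumed time budget per iteration
    const uint32_t maxYieldsPerSpinIteration = 10;  // assumed cap for near-no-op yields

    if (measuredNsPerYield == 0)
    {
        measuredNsPerYield = 1;
    }
    uint32_t yields = targetNsPerSpinIteration / measuredNsPerYield;
    return std::min(std::max(yields, 1u), maxYieldsPerSpinIteration);
}
```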

Perf
- Throughput is a bit lower than before at lower thread counts and better at medium-to-high thread counts. It's a bit lower at lower thread counts for two reasons:
  - A shorter spin loop means the lock will be polled more frequently because the exponential backoff does not get as high, making it more likely for a spinner to steal the lock from another thread and causing that thread to sometimes wait early
  - The duration of YieldProcessor() calls per spin iteration has decreased, so a spinner or spinning waiter is more likely to take the lock; the rest is similar to the above
- For the same reasons as above, latency is better than before. Fairness is better on Windows and worse on Linux compared to the baseline due to differences between these platforms in the baseline. Latency also differs between Windows and Linux in the baseline; I suspect those differences are due to differences in scheduling.
- Performance now scales appropriately on processors with different pause delays
@kouvel (Member Author) commented Nov 16, 2017

Added a commit that mitigates waiter starvation and rebased other commits.

Normally, threads are allowed to preempt waiters to acquire the lock. There are cases where waiters can be easily starved as a result. For example, a thread that holds a lock for a significant amount of time (much longer than the time it takes to do a context switch), then releases and reacquires the lock in quick succession, and repeats. Though a waiter would be woken upon lock release, usually it will not have enough time to context-switch-in and take the lock, and can be starved for an unreasonably long duration.

In order to prevent such starvation and force a bit of fair forward progress, it is sometimes necessary to change the normal policy and disallow threads from preempting waiters. A new bit was added to LockState and ShouldNotPreemptWaiters() indicates the current state of the policy.

  • When the first waiter begins waiting, it records the current time as a "waiter starvation start time". That is a point in time after which no forward progress has occurred for waiters. When a waiter acquires the lock, the time is updated to the current time.
  • Before a spinner begins spinning, and when a waiter is signaled to wake, it checks whether the starvation duration has crossed a threshold (currently one second) and, if so, sets ShouldNotPreemptWaiters() (a rough sketch of this check follows this comment)

When unreasonable starvation is occurring, the lock will still be released occasionally, and if the starvation is caused by spinners, spinners will keep starting to spin. So:

  • Before starting to spin, if ShouldNotPreemptWaiters() is set, the spinner will skip spinning and wait instead. Spinners that are already registered at the time ShouldNotPreemptWaiters() is set will stop spinning as necessary. Eventually, all spinners will drain and no new ones will be registered.
  • After spinners have drained, only a waiter will be able to acquire the lock. When a waiter acquires the lock, or when the last waiter unregisters itself, ShouldNotPreemptWaiters() is cleared to restore the normal policy.

There was no change to perf from before. In the example above, where a waiter was starved for a very long time before the change (on the order of tens of seconds to minutes), the waiter now acquires the lock about once per second.
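
A rough sketch of the policy check described in this comment, with assumed names and storage, and with synchronization of the timestamp itself elided.

```cpp
#include <atomic>
#include <chrono>

// Sketch only: the check a thread performs before registering as a spinner,
// and that a signaled waiter performs on waking. The "waiter starvation start
// time" marks the last point at which waiters made forward progress.
struct WaiterStarvationPolicy
{
    std::atomic<bool> shouldNotPreemptWaiters{false};
    std::chrono::steady_clock::time_point waiterStarvationStart{};

    void OnFirstWaiterStartedWaiting()
    {
        waiterStarvationStart = std::chrono::steady_clock::now();
    }

    void OnWaiterAcquiredLock()
    {
        // Waiters made progress: reset the clock and restore the normal policy.
        waiterStarvationStart = std::chrono::steady_clock::now();
        shouldNotPreemptWaiters.store(false, std::memory_order_relaxed);
    }

    // Returns true if spinners should step aside and let waiters take the lock.
    bool CheckAndUpdate()
    {
        const auto threshold = std::chrono::seconds(1);   // assumed threshold
        if (std::chrono::steady_clock::now() - waiterStarvationStart >= threshold)
        {
            shouldNotPreemptWaiters.store(true, std::memory_order_relaxed);
        }
        return shouldNotPreemptWaiters.load(std::memory_order_relaxed);
    }
};
```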

@kouvel (Member Author) commented Nov 16, 2017

@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test

@@ -4427,69 +4427,18 @@ NOINLINE static void JIT_MonEnter_Helper(Object* obj, BYTE* pbLockTaken, LPVOID
}

/*********************************************************************/
NOINLINE static void JIT_MonContention_Helper(Object* obj, BYTE* pbLockTaken, LPVOID __me)
Review comment:

I am trying to rationalize the value of this MonContention helper. This is uniformly called as a 'second last chance' before bailing and calling the 'slow' MonEnter_Helper path. Why do we not simply skip this and call the slow helper and be done with it? At this point we don't care about CPU, as we need to kill some time anyway.

Member Author (kouvel):

Not sure I follow, as that's kind of what I did: I removed the MonContention helper. The main purpose of it before the changes was to repeat the entire spin loop 10 more times before waiting. I suspect it was necessary mainly to avoid lock convoys, as spinning more helps to avoid them with the previous design. After the changes, there is a first-level fast path check for the unlocked case (one try for the lock), followed by a spin loop with a default of 30 iterations, then the slow path that waits.

Review comment:

Sorry, my mistake, ignore my comment. I was looking at the wrong branch on my machine and confused myself about the fact that this code is being removed.

SyncBlock *syncBlock = g_pSyncTable[oldValue & MASK_SYNCBLOCKINDEX].m_SyncBlock;
_ASSERTE(syncBlock != NULL);
AwareLock *awareLock = &syncBlock->m_Monitor;
if (awareLock->EnterHelper(pCurThread, true /* checkRecursiveCase */))

AwareLock::EnterHelperResult result = awareLock->TryEnterBeforeSpinLoopHelper(pCurThread);
Review comment:

I realize that what I am about to comment on is not part of the delta in this PR, but I am trying to rationalize the whole code flow here. My mental model for the EnterObjMonitorHelperSpin helper is that it was there to avoid making an AwareLock (otherwise even ONE contention on a lock at any time would force the AwareLock to be created). However, we have an AwareLock at this point, so the question is: why can't we just return UseSlowPath and avoid this code? This code should be a clone of what is done in AwareLock, and the extra cost of getting there (making an AwareLock (but we have one) and setting up a transition frame (but we need to spin anyway)) should not be an issue. It would be nice to simply remove this code. What prevents this?

Member Author (@kouvel, Nov 18, 2017):

I was thinking about that; it's kind of a pity to have to force-inline the moderately slow-path spinning code (which it used to do in the asm helpers to avoid calls). The overhead of the transition is not something I fully understand, but my high-level guess is that taking the slow path is equivalent in overhead to a p/invoke, which is fairly high. I'm not sure if it will matter though, probably worth a test. Moving spinning to the slow path would also significantly improve register allocation in the fast path since it removes the last offending call. I plan on looking into this more closely as part of https://github.com/dotnet/coreclr/issues/14571.

Yes, avoiding creating the aware lock is another advantage of spinning more (when it's a thin lock anyway). With this change a monitor would transition into an aware lock much more easily; there are perf/memory tradeoffs involved and I think this is a decent one.

Spinning can't be avoided altogether once we have an aware lock. Spinning just a little bit significantly improves perf under contention and avoids excessive context switching in scenarios involving smaller time scales, which I'm sure are very common.

Regarding moving the spin path to the slow path, I don't think it matters whether we have an aware lock already or not; it's the same kind of situation for a thin lock.

Review comment:

Just FYI: yes, a HelperMethodFrame transition is like a PINVOKE, but it is slower (because it does something called epilog walking to set up the frame). I have not measured it recently, but a normal optimized PINVOKE is 20-30 clock ticks; HelperMethodFrames are probably 2-3 times that. There is some goo to get from there to AwareLock.Enter(), where we again try to take the lock with an interlocked operation. But it seems like that overhead is basically on the same order as what we would spin for, so we should not be too concerned about the overhead.

As you say, it has the benefit of making the register allocation leaner for the paths that remain. But I also care about the fact that we are cloning non-trivial code (we have a very similar spin-wait loop in EnterEpilogHelper). The less cloned code we have the better (for example, we don't do the ETW contention logging in TryEnterObjMonitorSpinHelper, and arguably we should). It would also be SO easy for the logic that counts spinners to be off in one place but not the other.

I agree we need spinning in the aware-lock case. I just want it done one place if possible, and it seems like we are doing it in EnterEpilogHelper so that should be enough.

The value of trying this now, is that we don't have to deal with any bugs associated with the code we are about to remove. I suppose it could wait until #14571, but if you see no issue, and removing it does not cause regressions it is a very nice improvement in my opinion...

On the issue of spinning to avoid creating an AwareLock: I would not spin more just to avoid AwareLock creation. If spinning just a bit more is what saved you, odds are that sooner or later you will 'fall off' that edge and need to make the lock. The key part is to handle the case where there is little to no contention (which is common in many cases).

Review comment:

I take back some of what I said above. I see that the logic in EnterEpilogHelper does the Wait before it does any spinning (it does the spinning on waking up). So you can't just remove the code in EnterEpilogHelper without losing spinning in the AwareLock case before waiting.

I must admit I would prefer that it work the way I thought it worked (that there is no AwareLock logic in EnterObjMonitorHelperSpin; we always bail to the slow helper). This would require AwareLock.EnterEpilog to spin before waiting (it is the same loop, you just spin before you wait). Thus all AwareLock logic would now be centralized (LockState actually does not need to be in a header file; it is only used in AwareLock's code in SyncBlk.cpp).

But that is not a trivial change (not huge, but not trivial), and it would need to be perf-tested. I will not ask you to make that change to get this fix checked in.

Please do consider it, however. I think it would simplify the code and as you mention, may help #14571.

Review comment:

For what it is worth, it seems that EnterObjMonitorHelper and EnterObjMonitorHelperSpin are only called from one place, TryEnterObjMonitorSpinHelper, and there are some tricky conditions that are cloned in each of these methods. You could effectively combine them into EnterObjMonitorHelperSpin, which would be pretty much just EnterObjMonitorHelper with a loop that calls SpinWait.

This of course only works well if we first kick all the AwareLock stuff out of EnterObjMonitorHelperSpin, as suggested above.

While I like this idea, I am not suggesting you do it, just giving you food for thought.

Member Author (kouvel):

RE HelperMethodFrame overhead, the overhead may even be worth it as it would add a potentially beneficial delay before the first try in the spin loop. Can't be sure until I try it.

The value of trying this now, is that we don't have to deal with any bugs associated with the code we are about to remove. I suppose it could wait until #14571, but if you see no issue, and removing it does not cause regressions it is a very nice improvement in my opinion...

I would like to keep this effort separate from this PR due to time constraints. My estimation is that this is not a short task. If it would be beneficial in avoiding bugs, we could consider postponing this PR but in that event a smaller PR that fixes the YieldProcessor issue for Skylake processors would be necessary in the interim in case the full set of changes don't make it into 2.1.

But I also care about the fact that we are cloning non-trivial code (we have a very similar spin wait loop in EnterEpilogHelper). The less cloned code we have the better
It would also be SO easy for the logic that counts spinners to be off in one place but not the other

The spin loops in EnterObjMonitorHelperSpin and EnterEpilogHelper are distinct and most of their logic is not shared:

  • EnterObjMonitorHelperSpin spin loop for AwareLock
    • Before spin loop, atomically attempt to acquire lock or register spinner (must not register spinner if lock is available)
    • In spin loop, if lock is available, atomically acquire lock and unregister spinner
    • After the spin loop, unregister the spinner, but if the lock was available at the time of unregistering, the lock must be acquired and the spinner unregistered atomically, unless the lock is stolen by another thread; the comment there explains why (a rough sketch of this transition follows below)
  • EnterEpilogHelper spin loop
    • Precondition: waiter is registered
    • In spin loop, if lock is available, atomically acquire lock and clear waiter-signaled-to-wake bit
    • After spin loop, clear waiter-signaled-to-wake bit, but if lock was available at the time of clearing the bit, the lock must be acquired and the waiter unregistered atomically unless the lock is stolen by another thread. If lock is not acquired, it will proceed to waiting again.

There are also differences relevant to mitigating waiter starvation. There is some similarity in the flow and it's likely possible to refactor the flow to be shared, but to me it doesn't seem worth it.
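
To illustrate the "must acquire the lock if it's available when unregistering" invariant described in the first list above, here is a rough sketch of that transition as a single compare-exchange; the bit layout and names are assumptions, not the PR's code.

```cpp
#include <atomic>
#include <cstdint>

// Sketch: a spinner leaving the spin loop must either take the lock if it
// happens to be free, or unregister itself, in one atomic step -- otherwise a
// releaser that skipped waking a waiter because it saw a spinner could leave
// the lock free while every other thread sleeps.
const uint32_t IsLockedBit      = 1u << 0;
const uint32_t SpinnerCountUnit = 1u << 2;   // spinner count starts at bit 2 in this sketch

// Returns true if the lock was acquired while unregistering, false if only
// the spinner registration was removed.
bool UnregisterSpinnerAndAcquireIfAvailable(std::atomic<uint32_t>& lockWord)
{
    uint32_t state = lockWord.load(std::memory_order_relaxed);
    for (;;)
    {
        bool acquiring = (state & IsLockedBit) == 0;
        uint32_t newState = acquiring
            ? (state | IsLockedBit) - SpinnerCountUnit   // take the lock and unregister together
            : state - SpinnerCountUnit;                  // lock is held: just unregister
        if (lockWord.compare_exchange_weak(state, newState,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed))
        {
            return acquiring;
        }
        // state was reloaded by the failed compare-exchange; retry with the new value
        // (covers the case where another thread steals the lock in the meantime).
    }
}
```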

we don't do the ETW contention logging in TryEnterObjMonitorSpinHelper, and arguably we should

This would be far too verbose. Considering how short the spin duration is, from the perspective of the event I think it's fair to consider that there is no contention until waiting is necessary. It may even already be too verbose compared to before, but it's something that can be fixed later if it turns out to be an issue.

For what it is worth, it seems that EnterObjMonitorHelper EnterObjMonitorHelperSpin are only called from one place TryEnterObjMonitorSpinHelper, and there are some tricky conditions that are cloned in each of these methods.

Agreed, and there's certainly more refactoring that can be done to simplify this code. At this point though, I think a separate PR would be appropriate for those kinds of changes.

You could effectively combine them to EnterObjMonitorHelperSpin which pretty much just EnterObjMonitorHelper with a loop that calls SpinWait

This is actually what I tried initially but this showed significant perf regressions on Linux and I decided to keep them separate and not inline the spin loop. On Windows there was not much difference in perf. Since we're thinking about moving the spinning to the slow helper path anyway, it may make sense to keep them separate at least until that idea can be tested.

Member Author (kouvel):

Filed issue https://github.com/dotnet/coreclr/issues/15095 to track monitor code cleanup

@vancem commented Nov 20, 2017

I have set my approval for this change. I do want to spend a bit more time looking over the code (I am basically creating a PR where I put a bunch of comments in to document the design), but what is clear is that this is a good step in the right direction. We should proceed.

@kouvel (Member Author) commented Nov 27, 2017

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@kouvel (Member Author) commented Nov 27, 2017

Thanks @vancem, I'll go ahead and merge the changes once everything passes. Let me know if you have any other feedback or if some parts could use more documentation.

@kouvel merged commit 93ddc4d into dotnet:master on Nov 27, 2017
@kouvel deleted the MonitorScaleFix branch on November 27, 2017 at 18:28