This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Don't multiply YieldProcessor count by proc count (ManualResetEventSlim, Task) (#13556) #13631

Merged: 1 commit, Sep 8, 2017

kouvel (Member) commented Aug 28, 2017

Port of #13556 to release/2.0.0
Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388
Fixes https://github.com/dotnet/coreclr/issues/13630

  • Multiplying the YieldProcessor count by the proc count can cause excessive, unproductive delays on machines with a large number of procs. Even on a 12-proc (6-core) machine, the existing heuristics without the multiply seem to perform much better.
  • The issue above also mentions that PAUSE on Intel Skylake+ processors has a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that, in both tests, performance begins crawling at low thread counts.
  • I did most of the testing on ManualResetEventSlim. Since Task uses the same spin heuristics, I applied the same change there as well.
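To make the scaling argument concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the backoff shape (`base << i`), the base count of 4, and the 10 spin rounds are assumptions chosen for illustration and are not taken from the CoreCLR sources; only the 10-cycle and 140-cycle PAUSE figures and the 12-proc machine come from the description above.

```python
# Illustrative model only. The base count, number of rounds, and backoff shape
# are hypothetical; they are not values read from the CoreCLR sources.

PAUSE_CYCLES_OLDER = 10     # PAUSE cost quoted above for pre-Skylake processors
PAUSE_CYCLES_SKYLAKE = 140  # PAUSE cost quoted above for Skylake+ processors


def spin_cycles(rounds, pause_cycles, proc_count=1, base=4):
    """Total cycles spent in PAUSE across `rounds` rounds of exponential backoff.

    Round i issues (base << i) PAUSE instructions (assumed shape), multiplied
    by proc_count to model the heuristic that this change removes.
    """
    total_pauses = sum(proc_count * (base << i) for i in range(rounds))
    return total_pauses * pause_cycles


with_multiply = spin_cycles(10, PAUSE_CYCLES_OLDER, proc_count=12)
without_multiply = spin_cycles(10, PAUSE_CYCLES_OLDER)
skylake_with_multiply = spin_cycles(10, PAUSE_CYCLES_SKYLAKE, proc_count=12)

# The multiply inflates the spin time by exactly the processor count.
print(with_multiply // without_multiply)
# The longer PAUSE inflates it by a further 140 / 10 = 14.
print(skylake_with_multiply // with_multiply)
```

Under these assumptions the two factors compound, so a 12-proc Skylake+ machine would spin 12 × 14 = 168 times longer than the same loop without the multiply on an older processor.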

@kouvel added the area-System.Threading, bug, and tenet-performance labels Aug 28, 2017
@kouvel added this to the 2.0.x milestone Aug 28, 2017
@kouvel self-assigned this Aug 28, 2017
@kouvel requested a review from stephentoub August 28, 2017 20:10
kouvel (Member, Author) commented Aug 28, 2017

Here are the perf numbers (from #13556). Left is the baseline and right is with this change.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score       Right score      ∆ Score %
------------------------------------------  ---------------  ---------------  ---------
MresWaitDrainRate 00.5Pc                      422.83 ±0.12%    415.17 ±0.94%     -1.81%
MresWaitDrainRate 00.5Pc Delay1usBeforeSet    370.83 ±0.18%    346.04 ±0.17%     -6.69%
MresWaitDrainRate 01.0Pc                      360.16 ±0.20%    556.28 ±0.37%     54.45%
MresWaitDrainRate 01.0Pc Delay1usBeforeSet    349.43 ±0.18%    513.77 ±0.25%     47.03%
MresWaitDrainRate 02.0Pc                      510.61 ±0.12%    693.86 ±0.16%     35.89%
MresWaitDrainRate 02.0Pc Delay1usBeforeSet    476.63 ±0.24%    654.41 ±0.59%     37.30%
MresWaitDrainRate 04.0Pc                      568.13 ±0.24%    735.37 ±0.68%     29.44%
MresWaitDrainRate 04.0Pc Delay1usBeforeSet    548.97 ±0.23%    712.59 ±0.37%     29.80%
MresWaitDrainRate 16.0Pc                      431.19 ±1.36%    983.79 ±0.41%    128.16%
MresWaitDrainRate 16.0Pc Delay1usBeforeSet    432.95 ±1.83%    963.60 ±0.90%    122.57%
MresWaitDrainRate 64.0Pc                      692.40 ±0.19%    602.29 ±0.23%    -13.02%
MresWaitDrainRate 64.0Pc Delay1usBeforeSet    695.47 ±0.23%    602.50 ±0.24%    -13.37%
MresWaitLatency 0.5Pc                        6339.00 ±0.52%  21863.24 ±0.38%    244.90%
MresWaitLatency 0.5Pc Delay1usBeforeSet      3081.19 ±1.14%  12209.38 ±0.36%    296.26%
MresWaitLatency 1.0Pc                       17364.42 ±8.99%  30075.79 ±0.95%     73.20%
MresWaitLatency 1.0Pc Delay1usBeforeSet     33023.26 ±0.67%  35936.94 ±2.12%      8.82%
------------------------------------------  ---------------  ---------------  ---------
Total                                        1026.01 ±1.05%   1536.96 ±0.57%     49.80%

The slight regression at 64 * proc count threads is minor and is going to be fixed in master.
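For anyone checking the tables, the ∆ Score % column appears to be the relative change of the right score over the left score. The helper below is a small sketch of that arithmetic; the function name is made up for illustration, and the formula is inferred from the published numbers rather than taken from the benchmark harness.

```python
def delta_score_percent(left, right):
    """Relative change of `right` over `left`, in percent (inferred formula)."""
    return (right - left) / left * 100.0


# A few (left, right) score pairs copied from the Sandy Bridge table above.
rows = {
    "MresWaitDrainRate 01.0Pc": (360.16, 556.28),
    "MresWaitDrainRate 16.0Pc": (431.19, 983.79),
    "Total": (1026.01, 1536.96),
}

for name, (left, right) in rows.items():
    print(f"{name}: {delta_score_percent(left, right):+.2f}%")
```

Results can differ from the table in the last digit on some rows, presumably because the published scores are themselves rounded.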

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score        Right score      ∆ Score %
------------------------------------------  ----------------  ---------------  ---------
MresWaitDrainRate 00.5Pc                       389.50 ±0.25%    399.08 ±0.54%      2.46%
MresWaitDrainRate 00.5Pc Delay1usBeforeSet     395.61 ±0.21%    401.28 ±0.46%      1.43%
MresWaitDrainRate 01.0Pc                        56.17 ±0.50%    329.86 ±0.44%    487.28%
MresWaitDrainRate 01.0Pc Delay1usBeforeSet      56.13 ±0.29%    328.83 ±0.45%    485.79%
MresWaitDrainRate 02.0Pc                        97.80 ±1.37%    632.89 ±0.14%    547.10%
MresWaitDrainRate 02.0Pc Delay1usBeforeSet     104.50 ±2.49%    636.15 ±0.14%    508.77%
MresWaitDrainRate 04.0Pc                       120.89 ±0.03%    713.66 ±0.15%    490.36%
MresWaitDrainRate 04.0Pc Delay1usBeforeSet     119.04 ±0.14%    710.15 ±0.04%    496.56%
MresWaitDrainRate 16.0Pc                       173.55 ±0.25%    887.21 ±0.19%    411.22%
MresWaitDrainRate 16.0Pc Delay1usBeforeSet     173.79 ±0.23%    894.01 ±0.24%    414.44%
MresWaitDrainRate 64.0Pc                       185.82 ±0.19%    978.36 ±0.12%    426.50%
MresWaitDrainRate 64.0Pc Delay1usBeforeSet     186.14 ±0.02%    974.15 ±0.24%    423.35%
MresWaitLatency 0.5Pc                         1310.54 ±1.16%   8461.05 ±0.71%    545.62%
MresWaitLatency 0.5Pc Delay1usBeforeSet       1257.14 ±1.03%   8400.90 ±0.65%    568.25%
MresWaitLatency 1.0Pc                       11191.01 ±18.41%  29964.58 ±4.80%    167.76%
MresWaitLatency 1.0Pc Delay1usBeforeSet     10193.91 ±20.00%  25211.04 ±4.05%    147.31%
------------------------------------------  ----------------  ---------------  ---------
Total                                          322.96 ±3.13%   1364.19 ±0.85%    322.40%

The spin loop is currently optimized for older processors; that will be fixed in master as well.

@kouvel changed the title from "Don't multiply YieldProcessor count by proc count (#13556)" to "Don't multiply YieldProcessor count by proc count (ManualResetEventSlim, Task) (#13556)" Aug 28, 2017
@karelz removed the bug and tenet-performance labels Sep 6, 2017
@weshaggard merged commit d0be5fb into dotnet:release/2.0.0 Sep 8, 2017
@kouvel deleted the MresSpinFixRel20 branch September 8, 2017 17:18