This repository has been archived by the owner on Nov 1, 2020. It is now read-only.

Port normalized SpinWait from CoreCLR #7569

Merged (4 commits) on Jul 1, 2019


MichalStrehovsky (Member)

This ports yieldprocessornormalized.cpp from CoreCLR to tune spin wait loops in the post-Skylake CPU world.

The first commit is a literal copy of yieldprocessornormalized.cpp from CoreCLR that was then tweaked to build for Redhawk.

I then replaced uses of YieldProcessor with the Skylake-aware compat shim.
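
For readers new to the idea, here is a rough sketch of what "normalized" yielding means. It is not the code from yieldprocessornormalized.cpp. The names `RawYield`, `MeasureNsPerRawYield`, `YieldsPerNormalizedYield` and `YieldProcessorNormalized` are hypothetical, as are the target duration passed by the caller and the 1024 cap. The point is only that the cost of one pause instruction is measured at startup, and each "normalized" yield then issues as many raw pauses as it takes to reach roughly the same wall-clock duration on any CPU.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
// One raw spin-wait hint (the x86 PAUSE instruction).
inline void RawYield() { _mm_pause(); }
#else
#include <atomic>
// Portable stand-in for non-x86 targets (assumption: no pause intrinsic used here).
inline void RawYield() { std::atomic_signal_fence(std::memory_order_seq_cst); }
#endif

// Measure the average cost of one raw yield in nanoseconds.
inline double MeasureNsPerRawYield(unsigned iterations = 100000) {
    if (iterations == 0) return 0.0;
    auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i) RawYield();
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() / iterations;
}

// How many raw yields approximate one normalized yield of `targetNs`.
// Rounds to nearest and clamps to [1, 1024]; the cap is an illustrative choice.
inline unsigned YieldsPerNormalizedYield(double nsPerRawYield, double targetNs) {
    if (nsPerRawYield <= 0.0) return 1;
    double yields = std::floor(targetNs / nsPerRawYield + 0.5);
    return static_cast<unsigned>(std::min(std::max(yields, 1.0), 1024.0));
}

// Issue one normalized yield using a precomputed raw-yield count.
inline void YieldProcessorNormalized(unsigned yieldsPerNormalizedYield) {
    for (unsigned i = 0; i < yieldsPerNormalizedYield; ++i) RawYield();
}
```

A caller would run `MeasureNsPerRawYield()` once, cache the result of `YieldsPerNormalizedYield(...)`, and use `YieldProcessorNormalized(...)` inside spin loops. On a CPU with a slow pause the cached count drops toward 1; on one with a fast pause it grows.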

MichalStrehovsky (Member, Author) commented Jun 29, 2019

Used Kount's SpinPerf code to collect numbers. Had to tweak it because it didn't build (missing method).

This is on Xeon E3-1245 v6 @ 3.70 GHz. I can dig out my Haswell laptop from a closet if we want pre-Skylake numbers. CoreCLR is 3.0 Preview 5 because that's what I had conveniently installed.

| Benchmark | Before | After | Delta | CoreCLR |
| --- | --- | --- | --- | --- |
| BarrierSyncRate 1PcT | 26.86 | 29.25 | 9% | 28.28 |
| ConcurrentQueueThroughput 1PcT | 9082.13 | 9271.87 | 2% | 9325.11 |
| ConcurrentStackThroughput 1PcT | 6251.63 | 6430.59 | 3% | 6301.28 |
| MresWaitDrainRate 1PcT | 40.33 | 54.04 | 34% | 52.21 |
| MresWaitDrainRate 2PcT | 25.37 | 40.26 | 59% | 37.47 |
| MresWaitLatency 1PcT | 439.94 | 445.96 | 1% | 484.34 |
| SemaphoreSlimLatency 1PcT | 1756.35 | 2284.01 | 30% | 1796.78 |
| SemaphoreSlimLatency 2PcT | 1667.21 | 2202.48 | 32% | 1734.17 |
| SemaphoreSlimThroughput 1PcT | 2599.12 | 2733.44 | 5% | 2286.15 |
| SemaphoreSlimWaitDrainRate 1PcT | 40.18 | 54.14 | 35% | 51.46 |
| SemaphoreSlimWaitDrainRate 2PcT | 24.79 | 36.92 | 49% | 37.54 |
| SpinLockLatency 1PcT | 4211.62 | 4607.28 | 9% | 4207.46 |
| SpinLockLatency 2PcT | 4028.39 | 4490.54 | 11% | 4019.35 |
| SpinLockThroughput 1PcT | 4779.73 | 5089 | 6% | 4765.36 |
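
The Delta column is consistent with the relative change from Before to After, rounded to the nearest whole percent. A minimal sketch under that assumption (the function name `PercentDelta` is hypothetical, not part of SpinPerf):

```cpp
#include <cmath>

// Relative change from `before` to `after`, in whole percent (rounded to nearest).
// Assumes `before` is non-zero.
inline long PercentDelta(double before, double after) {
    return std::lround((after - before) / before * 100.0);
}
```

For example, `PercentDelta(40.33, 54.04)` reproduces the 34% reported for MresWaitDrainRate 1PcT.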

jkotas (Member) commented Jun 29, 2019

Unix needs some extra fixes...

Suchiman (Contributor) commented Jun 30, 2019

Here are some raw Haswell (4670K @4.4GHz) numbers:
Before:

BarrierSyncRate 1PcT
70,341996
69,915073
71,035333
70,248012

ConcurrentQueueThroughput 1PcT
9605,155970
9621,879857
9516,166882
9359,665356

ConcurrentStackThroughput 1PcT
7179,575568
6334,053947
7170,971167
7188,655256

MresWaitDrainRate 1PcT
131,845465
132,506988
130,984297
134,758706

MresWaitDrainRate 2PcT
73,708220
74,208652
73,091992
73,820528

MresWaitLatency 1PcT
512,637230
538,457943
499,153861
531,477160

SemaphoreSlimLatency 1PcT
1877,195241
1869,491168
1868,935959
1865,377957

SemaphoreSlimLatency 2PcT
1828,077844
1798,082668
1828,532467
1824,753298

SemaphoreSlimThroughput 1PcT
1954,290373
1957,994458
1958,056185
1958,026181

SemaphoreSlimWaitDrainRate 1PcT
114,216970
113,210307
115,019391
113,714408

SemaphoreSlimWaitDrainRate 2PcT
67,041994
66,485955
66,436042
66,379332

SpinLockLatency 1PcT
5323,774612
5380,206025
5378,140305
5354,159621

SpinLockLatency 2PcT
5381,776164
5380,976931
5348,939410
5383,719561

SpinLockThroughput 1PcT
5842,048544
5848,857409
5847,233789
5858,883928

After:

BarrierSyncRate 1PcT
66,449858
68,003405
66,470811
68,208431

ConcurrentQueueThroughput 1PcT
9157,934637
9126,206760
9086,941639
9050,571662

ConcurrentStackThroughput 1PcT
7125,358564
7101,395783
7126,153424
7105,813184

MresWaitDrainRate 1PcT
134,690538
129,931941
134,056907
129,204834

MresWaitDrainRate 2PcT
74,092911
73,842070
74,251382
73,886536

MresWaitLatency 1PcT
552,165670
598,423027
553,881599
594,999767

SemaphoreSlimLatency 1PcT
1750,622296
1747,146622
1748,422989
1753,711839

SemaphoreSlimLatency 2PcT
1739,382888
1765,899388
1687,785219
1772,881071

SemaphoreSlimThroughput 1PcT
2096,872421
2097,618508
2103,964933
2096,765015

SemaphoreSlimWaitDrainRate 1PcT
114,891985
113,582622
114,334818
112,741161

SemaphoreSlimWaitDrainRate 2PcT
67,997962
54,668996
67,851640
66,773430

SpinLockLatency 1PcT
5277,165098
5212,697389
5271,188661
5243,604638

SpinLockLatency 2PcT
5274,098585
5251,157196
5264,401063
5260,326823

SpinLockThroughput 1PcT
5938,249506
5927,238345
5882,466624
5931,865108

CoreCLR (3.0 Preview 6)

BarrierSyncRate 1PcT
74,207863
77,278749
79,577568
76,282353

ConcurrentQueueThroughput 1PcT
8734,131936
8887,539613
8965,578464
8766,309302

ConcurrentStackThroughput 1PcT
7036,496254
7056,767426
7069,026607
7058,440651

MresWaitDrainRate 1PcT
128,234249
134,551936
131,005640
133,318340

MresWaitDrainRate 2PcT
69,374806
72,200551
70,507173
72,221059

MresWaitLatency 1PcT
321,758224
303,178902
326,289792
313,272428

SemaphoreSlimLatency 1PcT
1803,898876
1832,005398
1827,538311
1841,242495

SemaphoreSlimLatency 2PcT
1746,636562
1761,714577
1750,910690
1768,413738

SemaphoreSlimThroughput 1PcT
2359,041317
2414,732205
2408,315563
2412,433632

SemaphoreSlimWaitDrainRate 1PcT
113,799425
115,255330
113,245960
115,304856

SemaphoreSlimWaitDrainRate 2PcT
64,515150
53,949951
64,472743
65,634167

SpinLockLatency 1PcT
4468,704422
4597,162531
4590,935598
4518,089804

SpinLockLatency 2PcT
4533,679118
4589,707419
4604,564341
4567,319241

SpinLockThroughput 1PcT
5385,665111
5550,878978
5565,250713
5552,709845
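
The raw runs above use a comma as the decimal separator, so they need converting before they can be compared with the earlier table. A small sketch of one way to parse and average the four runs of a benchmark (the helper names `ParseCommaDecimal` and `MeanOfRuns` are hypothetical, and it assumes the default "C" locale is in effect for `std::stod`):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Convert a value such as "70,341996" (comma decimal separator) to a double.
inline double ParseCommaDecimal(std::string text) {
    std::replace(text.begin(), text.end(), ',', '.');
    return std::stod(text);
}

// Arithmetic mean of a set of runs given as comma-decimal strings.
// Returns 0.0 for an empty set.
inline double MeanOfRuns(const std::vector<std::string>& runs) {
    if (runs.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& run : runs) sum += ParseCommaDecimal(run);
    return sum / static_cast<double>(runs.size());
}
```

Applied to the four "Before" runs of BarrierSyncRate 1PcT, this gives a mean of about 70.39.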

jkotas merged commit b49f959 into dotnet:master on Jul 1, 2019
MichalStrehovsky deleted the spinWait branch on July 1, 2019 07:53

MichalStrehovsky (Member, Author)

Thanks for the measurements @Suchiman!

3 participants