Conversation
Force-pushed from 4d560ef to b2055e7
@vancem, would you be able to validate the design? If the design is ok then maybe @tarekgh or @stephentoub could review the code.
I will take a look later today.
Sorry for the delay; I started to look at this on Tues, but got preempted. I will continue today.
@dotnet-bot test Ubuntu x64 Formatting
Force-pushed from b2055e7 to 8c2268b
src/debug/daccess/request.cpp (Outdated)
@@ -3726,7 +3726,7 @@ ClrDataAccess::GetSyncBlockData(unsigned int SBNumber, struct DacpSyncBlockData
 }
 #endif // FEATURE_COMINTEROP

-pSyncBlockData->MonitorHeld = pBlock->m_Monitor.m_MonitorHeld;
+pSyncBlockData->MonitorHeld = pBlock->m_Monitor.m_lockState.VolatileLoad().GetMonitorHeldState();
I would like 'ProbablyMonitorHeld' to be a property and then define that to be m_lockState.VolatileLoad().GetMonitorHeldState(). In general I would like m_lockState to be private to AwareLock.
Yea I should have done that, will do
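For illustration, a minimal self-contained sketch of the accessor shape being suggested here. LockState, the bit layout, and the method names below are stand-ins, not the PR's actual code:

```cpp
#include <atomic>
#include <cstdint>

// Stand-in for the PR's LockState: a packed snapshot of the lock word.
class LockState
{
    uint32_t m_state;
public:
    explicit LockState(uint32_t state) : m_state(state) {}
    // The real type decodes the is-locked bit and waiter count; the raw
    // snapshot is returned here to keep the sketch small.
    uint32_t GetMonitorHeldState() const { return m_state; }
};

class AwareLock
{
    // Kept private, per the review comment above.
    std::atomic<uint32_t> m_lockState{0};

    LockState VolatileLoad() const
    {
        return LockState(m_lockState.load(std::memory_order_acquire));
    }

public:
    // "Probably" held: the snapshot can be stale the moment it is returned,
    // which is fine for diagnostic readers such as the DAC code above.
    uint32_t GetProbablyHeldState() const
    {
        return VolatileLoad().GetMonitorHeldState();
    }
};
```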
src/inc/clrconfigvalues.h (Outdated)
@@ -675,7 +675,7 @@ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"),
 RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)
 RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcFactor, W("SpinLimitProcFactor"), 0x4E20, "Hex value specifying the multiplier on NumProcs to use when calculating the maximum spin duration", EEConfig_default)
 RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitConstant, W("SpinLimitConstant"), 0x0, "Hex value specifying the constant to add when calculating the maximum spin duration", EEConfig_default)
-RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0xA, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)
+RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0x0, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)
This is now only used in UTSemReadWrite. Is that your intent? Seems like we have not benchmarked the effect of that.
No, I should revert this change (it has no impact on Monitor anymore). I had filed an issue to deal with other spin loops that use the same constants, but I forgot about the impact of this change. Thanks for pointing it out.
Force-pushed from a0c4f95 to b5c407c
@dotnet-bot test CentOS7.1 x64 Release Priority 1 Build and Test
src/inc/clrconfigvalues.h (Outdated)
@@ -675,7 +675,7 @@ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"),
 RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)
I realize that your change is not about tuning the spin counts, and this comment should certainly not block merge, but of all the numbers that are 'out of whack', I think SpinLimitProcCap is the one that causes grief (on machines with > 8 procs). The fact that you have more processors is really only weakly correlated with how valuable additional spinning might be (it requires all the processors to be beating on the same lock, which frankly is never going to work well, so we should not be tuning for it). Simply changing the cap to something like 8 (I would prefer 4) would make me significantly happier. Consider experimenting with this when you have the time.
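For context, a sketch of how a maximum spin duration could be derived from the config values shown in this file, going only by their descriptions above (cap NumProcs, multiply by a factor, add a constant); the runtime's actual formula may differ:

```cpp
#include <algorithm>
#include <cstdint>

// Values mirror the defaults quoted in clrconfigvalues.h above.
struct SpinConfig
{
    uint32_t spinLimitProcCap    = 0xFFFFFFFF; // largest NumProcs considered
    uint32_t spinLimitProcFactor = 0x4E20;     // multiplier on NumProcs
    uint32_t spinLimitConstant   = 0x0;        // constant added at the end
};

// Hypothetical reading of the descriptions: clamp the processor count first,
// then scale and offset it to get the max spin duration (in spin-loop units).
uint64_t MaxSpinDuration(uint32_t numProcs, const SpinConfig& cfg)
{
    uint32_t cappedProcs = std::min(numProcs, cfg.spinLimitProcCap);
    return static_cast<uint64_t>(cappedProcs) * cfg.spinLimitProcFactor
         + cfg.spinLimitConstant;
}
// With the cap lowered to 4 or 8 as suggested, a 32-proc machine would spin
// no longer than a 4- or 8-proc machine does.
```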
I'm currently working on reducing the spin counts. I was planning on removing the proc count multiplier altogether and just making it a simple configurable max number of iterations; I found that it didn't work very well in other cases. It seems like more procs is not a good reason by itself to spin more.
I have always been suspicious that the proc-count multiplier was causing WAY too much spinning (at least for large CPU count machines). Theory says you should spin for the expected time of the delay. Now, more CPUs contending on the same lock (one that is otherwise held for a very short time) would argue for more spinning, but probably not by a FACTOR of the MAXIMUM number of CPUs possible (if you have more than 4 CPUs constantly contending for a single lock you already have a scalability problem to fix, so increasing the spinning by more than that seems pointless).
For what it is worth, I still have not given up on the idea that the amount of spinning should be based on HOW LONG IT TOOK THE LAST TIME. Unlike the reader-writer lock (where we didn't want the complexity of doing Stopwatch/QPC timing), you CAN do that easily for monitor. Thus snap a timestamp when contention starts and stop it when the wait wakes up from any spinning/waits, and you now have a PRECISE estimate of a good amount of time to spin. Keep a running average and you should have a VERY good amount for THAT lock, which seems like something close to optimal. If this time is over some threshold (somewhere between 1 and 30usec), bail and don't bother spinning at all. This is what I would suggest.
You would only do this for AwareLocks, but frankly any lock that you DO wait on becomes an AwareLock. Indeed, I would argue that if the first interlocked attempt does not succeed and the AwareLock exists (that is, you waited in the past), then you should do this history-based amount of spinning (since you have a place to put the timestamps and other logic).
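A sketch of the history-based idea described above: timestamp when contention starts, keep a running average of how long acquisition took, and use that as the spin budget, bailing entirely past a threshold. The class, the averaging weights, and the 30us cutoff are illustrative assumptions, not runtime code:

```cpp
#include <chrono>

class ContentionHistory
{
    // Running (exponentially weighted) average of how long past contended
    // acquires took, measured from the first failed attempt to acquisition.
    std::chrono::nanoseconds m_avgAcquireTime{0};

public:
    void RecordAcquire(std::chrono::nanoseconds elapsed)
    {
        // New average = 7/8 old + 1/8 new; keeps a per-lock history cheaply.
        m_avgAcquireTime = (m_avgAcquireTime * 7 + elapsed) / 8;
    }

    // Spin roughly as long as it took last time, but skip spinning entirely
    // when history says the wait is long. The 30us cutoff is picked from the
    // 1-30usec range mentioned in the comment, not a measured value.
    std::chrono::nanoseconds SpinBudget() const
    {
        using namespace std::chrono_literals;
        if (m_avgAcquireTime > 30us)
            return 0ns; // don't spin, go straight to waiting
        return m_avgAcquireTime;
    }
};
```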
> Now, more CPUs contending on the same lock (one that is otherwise held for a very short time) would argue for more spinning, but probably not by a FACTOR of the MAXIMUM number of CPUs possible
It may be better to spin a little more, or it may be better to spin a bit less and have more of those threads wait. Since only one thread can hold the lock at a time, it may also be better to cap the total number of spinners to some low number. I'll do some experiments around these in the future if I have some time.
> For what it is worth, I still have not given up on the idea that the amount of spinning should be based on HOW LONG IT TOOK THE LAST TIME.
Agreed, this could be the best thing to do for monitor. I plan on looking into that idea as part of issue https://github.com/dotnet/coreclr/issues/13981 once we have a working low spin count that the dynamic tuning can work with.
@kouvel Is there a way to provide a measure of "balance" (fairness) in the testing? @tannergooding Your machine is 16C/32T right? I'm interested in seeing hard numbers for bigger machines, if someone is able to do that.
Yeah, I have an 8c/16t and a 16c/32t. I can try to test this weekend.
The latency tests give a measure of how quickly another spinning/waiting thread responds to the lock being released. To measure fairness I can try to add tests that measure the total wait duration, including the spinning and waiting portions. I expect fairness would decrease with this change; at least it should decrease since that's the idea.
@tannergooding thanks, that would be great. I can send you PGO'ed binaries (profiled only on some of the tests above) and my perf harness for the tests too if that would be useful.
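A sketch of the kind of fairness measurement discussed above: record how long each contended acquire takes end to end (spinning plus waiting) and compare the spread across threads. The harness below is hypothetical and uses a generic lock as a stand-in for Monitor:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Per-thread record of time spent blocked acquiring a contended lock.
// A wide spread of maxWait (or totalWait) across threads suggests poor fairness.
struct WaitStats
{
    std::chrono::nanoseconds totalWait{0};
    std::chrono::nanoseconds maxWait{0};
    uint64_t acquires = 0;
};

// Wraps any lock with a lock()/unlock() interface (std::mutex would stand in
// for Monitor.Enter here) and records the end-to-end wait, spinning plus blocking.
template <typename Lock>
void TimedAcquire(Lock& lock, WaitStats& stats)
{
    auto start = std::chrono::steady_clock::now();
    lock.lock();
    auto waited = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - start);

    stats.totalWait += waited;
    stats.maxWait = std::max(stats.maxWait, waited);
    ++stats.acquires;
}
```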
Force-pushed from b5c407c to 4bb7a6f
Force-pushed from 8bed8e9 to 34eebbf
Force-pushed from 34eebbf to 071d392
@dotnet-bot test Ubuntu armlb Cross Debug Innerloop Build
Force-pushed from 071d392 to c02b433
Part 1: Improve Monitor scaling

Fixes https://github.com/dotnet/coreclr/issues/13978

- Refactored AwareLock::m_MonitorHeld into a class LockState with operations to mutate the state
- Allowed the lock to be taken by a non-waiter when there is a waiter, to prevent creating lock convoys
- Added a bit to LockState to indicate that a waiter is signaled to wake, to avoid waking more than one waiter at a time. A waiter that wakes by observing the signal unsets this bit. See AwareLock::EnterEpilogHelper().
- Added a spinner count to LockState. Spinners now register and unregister themselves, and lock releasers don't wake a waiter when there is a registered spinner (the spinner guarantees to take the lock if it's available when unregistering itself); see the sketch after this description.
  - This was necessary mostly on Windows to reduce CPU usage to the expected level in contended cases with several threads. I believe it's the priority boost Windows gives to signaled threads, which seems to cause waiters to succeed in acquiring the lock much more frequently. This causes a CPU usage problem because once the woken waiter releases the lock, on the next lock attempt it will become a spinner. This keeps repeating, converting several waiters into spinners unnecessarily. Before registering spinners, I typically saw 4-6 spinners under contention (with delays inside and outside the lock) when I expected to have only 1-2 spinners at most.
  - It costs an interlocked operation before and after the spin loop. That doesn't seem to be too significant since spinning is a relatively slow path anyway, and the reduction in CPU usage in turn reduces contention on the lock and lets more useful work get done.
- Updated waiters to spin a bit before going back to waiting; the reasons are explained in AwareLock::EnterEpilogHelper()
- Removed AwareLock::Contention() and any references (this removes the 10 repeats of the entire spin loop in that function). With the lock convoy issue gone, this appears to no longer be necessary.

Perf
- On Windows, throughput has increased significantly starting at slightly lower than proc count threads. On Linux, latency and throughput have increased more significantly at similar proc counts.
- Most of the larger regressions are in the unlocked fast paths. The code there hasn't changed and is almost identical (minor layout differences); I'm just considering this noise until we figure out how to get consistently faster code generated.
- The smaller regressions are within noise range.

Part 2: Reduce Monitor spinning

Fixes https://github.com/dotnet/coreclr/issues/13980

- Added new config value Monitor_SpinCount; Monitor spins for that many iterations, with a default of 30 (0x1e). This seems to give a somewhat decent balance between latency, fairness, and throughput. Lower spin counts improve latency and fairness significantly and regress throughput slightly; higher spin counts improve throughput slightly and regress latency and fairness significantly.
- The other constants can still be used to disable spinning, but otherwise they are no longer used by Monitor.
- Decreased the number of bits used for tracking spinner count to 3. This seems to be more than enough since only one thread can take a lock at a time, and it prevents spikes of unnecessary CPU usage.

Tried some things that didn't pan out:
- Sleep(0) doesn't seem to add anything to the spin loop, so left it out. Instead of Sleep(0) it can just proceed to waiting. Waiting is more expensive than Sleep(0), but I didn't see that benefit in the tests. Omitting Sleep(0) also keeps the spin loop very short (a few microseconds max).
- Increasing the average YieldProcessor() duration per spin iteration improved throughput slightly but regressed latency and fairness very quickly. Given that fairness is generally worse with part 1 of this change above, it felt like a better compromise to take a small reduction in throughput for larger improvements in latency and fairness.
- Tried having a very small % of lock releases randomly wake a waiter despite there being spinners, to improve fairness. This improved fairness noticeably, but not as much as decreasing the spin count slightly, and it made latency and throughput worse more quickly. After reducing the % to a point where I was hardly seeing fairness improvements, there were still noticeable latency and throughput regressions.

Miscellaneous
- Moved YieldProcessorNormalized code into separate files so that it can be included earlier and where needed
- Added a max for "optimal max normalized yields per spin iteration" since it has a potential to be very large on machines where YieldProcessor may be implemented as a no-op, in which case it's probably not worth spinning for the full duration
- Refactored duplicate code in the portable versions of MonEnterWorker, MonEnter, and MonReliableEnter. MonTryEnter has a slightly different structure, so did not refactor that.

Perf
- Throughput is a bit lower than before at lower thread counts and better at medium-high thread counts. It's a bit lower at lower thread counts for two reasons:
  - A shorter spin loop means the lock will be polled more frequently because the exponential backoff does not get as high, making it more likely for a spinner to steal the lock from another thread, causing the other thread to sometimes wait early
  - The duration of YieldProcessor() calls per spin iteration has decreased and a spinner or spinning waiter is more likely to take the lock; the rest is similar to above
- For the same reasons as above, latency is better than before. Fairness is better on Windows and worse on Linux compared to baseline due to the baseline having differences between these platforms. Latency also has differences between Windows/Linux in the baseline; I suspect those are due to differences in scheduling.
- Performance now scales appropriately on processors with different pause delays
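A simplified, self-contained sketch of the packed lock word described above (locked bit, waiter-signaled-to-wake bit, 3-bit spinner count), with spinner registration done via compare-exchange. The bit layout and method names are illustrative, not the PR's exact LockState:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative layout: bit 0 = locked, bit 1 = a waiter has been signaled to
// wake, bits 2-4 = spinner count (3 bits), remaining bits = waiter count.
class LockState
{
    static constexpr uint32_t IsLockedMask       = 1u << 0;
    static constexpr uint32_t WaiterSignaledMask = 1u << 1;
    static constexpr uint32_t SpinnerCountShift  = 2;
    static constexpr uint32_t SpinnerCountMask   = 0x7u << SpinnerCountShift;

    std::atomic<uint32_t> m_state{0};

public:
    // A releaser only wakes a waiter when no spinner is registered and no
    // waiter has already been signaled: a registered spinner guarantees to
    // take the lock if it is available when it unregisters itself.
    bool ShouldWakeWaiterOnRelease() const
    {
        uint32_t s = m_state.load(std::memory_order_acquire);
        return (s & SpinnerCountMask) == 0 && (s & WaiterSignaledMask) == 0;
    }

    // Register as a spinner, but only while the lock is held and the 3-bit
    // spinner count has room; returns false if the caller should retry the
    // acquire or fall back to waiting instead.
    bool TryRegisterSpinner()
    {
        uint32_t s = m_state.load(std::memory_order_relaxed);
        while ((s & IsLockedMask) != 0 &&
               (s & SpinnerCountMask) != SpinnerCountMask)
        {
            uint32_t next = s + (1u << SpinnerCountShift);
            if (m_state.compare_exchange_weak(s, next, std::memory_order_acquire))
                return true;
        }
        return false;
    }

    // The real code additionally acquires the lock atomically if it turned out
    // to be available at unregistration time; elided here for brevity.
    void UnregisterSpinner()
    {
        m_state.fetch_sub(1u << SpinnerCountShift, std::memory_order_release);
    }
};
```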
Force-pushed from c02b433 to 9f94b68
Added a commit that mitigates waiter starvation and rebased the other commits.

Normally, threads are allowed to preempt waiters to acquire the lock. There are cases where waiters can be easily starved as a result. For example, a thread that holds a lock for a significant amount of time (much longer than the time it takes to do a context switch), then releases and reacquires the lock in quick succession, and repeats. Though a waiter would be woken upon lock release, usually it will not have enough time to context-switch in and take the lock, and it can be starved for an unreasonably long duration. In order to prevent such starvation and force a bit of fair forward progress, it is sometimes necessary to change the normal policy and disallow threads from preempting waiters. A new bit was added to LockState, and ShouldNotPreemptWaiters() indicates the current state of the policy. When unreasonable starvation is occurring, the lock will still be released occasionally, and if the starvation is caused by spinners, spinners will keep starting to spin and taking the lock, so spinners have to respect the new policy as well.

There was no change to perf from before. In the example above, where before the change a waiter is starved for a very long time (on the order of tens of seconds to minutes), after the change the waiter acquires the lock once per second.
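To make the starvation scenario concrete, here is a small stand-alone repro of the pattern described, using std::mutex in place of Monitor: one thread holds the lock for much longer than a context switch, releases it, and immediately reacquires it, so a blocked waiter rarely wins the race. The timings are arbitrary:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

int main()
{
    std::mutex lock;
    std::atomic<bool> stop{false};

    // Greedy thread: holds the lock much longer than a context switch, then
    // releases and reacquires immediately, leaving almost no window open.
    std::thread greedy([&] {
        while (!stop.load())
        {
            lock.lock();
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
            lock.unlock();
            // No pause here: the reacquire often wins before the waiter
            // finishes context-switching in.
        }
    });

    // Waiter: measure how long one acquisition takes while the greedy thread runs.
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // let greedy take the lock first
    auto start = std::chrono::steady_clock::now();
    lock.lock();
    auto waited = std::chrono::steady_clock::now() - start;
    lock.unlock();
    std::printf("waiter blocked for %lld ms\n",
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(waited).count());

    stop.store(true);
    greedy.join();
}
```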
Force-pushed from 9f94b68 to fef475e
Normally, threads are allowed to preempt waiters to acquire the lock. There are cases where waiters can be easily starved as a result. For example, a thread that holds a lock for a significant amount of time (much longer than the time it takes to do a context switch), then releases and reacquires the lock in quick succession, and repeats. Though a waiter would be woken upon lock release, usually it will not have enough time to context-switch in and take the lock, and it can be starved for an unreasonably long duration.

In order to prevent such starvation and force a bit of fair forward progress, it is sometimes necessary to change the normal policy and disallow threads from preempting waiters. A new bit was added to LockState, and ShouldNotPreemptWaiters() indicates the current state of the policy (see the sketch below).
- When the first waiter begins waiting, it records the current time as a "waiter starvation start time". That is a point in time after which no forward progress has occurred for waiters. When a waiter acquires the lock, the time is updated to the current time.
- Before a spinner begins spinning, and when a waiter is signaled to wake, it checks whether the starvation duration has crossed a threshold (currently one second) and if so, sets ShouldNotPreemptWaiters().

When unreasonable starvation is occurring, the lock will still be released occasionally, and if the starvation is caused by spinners, spinners will keep starting to spin, so:
- Before starting to spin, if ShouldNotPreemptWaiters() is set, the spinner will skip spinning and wait instead. Spinners that are already registered at the time ShouldNotPreemptWaiters() is set will stop spinning as necessary. Eventually, all spinners will drain and no new ones will be registered.
- After spinners have drained, only a waiter will be able to acquire the lock. When a waiter acquires the lock, or when the last waiter unregisters itself, ShouldNotPreemptWaiters() is cleared to restore the normal policy.
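A sketch of the policy flip described above: track when waiters last made forward progress and, past the one-second threshold, stop letting spinners preempt them. Only the threshold and the ShouldNotPreemptWaiters() name come from the description; the rest is illustrative:

```cpp
#include <atomic>
#include <chrono>

class WaiterStarvationPolicy
{
    using Clock = std::chrono::steady_clock;
    static constexpr std::chrono::seconds StarvationThreshold{1};

    std::atomic<bool> m_shouldNotPreemptWaiters{false};
    // In the real design this is updated under the lock-state transitions;
    // synchronization of the timestamp is elided in this sketch.
    Clock::time_point m_waiterStarvationStartTime{};

public:
    // Called when the first waiter begins waiting, and again whenever a waiter
    // acquires the lock (forward progress resets the measurement).
    void RecordWaiterProgress() { m_waiterStarvationStartTime = Clock::now(); }

    // Called before a spinner starts spinning and when a waiter is signaled;
    // only meaningful while at least one waiter exists.
    void CheckForStarvation()
    {
        if (Clock::now() - m_waiterStarvationStartTime >= StarvationThreshold)
            m_shouldNotPreemptWaiters.store(true, std::memory_order_release);
    }

    // Spinners consult this: if set, skip spinning and wait instead, so that
    // spinners drain and only a waiter can take the lock.
    bool ShouldNotPreemptWaiters() const
    {
        return m_shouldNotPreemptWaiters.load(std::memory_order_acquire);
    }

    // Called when a waiter acquires the lock or the last waiter unregisters.
    void RestoreNormalPolicy()
    {
        m_shouldNotPreemptWaiters.store(false, std::memory_order_relaxed);
    }
};
```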
Force-pushed from fef475e to 98d7dbc
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@@ -4427,69 +4427,18 @@ NOINLINE static void JIT_MonEnter_Helper(Object* obj, BYTE* pbLockTaken, LPVOID
 }

-/*********************************************************************/
-NOINLINE static void JIT_MonContention_Helper(Object* obj, BYTE* pbLockTaken, LPVOID __me)
I am trying to rationalize the value of this MonContention helper. It is uniformly called as a 'second last chance' before bailing and calling the 'slow' MonEnter_Helper path. Why do we not simply skip this and call the slow helper and be done with it? At this point we don't care about CPU as we need to kill some time anyway.
Not sure I follow, as that's kind of what I did: I removed the MonContention helper. The main purpose of it before the changes was to repeat the entire spin loop 10 more times before waiting. I suspect it was necessary mainly to avoid lock convoys, as spinning more helps to avoid them with the previous design. After the changes, there is a first-level fast path check for the unlocked case (one try for the lock), followed by a spin loop with a default of 30 iterations, then the slow path that waits.
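A control-flow sketch of the enter path as just described: one fast-path attempt, a bounded spin loop defaulting to 30 iterations, then the waiting slow path. The function bodies below are stand-ins (an atomic flag instead of the object header/sync block), not the VM helpers:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

static std::atomic<bool> g_locked{false};

// Single interlocked attempt (the unlocked fast path).
static bool TryAcquireOnce()
{
    bool expected = false;
    return g_locked.compare_exchange_strong(expected, true);
}

// Stand-in for the waiting slow path: here we just yield until the lock frees;
// the VM registers a waiter and blocks on the lock's event instead.
static void EnterSlowPathAndWait()
{
    while (!TryAcquireOnce())
        std::this_thread::yield();
}

static void MonitorEnterSketch(uint32_t spinCount = 30 /* Monitor_SpinCount default */)
{
    // 1. Fast path: one try for the uncontended case.
    if (TryAcquireOnce())
        return;

    // 2. Bounded spin loop (default 30 iterations), pausing between polls.
    for (uint32_t i = 0; i < spinCount; ++i)
    {
        std::this_thread::yield(); // stands in for the YieldProcessor() backoff
        if (TryAcquireOnce())
            return;
    }

    // 3. Slow path: give up spinning and wait.
    EnterSlowPathAndWait();
}
```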
Sorry, my mistake, ignore my comment. I was looking at the wrong branch on my machine and confused myself about the fact that this code is being removed.
 SyncBlock *syncBlock = g_pSyncTable[oldValue & MASK_SYNCBLOCKINDEX].m_SyncBlock;
 _ASSERTE(syncBlock != NULL);
 AwareLock *awareLock = &syncBlock->m_Monitor;
-if (awareLock->EnterHelper(pCurThread, true /* checkRecursiveCase */))
+AwareLock::EnterHelperResult result = awareLock->TryEnterBeforeSpinLoopHelper(pCurThread);
I realize that what I am about to comment on is not part of the delta in this PR, but I am trying to rationalize the whole code flow here. My mental model for the EnterObjMonitorHelperSpin helper is that it was there to avoid making an awareLock (otherwise even ONE contention on a lock at any time would force the awareLock to be created). However we have an awareLock at this point, so the question is why can't we just return UseSlowPath and avoid this code? This code should be a clone of what is done in AwareLock, and the extra cost of getting there (making an awareLock (but we have one) and setting up a transition frame (but we need to spin anyway)) should not be an issue. It would be nice to simply remove this code. What prevents this?
I was thinking about that; it's kind of a pity to have to force-inline the moderately slow-path spinning code (which it used to do in the asm helpers to avoid calls). The overhead of the transition is not something I fully understand, but my high-level guess is that taking the slow path is equivalent in overhead to a p/invoke, which is fairly high. I'm not sure if it will matter though, probably worth a test. Moving spinning to the slow path would also significantly improve register allocation in the fast path since it removes the last offending call. I plan on looking into this more closely as part of https://github.com/dotnet/coreclr/issues/14571.
Yes avoiding creating the aware lock is another advantage of spinning more (when it's a thin lock anyway). With this change a monitor would transition into an aware lock much more easily, there are perf/memory tradeoffs involved and I think this is a decent one.
Spinning can't be avoided altogether once we have an aware lock. Spinning just a little bit significantly improves perf under contention and avoids excessive context switching in scenarios involving smaller time scales, which I'm sure are very common.
Regarding moving the spin path to the slow path, I don't think it matters whether we have an aware lock already or not; it's the same kind of situation for a thin lock.
Just FYI: yes, a HelperMethodFrame transition is like a PINVOKE, but it is slower (because it does something called epilog walking to set up the frame). I have not measured it recently, but a normal optimized PINVOKE is 20-30 clock ticks. HelperMethodFrames are probably 2-3 times that. There is some goo to get from there to AwareLock.Enter(), where we again try to take the lock with an interlocked operation. But it seems like that overhead is basically on the same order as what we would spin for, so we should not be too concerned about the overhead.
As you say it has the benefit of making the register allocation leaner for the paths that remain. But I also care about the fact that we are cloning non-trivial code (we have a very similar spin wait loop in EnterEpilogHelper). The less cloned code we have the better (for example we don't do the ETW contention logging in TryEnterObjMonitorSpinHelper, and arguably we should. It would also be SO easy for the logic that counts spinners to be off in one place but not the other).
I agree we need spinning in the aware-lock case. I just want it done in one place if possible, and it seems like we are doing it in EnterEpilogHelper, so that should be enough.
The value of trying this now, is that we don't have to deal with any bugs associated with the code we are about to remove. I suppose it could wait until #14571, but if you see no issue, and removing it does not cause regressions it is a very nice improvement in my opinion...
On the issue of spinning to avoid creating an awareLock: I would not spin more just to avoid AwareLock creation. If spinning just a bit more is what saved you, odds are that sooner or later you will 'fall off' that edge and need to make the lock. The key part is to handle the case where there is little to no contention (which is common in many cases).
I take back some of what I said above. I see that the logic in EnterEpilogHelper does the Wait before it does any spinning (it does the spinning on waking up). So you can't just remove the code in EnterEpilogHelper without losing spinning in the AwareLock case before waiting.
I must admit I would prefer that it work the way I thought it worked (that there is no logic in EnterObjMonitorHelperSpin for AwareLock; we always bail to the slow helper). This would require AwareLock.EnterEpilog to spin before waiting (it is the same loop, you just spin before you wait). Thus all AwareLock logic is now centralized (LockState actually does not need to be in a header file; it is only used in AwareLock's code in SyncBlk.cpp).
But that is not a trivial change (not huge, but not trivial), and it would need to be perf-tested. I will not ask you to make that change to get this fix checked in.
Please do consider it, however. I think it would simplify the code and as you mention, may help #14571.
For what it is worth, it seems that EnterObjMonitorHelper and EnterObjMonitorHelperSpin are only called from one place, TryEnterObjMonitorSpinHelper, and there are some tricky conditions that are cloned in each of these methods. You could effectively combine them into EnterObjMonitorHelperSpin, which is pretty much just EnterObjMonitorHelper with a loop that calls SpinWait.
This of course only works well if we first kick all the AwareLock stuff out of EnterObjMonitorHelperSpin as suggested above.
While I like this idea, I am not suggesting you do it, just giving you food for thought.
Re: HelperMethodFrame overhead, the overhead may even be worth it, as it would add a potentially beneficial delay before the first try in the spin loop. Can't be sure until I try it.
> The value of trying this now is that we don't have to deal with any bugs associated with the code we are about to remove. I suppose it could wait until #14571, but if you see no issue, and removing it does not cause regressions, it is a very nice improvement in my opinion...
I would like to keep this effort separate from this PR due to time constraints. My estimation is that this is not a short task. If it would be beneficial in avoiding bugs, we could consider postponing this PR, but in that event a smaller PR that fixes the YieldProcessor issue for Skylake processors would be necessary in the interim, in case the full set of changes doesn't make it into 2.1.
> But I also care about the fact that we are cloning non-trivial code (we have a very similar spin wait loop in EnterEpilogHelper). The less cloned code we have the better
> It would also be SO easy for the logic that counts spinners to be off in one place but not the other
The spin loops in EnterObjMonitorHelperSpin and EnterEpilogHelper are distinct and most of their logic is not shared:
- EnterObjMonitorHelperSpin spin loop for AwareLock
  - Before the spin loop, atomically attempt to acquire the lock or register a spinner (must not register a spinner if the lock is available)
  - In the spin loop, if the lock is available, atomically acquire the lock and unregister the spinner
  - After the spin loop, unregister the spinner, but if the lock was available at the time of unregistering the spinner, the lock must be acquired and the spinner unregistered atomically unless the lock is stolen by another thread; the comment there explains why
- EnterEpilogHelper spin loop
  - Precondition: waiter is registered
  - In the spin loop, if the lock is available, atomically acquire the lock and clear the waiter-signaled-to-wake bit
  - After the spin loop, clear the waiter-signaled-to-wake bit, but if the lock was available at the time of clearing the bit, the lock must be acquired and the waiter unregistered atomically unless the lock is stolen by another thread. If the lock is not acquired, it will proceed to waiting again.

There are also differences relevant to mitigating waiter starvation. There is some similarity in the flow, and it's likely possible to refactor the flow to be shared, but to me it doesn't seem worth it.
> we don't do the ETW contention logging in TryEnterObjMonitorSpinHelper, and arguably we should
This would be far too verbose. Considering how short the spin duration is, from the perspective of the event I think it's fair to consider that there is no contention until waiting is necessary. It may even already be too verbose compared to before, but it's something that can be fixed later if it turns out to be an issue.
> For what it is worth, it seems that EnterObjMonitorHelper and EnterObjMonitorHelperSpin are only called from one place, TryEnterObjMonitorSpinHelper, and there are some tricky conditions that are cloned in each of these methods.
Agreed, and there's certainly more refactoring that can be done to simplify this code. At this point though, I think a separate PR would be appropriate for those kinds of changes.
> You could effectively combine them into EnterObjMonitorHelperSpin, which is pretty much just EnterObjMonitorHelper with a loop that calls SpinWait
This is actually what I tried initially but this showed significant perf regressions on Linux and I decided to keep them separate and not inline the spin loop. On Windows there was not much difference in perf. Since we're thinking about moving the spinning to the slow helper path anyway, it may make sense to keep them separate at least until that idea can be tested.
Filed issue https://github.com/dotnet/coreclr/issues/15095 to track monitor code cleanup
I have set my approval for this change. I do want to spend a bit more time looking over the code (I am basically creating a PR where I put a bunch of comments in to document the design), but what is clear is that this is a good step in the right direction. We should proceed.
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
Thanks @vancem, I'll go ahead and merge the changes once everything passes. Let me know if you have any other feedback or if some parts could use more documentation.
Fixes https://github.com/dotnet/coreclr/issues/13978
Perf
Processor for numbers below: Core i7 6700 4-core 8-thread
Monitor spin tests (results collected with the default spin heuristics, with lower spin counts, and with spinning disabled): the raw numbers and the test code are in https://github.com/dotnet/coreclr/issues/13978.