Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cache lock provider allowing multiple locks to be acquired. #299

Merged
merged 28 commits into from
Mar 20, 2024

Conversation

niemyjski
Copy link
Member

@niemyjski niemyjski commented Mar 18, 2024

I noticed that multiple locks could be acquired. I've added more test coverage to try and expose the issue.

Here are some logs that lead me down this path:

44:28.293 T: Got lock released message: test (fvT2ltV31jWIJVbh) - CacheLockProvider
44:28.293 T: Finished calling subscriber action: 65b73751c9eb4a7aa8333f26345b816d - InMemoryMessageBus
44:28.293 T: Done enqueueing message to 1 subscribers for message type Foundatio.Lock.CacheLockReleased, Foundatio - InMemoryMessageBus
44:28.293 T: Added cache key: lock:test Updated=True - InMemoryCacheClient
44:28.293 D: Removing expired cache entry lock:test - InMemoryCacheClient
44:28.293 D: Released lock: test (fvT2ltV31jWIJVbh) - CacheLockProvider
44:28.293 T: Disposed lock (fvT2ltV31jWIJVbh) test - CacheLockProvider
44:28.293 T: Added cache key: lock:test - InMemoryCacheClient
44:28.293 D: Attempting to acquire lock (rMPQS00k5jriIlwD): test - CacheLockProvider
44:28.293 D: Acquired lock (CtO2HC1SbCLWHUWx) test after 0:00:00.1030847 - CacheLockProvider
44:28.293 T: Added cache key: lock:test - InMemoryCacheClient
44:28.293 D: Acquired lock (rMPQS00k5jriIlwD) test after 0:00:00.0000115 - CacheLockProvider
44:28.294 T: Disposing lock (rMPQS00k5jriIlwD) test - CacheLockProvider

After some more fixes to how we update AddAsync

09:53.915 D: Released lock: test (xtpVPARC2g7Rd0Vk) - CacheLockProvider
09:53.915 T: Disposed lock (xtpVPARC2g7Rd0Vk) test - CacheLockProvider
09:53.915 D: Attempting to acquire lock (sBxxiPzf8fHqLLQO): test - CacheLockProvider
09:53.915 T: Added cache key: lock:test Replacing existing expired entry - InMemoryCacheClient
09:53.915 T: Added cache key: lock:test Replacing existing expired entry - InMemoryCacheClient
09:53.915 D: Acquired lock (sBxxiPzf8fHqLLQO) test after 0:00:00.0000345 - CacheLockProvider
09:53.915 D: Acquired lock (cCUuUXR3vPpxuoTd) test after 0:00:00.099548 - CacheLockProvider

Investigated further and saw that it was way more stable, but it failed when maintenance is run

56:59.834 T: DoMaintenance - InMemoryCacheClient
56:59.834 T: Calling subscriber action: 28842d366dca40a694ac2ee823330f5b - InMemoryMessageBus
56:59.834 D: DoMaintenance: Removing expired key lock:test OriginalExpiresAt=uhoLiJcnts5dI5W6 CurrentExpiresAt=uhoLiJcnts5dI5W6 - InMemoryCacheClient
56:59.834 T: Got lock released message: test (uhoLiJcnts5dI5W6) - CacheLockProvider
56:59.834 T: Finished calling subscriber action: 28842d366dca40a694ac2ee823330f5b - InMemoryMessageBus
56:59.834 T: Replacing expired cache key: lock:test Value=O3AMARH32E61q77i ExistingValue=uhoLiJcnts5dI5W6 - InMemoryCacheClient
56:59.834 T: Added cache key: lock:test - InMemoryCacheClient
56:59.834 T: Done enqueueing message to 1 subscribers for message type Foundatio.Lock.CacheLockReleased, Foundatio - InMemoryMessageBus
56:59.834 D: Released lock: test (uhoLiJcnts5dI5W6) - CacheLockProvider
56:59.834 D: Acquired lock (O3AMARH32E61q77i) test after 0:00:00.1024995 - CacheLockProvider
56:59.834 T: Disposed lock (uhoLiJcnts5dI5W6) test - CacheLockProvider
56:59.834 D: Attempting to acquire lock (SHmUz8gq3wswz4cW): test - CacheLockProvider
56:59.834 D: Removing expired cache entry lock:test - InMemoryCacheClient
56:59.834 T: Added cache key: lock:test - InMemoryCacheClient

This led me to rework how maintenance removes expired items.

@niemyjski niemyjski added the bug label Mar 18, 2024
@niemyjski niemyjski self-assigned this Mar 18, 2024

lock (_lock)
using (await _lock.LockAsync().AnyContext())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


lock (_lock)
using (await _lock.LockAsync().AnyContext())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -83,16 +83,15 @@ public async Task<ILock> AcquireAsync(string resource, TimeSpan? timeUntilExpire
bool isTraceLogLevelEnabled = _logger.IsEnabled(LogLevel.Trace);
bool isDebugLogLevelEnabled = _logger.IsEnabled(LogLevel.Debug);
bool shouldWait = !cancellationToken.IsCancellationRequested;
if (isDebugLogLevelEnabled)
_logger.LogDebug("Attempting to acquire lock: {Resource}", resource);
string lockId = GenerateNewLockId();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this up so all lock messages can have LockId.

@@ -907,7 +940,7 @@ public void Dispose()
ItemExpired?.Dispose();
}

private class CacheEntry
private record CacheEntry
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this into a record, I had a hunch this could be causing an issue with ConcurrentDictionary.TryUpdate() as it uses a comparison value with IEquatable... https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentdictionary-2.tryupdate?view=net-8.0

@@ -884,13 +911,19 @@ private async Task DoMaintenanceAsync()

var utcNow = SystemClock.UtcNow.AddMilliseconds(50);

// Remove expired items and items that are infrequently accessed as they may be updated by add.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cache entries that are very recently modified, there might be a high churn rate and the cache value might get overridden via Add/Set. So don't touch it in the background clean up job for a little bit. This made the tests much more stable.


(string Key, long LastAccessTicks, long InstanceNumber) oldest = (null, Int64.MaxValue, 0);
foreach (var kvp in _memory)
{
if (kvp.Value.LastAccessTicks < oldest.LastAccessTicks
|| (kvp.Value.LastAccessTicks == oldest.LastAccessTicks && kvp.Value.InstanceNumber < oldest.InstanceNumber))
bool isExpired = kvp.Value.ExpiresAt < SystemClock.UtcNow;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimized this to grab the first expired one (and break), this doesn't take into account the frequency period like maintenance loop.


string expiredKey = null;
lock (_lock)
using (await _lock.LockAsync().AnyContext())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -825,9 +849,9 @@ public async Task SetExpirationAsync(string key, TimeSpan expiresIn)
}

Interlocked.Increment(ref _writes);
if (_memory.TryGetValue(key, out var value))
if (_memory.TryGetValue(key, out var existingEntry))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed them as there was like 3 or 4 different names for existing cache entries and it was driving me crazy while debugging.

if (cacheEntry.ExpiresAt < SystemClock.UtcNow)
if (existingEntry.ExpiresAt < SystemClock.UtcNow)
{
Interlocked.Increment(ref _misses);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a miss, not a hit.

if (addOnly)
{
if (!_memory.TryAdd(key, entry))
_memory.AddOrUpdate(key, entry, (_, existingEntry) =>
Copy link
Member Author

@niemyjski niemyjski Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted this to an AddOrUpdate as it can better handle concurrent attempts. Learned a few things from https://medium.com/gft-engineering/correctly-using-concurrentdictionarys-addorupdate-method-94b7b41719d6

Add call back can be hit multiple times, so just pass it in and update in the update value callback to reset wasUpdated.

return entry;
}
if (isTraceLogLevelEnabled)
_logger.LogTrace("Replacing expired cache key: {Key}", key);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logging to detect this, which was very helpful

Comment on lines +199 to +204
if (_memory.TryGetValue(key, out var existingEntry) && existingEntry.ExpiresAt < SystemClock.UtcNow)
{
if (_memory.TryRemove(key, out var removedEntry))
{
if (removedEntry.ExpiresAt >= SystemClock.UtcNow)
throw new Exception("Removed item was not expired");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actively removing expired items because this method wouldn't get inlined / executed synchronously all the time when this method was called. So concurrent Add/Set calls could update the expired value causing issues. I think this could be better reworked to have a single point for all removes.

{
RemoveExpiredKey(key);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just let maintenance remove the item to cut down on the number concurrent updates.

if (value.ExpiresAt >= SystemClock.UtcNow)
return Task.FromResult<TimeSpan?>(value.ExpiresAt.Subtract(SystemClock.UtcNow));

RemoveExpiredKey(key);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just let maintenance remove the item to cut down on the number concurrent updates.

@@ -189,26 +195,33 @@ public Task<int> RemoveByPrefixAsync(string prefix)

internal void RemoveExpiredKey(string key, bool sendNotification = true)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to only call this if we call a method with a past expiration and we return immediately. Otherwise just let maintenance remove the item to cut down on the number concurrent updates.

@niemyjski
Copy link
Member Author

niemyjski commented Mar 20, 2024

minimum-changes.patch

@ejsmith Here is the minimum changes required to fix this.

image image image image

@niemyjski niemyjski marked this pull request as ready for review March 20, 2024 21:22
Copy link
Contributor

@ejsmith ejsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just kind of skimming through and not seeing any issues. Will take your word and the tests passing as proof.

@niemyjski niemyjski merged commit da124a7 into master Mar 20, 2024
3 checks passed
@niemyjski niemyjski deleted the bug/cache-lock-provider-acquiring-mutliple-locks branch March 20, 2024 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants