Problem
It can be hard to get a sense of how incorrect the cache normally is (e.g. from cache invalidations being lost due to network issues), or to notice when application bugs (e.g. writes that don't trigger after_commit) or unknown IdentityCache bugs make it worse.
Proposal
Add support for checking a percentage of cache hits for correctness against the database. The results could be exposed with ActiveSupport::Notifications.instrument, which could be used to track a correctness ratio over time to catch regressions as they are introduced; splitting the ratio by cache index would also help spot bugs that affect only a subset of cache indexes.
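As a rough sketch, sampled verification could hook into the cache-hit path and publish a notification per check. The sample rate, helper names, and event name below are illustrative placeholders, not existing IdentityCache API:

```ruby
# Sketch only: none of these names exist in IdentityCache today.
module CacheHitVerification
  SAMPLE_RATE = 0.01 # verify roughly 1% of cache hits

  def verify_sampled_cache_hit(cache_index, cached_value)
    return unless rand < SAMPLE_RATE

    correct = cached_value == load_and_serialize_from_database(cache_index)
    ActiveSupport::Notifications.instrument(
      "cache_correctness.identity_cache",
      cache_index: cache_index,
      correct: correct
    )
  end
end

# Example consumer: build a correctness ratio over time, split by cache index.
ActiveSupport::Notifications.subscribe("cache_correctness.identity_cache") do |_name, _start, _finish, _id, payload|
  Rails.logger.info("cache_correctness index=#{payload[:cache_index]} correct=#{payload[:correct]}")
end
```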
The data loaded from the database can be serialized and compared to the serialized data fetched from the cache. If they differ, we can attempt a CAS set of the data loaded from the database in order to correct the cached value.
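For example, assuming a memcached client with a dalli-style cas(key) { |current| new_value } (which returns false when another writer changed the value and nil when the key is gone), the verification step might look roughly like this; the serializer and helper names are placeholders:

```ruby
# Hypothetical sketch: compare the cached blob with a freshly loaded record
# and try to correct the cache with a compare-and-swap.
def verify_and_correct(key, cached_blob, db_record)
  db_blob = Marshal.dump(db_record) # stand-in for IdentityCache's serializer
  return :correct if db_blob == cached_blob

  # Try to replace the stale value with the freshly loaded one. A false
  # return means another writer (e.g. a cache invalidation) changed the key
  # between the read and the write, so the mismatch may just be a race
  # rather than a real inconsistency.
  case cache_client.cas(key) { |_current| db_blob }
  when false then :possible_race
  when nil   then :evicted   # key disappeared before we could correct it
  else            :corrected
  end
end
```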
Cache invalidations aren't done atomically with the database write, so we should aim to reduce false positives. Detecting CAS set conflicts when correcting the cached value is one way to do this, but it is still possible for a recent database write to be loaded and for its cache invalidation to complete after the cached value is "corrected". As such, we should track the maximum updated_at timestamp across the cached rows to estimate the age of the database data, which can then be used to exclude recently written data when looking for incorrectness caused by missing cache invalidations. Note that these timestamps are affected by clock skew and are set at write time within the transaction rather than at commit time. An appropriate threshold for excluding recently written data should be at least the sum of the maximum expected clock skew, the maximum transaction duration, and the maximum expected delay between the transaction committing and the cache being invalidated; for simplicity we might just conservatively use the hard timeout duration for web requests & jobs.
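The threshold could then be assembled along these lines; the durations are placeholders (using ActiveSupport's duration helpers), and in practice could simply be the request/job hard timeout:

```ruby
# Placeholder durations; tune or replace with the hard timeout for web
# requests & jobs.
MAX_CLOCK_SKEW           = 5.seconds
MAX_TRANSACTION_DURATION = 30.seconds
MAX_INVALIDATION_DELAY   = 5.seconds
RECENT_WRITE_THRESHOLD   = MAX_CLOCK_SKEW + MAX_TRANSACTION_DURATION + MAX_INVALIDATION_DELAY

# Skip mismatches on rows written too recently for their invalidation to have
# reliably reached the cache.
def too_recent_to_verify?(db_rows)
  max_updated_at = db_rows.map(&:updated_at).compact.max
  max_updated_at && max_updated_at > RECENT_WRITE_THRESHOLD.ago
end
```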