
feat: Add a cache to the cache #1296

Merged
merged 5 commits into main on Aug 22, 2024

Conversation

kentquirk
Contributor

Which problem is this PR solving?

We have a customer whose traffic is sometimes so high that the cuckoo drop cache can't record the dropped IDs as fast as they're being dropped, so it overruns its input queue.

This is a frustrating single point of failure: a single goroutine is responsible for filling this cache, so adding CPUs won't help, and because of trace locality, adding more nodes won't help either when a burst of spans comes from a single giant trace. Making the queue larger just means it takes a little longer to fill up.

The contention comes from the access pattern: we write to the cache when we drop a trace, but we read from it for every span that arrives. So with a single huge trace, you might fairly quickly decide to drop it, but still have to query the cache tens of thousands of times as new spans arrive. The cuckoo cache is pretty fast, but we can make it faster.
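To make the bottleneck concrete, here is a schematic of the pattern described above (the names and structure are illustrative, not Refinery's actual code): drops go through a buffered channel drained by a single goroutine, while every arriving span does a locked lookup. Under a burst from one giant trace, lookups dominate, the writer falls behind, and a full queue loses drop records.

```go
package main

import "sync"

// dropCache is a hypothetical sketch of the single-writer pattern:
// one goroutine drains the queue into the set, so extra CPUs can't
// speed up writes, while reads happen once per arriving span.
type dropCache struct {
	mut     sync.RWMutex
	dropped map[string]struct{}
	queue   chan string
	done    chan struct{}
}

func newDropCache(queueSize int) *dropCache {
	c := &dropCache{
		dropped: make(map[string]struct{}),
		queue:   make(chan string, queueSize),
		done:    make(chan struct{}),
	}
	go c.drain() // the single writer goroutine
	return c
}

func (c *dropCache) drain() {
	for id := range c.queue {
		c.mut.Lock()
		c.dropped[id] = struct{}{}
		c.mut.Unlock()
	}
	close(c.done)
}

// Drop enqueues a just-dropped trace ID; if the queue is full, the
// record is lost -- the failure mode this PR addresses.
func (c *dropCache) Drop(id string) bool {
	select {
	case c.queue <- id:
		return true
	default:
		return false // queue overrun
	}
}

// Check is called for every arriving span, contending with drain's writes.
func (c *dropCache) Check(id string) bool {
	c.mut.RLock()
	defer c.mut.RUnlock()
	_, ok := c.dropped[id]
	return ok
}

// Close stops accepting drops and waits for the drain goroutine to finish.
func (c *dropCache) Close() {
	close(c.queue)
	<-c.done
}
```

Note that making `queue` larger only delays the overrun; it doesn't remove the single-writer ceiling.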

Short description of the changes

  • Add a cache in front of the cache (a Set with a TTL of 3 seconds) that buffers only the most recently dropped trace IDs; we check it before we check the cuckoo cache. This set responds quite a bit faster (at least 4x) than the cuckoo cache, and, importantly, it also prevents lock contention on the cuckoo cache, speeding up the cache writes.
  • Tweak the logic for draining the write queue; it benchmarks faster this way.
  • Move the metrics timer inside the lock so we're not measuring the waiting time.
  • The function genID, used by another benchmark and which I also wanted to use here, was broken, so I fixed it.
  • Added a couple of benchmarks I used to prove to myself that the Set was fast enough.

[YO DAWG meme image]

@kentquirk kentquirk requested a review from a team as a code owner August 22, 2024 18:07
@kentquirk kentquirk self-assigned this Aug 22, 2024
@kentquirk kentquirk added this to the v2.8 milestone Aug 22, 2024
Contributor

@VinozzZ VinozzZ left a comment


I also noticed a little bug in generics.SetWithTTL where it mixes time.Now and clockwork.Now. Do you mind fixing it in this PR, since we're using SetWithTTL?

generics/setttl_test.go
collect/cache/cuckooSentCache.go
@kentquirk kentquirk requested a review from VinozzZ August 22, 2024 19:44
@kentquirk kentquirk merged commit 9f5ea80 into main Aug 22, 2024
5 checks passed
@kentquirk kentquirk deleted the kent.cache_cache branch August 22, 2024 19:49
@cartermp
Member

This is a great PR.
