Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent NONEEDEDGLUE errors and add cache statistics #456

Merged
merged 22 commits into from
Sep 30, 2024

Conversation

phillip-stephens
Copy link
Contributor

@phillip-stephens phillip-stephens commented Sep 27, 2024

Description

Prior to this change, we were adding Authorities and Additionals as separate records in the cache. This meant that the Authorities was inserted as an NS record, but the Additionals were inserted as separate A/AAAA records. Separately, our cache is an LRU cache with 4096 shards, 2 entries per shard by default. What this means is that we have no guarantees that all of a domain's Authorities + Additionals are in the cache, since a partial set can be evicted.

This loss of accuracy was what caused the NONEEDEDGLUE records to pop up, especially as more domains in the same run.

Changes

  • Add CacheStats that capture cache hits, misses, evictions with --verbosity=5. No performance impact if --verbosity < 5
    • Ex: time="2024-09-27T20:28:52Z" level=debug msg="Cache statistics: hits=29870 misses=85581 adds=30790 ejects=21604 hitRate=25.872448% missRate=74.127552%"
  • Switched to using a single SingleQueryResult pointer vs. returning the struct since each function would have to create a new structure and copy the internals. The pointer avoids this.
  • Re-wrote cache so that an entire Authority record or `Answer record are stored atomically
    • Authority records store the authority's for a certain domain - .com
    • Answer records store the authoritative answers for a certain query - A google.com

Performance

Tested with 10000 top domains, A --iterative --threads=100 --verbosity=5

main

Runtime - 37.2 s
Allocs - 2199 MB total allocations (this is not peak memory use since the Garbage Collector cleans up un-used memory, this is just total allocated objects)
NOERROR - 9919
TIMEOUT/ITERATIVE_TIMEOUT - 5
NXDOMAIN - 54

Branch

Runtime - 33.9 s
Allocs - 1216 MB
NOERROR - 9916
TIMEOUT/ITERATIVE_TIMEOUT - 9
NXDOMAIN - 55

Main takeaway is there was a sharp reduction in allocations, however the runtime reduction wasn't that large (and likely within the margin of error, other A/B tests didn't show such a large delta), so it seems garbage collection is not our bottleneck.

Either way, I think this is a needed change for cache use.

@phillip-stephens phillip-stephens marked this pull request as ready for review September 27, 2024 21:21
@phillip-stephens phillip-stephens requested a review from a team as a code owner September 27, 2024 21:21
@@ -530,6 +533,7 @@ func Run(gc CLIConf) {
close(outChan)
close(metaChan)
routineWG.Wait()
resolverConfig.Cache.Stats.PrintStatistics()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move into metadata

@phillip-stephens phillip-stephens merged commit c9f92e1 into main Sep 30, 2024
3 checks passed
@phillip-stephens phillip-stephens deleted the phillip/fix-cache-lookups-iterative branch September 30, 2024 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants