added context check in iterativeLookup, solves cache contention issue #407
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Resolves #404
Description
More details in #404, but in short
main
saw ZDNS slow to a crawl andpprof
trace of goroutines showed most were waiting on a cache shard lock. I usedgit bisect
to find the first bad version and it fortunately wasn't the large CLI + library split PR (c192044) but a much smaller commit's change-set.Turned out, the first bad version was 67026d, where I tried to improve timeout handling. Specifically, here where I removed the timeout check in
cachedRetryingLookup
. I believe what was happening was that as long as we never went to the wire (where we do check the context again)but stayed just hitting the cache, this execution loop would never time out and the thread would be stuck:By adding a check on the context in
iterativeLookup
, we can break the cycle.Testing
Command -
head -n 7000 10k_crux.input | ./zdns A --iterative --threads=100 --timeout=20 --iteration-timeout=20 --output-file="out.log
Time to Crawl 7k Domains
v1.1.0
: - 1 min. 50 sec.main
: - > 10 mins, gave up at this point. There were 50/7k domains yet to be resolvede5e2d8
- This PR: - 1 min 26 secs