Adaptive queue for staging dials #237
This patch introduces an adaptive dial queue that spawns a dynamically sized set of goroutines to preemptively stage dials for later handoff to the DHT protocol for RPC. It identifies backpressure on both ends (dial consumers and dial producers), and takes compensating action by adjusting the worker pool.

We start with `DialQueueMinParallelism` number of workers (6), and scale up and down based on demand and supply of dialled peers.

The following events trigger scaling:

- we scale up when we can't immediately return a successful dial to a new consumer.
- we scale down when we've been idle for a while waiting for new dial attempts.
- we scale down when we complete a dial and realise nobody was waiting for it.

Dialler throttling (e.g. FD limit exceeded) is a concern, as we can easily spin up more workers to compensate, and end up adding fuel to the fire. Since we have no deterministic way to detect this for now, we hard-limit concurrency to `DialQueueMaxParallelism` (20).
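To make the scaling rules concrete, here is a minimal sketch of the policy alone, not the code in this patch. The `scaler` type, the `event` names, the step size of one worker per event, and the use of the minimum as a floor are all assumptions made for illustration; only the two limits (6 and 20) come from the description above. The goroutine spawning and dial handoff are left out on purpose.

```go
package main

import (
	"fmt"
	"sync"
)

// Limits quoted in the description above.
const (
	minParallelism = 6  // DialQueueMinParallelism
	maxParallelism = 20 // DialQueueMaxParallelism
)

// event names the three scaling triggers (hypothetical names).
type event int

const (
	consumerHadToWait event = iota // no staged dial was ready for a new consumer
	workerIdleTimeout              // a worker sat idle waiting for new dial attempts
	dialUnclaimed                  // a dial completed and nobody was waiting for it
)

// scaler tracks the target worker count. Hypothetical helper, not the patch's type.
type scaler struct {
	mu      sync.Mutex
	workers int
}

func newScaler() *scaler { return &scaler{workers: minParallelism} }

// handle applies one event and returns the new target worker count.
// Assumption: one worker per event, clamped to [minParallelism, maxParallelism].
func (s *scaler) handle(e event) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch e {
	case consumerHadToWait:
		if s.workers < maxParallelism {
			s.workers++
		}
	case workerIdleTimeout, dialUnclaimed:
		if s.workers > minParallelism {
			s.workers--
		}
	}
	return s.workers
}

func main() {
	s := newScaler()
	// Sustained consumer demand: the pool grows but never past the hard cap.
	for i := 0; i < 30; i++ {
		s.handle(consumerHadToWait)
	}
	fmt.Println(s.handle(consumerHadToWait)) // prints 20
	// Demand dries up: the pool shrinks back towards the starting size.
	for i := 0; i < 30; i++ {
		s.handle(workerIdleTimeout)
	}
	fmt.Println(s.handle(dialUnclaimed)) // prints 6
}
```

The hard cap is what guards against the throttling feedback loop: even if every consumer has to wait, the target never exceeds 20.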
Very nice and clean.
Future optimisation: cancelling pending dials to worse nodes as we find closer nodes to the target. EDIT: in practice, this is complex, because those theoretically better nodes may never respond, and we would've stopped making progress. The algorithm would have to compensate by backtracking and replaying those dials. Quite a dance.
Addressed the review comments, but I noticed a flaky test on CI along the way. I do deplore depending on time, but I cannot think of another way to test this.
4ab14f9 to 74d22f3
@Stebalien – up for re-review. I ended up changing the waiting mechanism to a slice, like we discussed in comments.
Currently the DHT is performing dials outside of the Alpha concurrency limit. We are dialling all nodes that peers return in `CloserPeers` without limit. As a result, we end up flooding the swarm with dial jobs, which trips over the file descriptor limits, and brings dialling to a halt under some circumstances. Our current approach is also algorithmically incorrect, and leads to suboptimal query patterns.

This patch introduces an adaptive dial queue that spawns a dynamically sized set of goroutines to preemptively stage dials for later handoff to the DHT protocol for RPC. It identifies backpressure on both ends (dial consumers and dial producers), and takes compensating action by adjusting the worker pool.
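We start with `DialQueueMinParallelism` number of workers (6), and scale up and down based on demand and supply of dialled peers.

The following events trigger scaling:

- we scale up when we can't immediately return a successful dial to a new consumer.
- we scale down when we've been idle for a while waiting for new dial attempts.
- we scale down when we complete a dial and realise nobody was waiting for it.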
Dialler throttling (e.g. FD limit exceeded) is a concern, as we can easily spin up more workers to compensate, and end up adding fuel to the fire. Since we have no deterministic way to detect this for now, we hard-limit concurrency to `DialQueueMaxParallelism` (20).

Testing this patch in a production mirror reduced dial backlog considerably, and showed the adaptiveness in action: