RFC: Use QUIC for RPC transport #23848
Comments
A specific concern regarding UDP mangling on the internet: I'm pretty sure that fly.io isn't the only company that was/is operating a single Nomad control plane for a global cluster over the public internet (proof: https://fly.io/blog/carving-the-scheduler-out-of-our-orchestrator/).
Another nail in QUIC's coffin: https://dl.acm.org/doi/10.1145/3589334.3645323
Paper here as well: https://arxiv.org/abs/2310.09423
A great study and concerning for sure, but I'm unconvinced this is a showstopper for Nomad's use case. I have never seen a bandwidth-constrained Nomad server. I rarely observe CPU-constrained Nomad servers, although it does happen. Disk IO (both latency and throughput) is almost always the limiting factor in Nomad server (and therefore scheduling) performance.
This is not to dismiss the concerns raised by this research! The CPU usage in particular is concerning and would warrant careful testing. There's some good discussion on HN with folks from Tailscale who have extensive experience optimizing QUIC.
Given the widespread attention on QUIC from every major web player (all the browsers, Cloudflare, etc), I think performance regressions will either be short-lived, or cause widespread pullback from QUIC adoption. This is a long way of saying: we will intentionally adopt QUIC extremely slowly, if ever, to avoid regressions like those in the linked paper.
Unfinished PoC
I started a PoC here, but yamux is everywhere throughout Nomad's RPC due to our fairly sophisticated stream and node RPC infrastructure: https://github.com/schmichael/nomad/tree/f-quic
I do think the idea of being able to ship QUIC on the same port as the existing TCP RPC would work, so that remains an appealing characteristic of QUIC as a Yamux alternative. There are also quite a few legacy code paths QUIC wouldn't have to implement at all.
However, I'm unsure how to implement Nomad's mTLS name validation logic with QUIC since QUIC does not seem to expose the peer's certificates on a per-connection basis. This could mean having to implement something like Node Identities for RPC auth before we could adopt QUIC.
Background
Nomad, like Consul, uses yamux for its RPC layer's underlying network transport. Yamux is based on SPDY. SPDY has been obsolete since 2015, although its ideas form the basis of HTTP/2's transport layer.
Yamux has proven powerful and reliable, needing and receiving very little maintenance over its 10-year lifespan. However, this means that there's very little expertise in the codebase when issues do arise, and the code does not adhere to modern Go idioms.
Proposal
Replace Nomad's use of Yamux with QUIC. QUIC is the basis for HTTP/3, but unlike SPDY+HTTP/2, QUIC is being intentionally standardized independently (RFC 9000), and is being proposed for more widespread use such as DNS-over-QUIC (RFC 9250).
UDP
QUIC is based on UDP instead of TCP which poses both an opportunity and risk for Nomad:
This does allow Nomad to add QUIC support at any time and implement an IPv6 Happy Eyeballs style algorithm for determining whether to use the TCP/Yamux or UDP/QUIC transport.
TLS
QUIC mandates TLS. This would require Nomad to mandate TLS and would pose a significant upgrade hurdle. Implementing something like Consul's auto config would be necessary to ease the transition, although there's likely no way to upgrade to TLS without forcing some user intervention.
Go Implementations
QUIC is not officially supported by the Go standard library as of Go 1.23. The crypto/tls package exposes some QUIC internals but is not intended for direct use. golang/go#58547 tracks QUIC's inclusion in Go's stdlib. golang.org/x/net contains the WIP implementation that is intended to be the basis of Go's future HTTP/3 support.
Multiple third party QUIC implementations exist as well, although quic-go seems like the dominant implementation.
The proposed choice for Nomad would be to use a stdlib implementation to ensure the widest compatibility and most support.
Alternative: libp2p
libp2p forked yamux and has done quite a bit more maintenance. Switching to or merging their fork is a far less significant change than switching protocols.
Alternative: HTTP/3
Instead of switching yamux->quic, Nomad could switch from rpc->http/3. This could entail dropping the entire RPC subsystem (which itself is quite antiquated and lacks basic features such as context cancellation). All RPCs without a corresponding HTTP API would need to have an HTTP API implemented. Raft currently uses its own TCP connection and would need special consideration when moving to HTTP.
This would be a huge undertaking, and there's no reason to do it at the same time as moving from yamux->quic. Upgrading our RPC implementation can be done independently of choosing an underlying transport.
Roadmap
There is no roadmap for implementing QUIC in Nomad.
Please leave feedback in the form of: