- Author(s): Eric Anderson
- Approver: a11r
- Status: Implemented
- Implemented in: C, Java, Go
- Last updated: 2017-03-31
- Discussion at: https://groups.google.com/d/topic/grpc-io/Yc5_vSUJgwQ/discussion
Mobile has very poor network connectivity and TCP connection failures are common (commonly due to NATs). However, in the absence of writes the OS will not detect such failures and in the presence of writes detection can take many minutes.
L4 proxies may be configured to disconnect "idle" connections. Google Cloud load balancers disconnect apparently-idle connections after 10 minutes. AWS ELB disconnects after 60 seconds.
gRPC supports long-lived streams. A simple use-case is a long-lived stream where the server notifies clients of events, as an alternative to long polling. Thus a connection being idle on the network doesn't imply that no RPCs are outstanding.
We want gRPC to be reliable in these situations, which necessitates some form of keepalive mechanism. We want to support avoiding connection breaks and detecting them when they occur. But it is important to minimize accidental DDoS risk.
The problem exists for both clients and servers. Since the failures involve the connection between client and server being severed, it needs to be handled on each side. Because DDoS risk is limited for a server-side keepalive, server-side can have much higher latency when detecting dead connections, and server-side is more centrally configured, it is considered sufficiently different to have a separate design and is not discussed here further.
Because there may be many clients maintaining connections open to servers it is very easy to DDoS yourself with keepalives, or simply waste a lot of network and CPU. Although a particular keepalive is small and trivial to respond to, if you have 1 million phones sending a PING every 10 seconds, that is 100 thousand QPS for no work. Thus this feature should be used conservatively and in concert/deference to other more scalable methods (e.g., configuring the TCP connection to be closed on idle).
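The arithmetic above can be checked with a short sketch. The function name `keepalive_qps` is hypothetical, and the model assumes every client sends exactly one PING per keepalive period with no jitter.

```python
def keepalive_qps(num_clients: int, keepalive_time_s: float) -> float:
    """Aggregate PINGs per second if each client pings once per period."""
    if keepalive_time_s <= 0:
        raise ValueError("keepalive_time_s must be positive")
    return num_clients / keepalive_time_s


# One million phones pinging every 10 seconds.
print(keepalive_qps(1_000_000, 10))  # 100000.0
```

Raising the period to one minute cuts the same load by a factor of six, which is why the period matters more than the size of an individual PING.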
There is a related but separate concern called "health checking." It tends to exist at a higher level and usually concerns whether a service is healthy (vs a specific hop-by-hop connection). As such, keepalive and health checking have different failure models and are independent (keepalive failing implies nothing about health checking, and vice versa). However, in some common failure cases the two are correlated.
To be scalable, health checking should generally be centralized and interact with the name resolution or load balancing systems. However, such systems take time to develop and deploy and are generally unavailable on "day 1." Thus, this keepalive design includes support for being used as a "poor man's health checking." It should generally be viewed as a short-term solution, however.
TCP keepalive is a well-known method of maintaining connections and detecting broken connections. It is disabled by default, but when enabled, after an implementation-specific duration of inactivity (on the order of 1-2 hours, but sysadmin configurable) it will begin sending redundant packets and waiting for their ACK. If an ACK arrives, the connection is considered good. If there is no ACK after repeated attempts, the connection is deemed broken. Configuration of its variables is provided by the OS on a per-socket level, but is commonly not exposed to higher-level APIs.
TCP keepalive has three parameters to tune:
- time (time since last receipt before sending a keepalive),
- interval (interval between keepalives when not receiving reply), and
- retry (number of times to retry sending keepalives).
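As an illustration of where these three parameters live, the following minimal sketch sets them on a socket. The helper name `enable_tcp_keepalive` is hypothetical, the option names `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are the Linux ones and may be absent on other platforms (hence the `getattr` guard), and the default values shown are the usual Linux defaults rather than anything this design mandates.

```python
import socket


def enable_tcp_keepalive(sock, time_s=7200, interval_s=75, retry=9):
    """Enable TCP keepalive and, where the platform allows, tune it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning is platform-specific; skip options that are missing.
    for name, value in (
        ("TCP_KEEPIDLE", time_s),       # "time"
        ("TCP_KEEPINTVL", interval_s),  # "interval"
        ("TCP_KEEPCNT", retry),         # "retry"
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    enable_tcp_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```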
gRPC uses HTTP/2 which provides a mandatory PING frame that causes the receiver to immediately reply with a PING ACK. There are no semantics to PING other than the need to reply promptly. It can be used to estimate round-trip time, bandwidth-delay product, or test the connection. Chrome uses them for similar reasons as described here, but we are not aware of their precise usage. In passing I've heard they "sent a PING every 45 seconds." Some documentation suggests it may be more nuanced than that. Although PING support is mandatory in HTTP/2, it may be subject to abuse restrictions.
While gRPC implementations have tight control of the HTTP/2 stack, they must still interoperate with other implementations. When considering the design, recognize the keepalive may be happening between a gRPC client and a generic HTTP/2 proxy.
To be clear, the design does not require that service owners support keepalive. Client authors must coordinate with service owners for whether a particular client-side setting is acceptable. Service owners decide what they are willing to support, including whether they are willing to receive keepalives at all.
Implement an application-level keepalive conceptually based on TCP keepalive, using HTTP/2's PING. Interval and retry don't quite apply to PING because the transport is reliable, so they will be replaced with a single timeout (equivalent to interval * retry): the time that may elapse after sending a PING without receiving any bytes before the connection is declared dead.
Doing some form of keepalive is relatively straightforward. But avoiding DDoS is not as easy. Thus, avoiding DDoS is the most important part of the design. To mitigate DDoS the design:
- Disables keepalive for HTTP/2 connections with no outstanding streams, and
- Suggests that clients avoid configuring their keepalive much below one minute (see Server Enforcement section for additional details)
Most RPCs are unary with quick replies, so keepalive is less likely to be triggered. It would primarily be triggered when there is a long-lived RPC.
Since keepalive is not occurring on HTTP/2 connections without any streams, there will be a higher chance of failure for new RPCs following a long period of inactivity. To reduce the tail latency for these RPCs, it is important to not reset the keepalive time when a connection becomes active; if a new stream is created and there has been greater than 'keepalive time' since the last read byte, then a keepalive PING should be sent (ideally before the HEADERS frame). Doing so detects the broken connection with a latency of keepalive timeout instead of keepalive time + timeout.
Keepalive time is ideally measured from the time of the last byte read. However, simplistic implementations may choose to measure from the time of the last keepalive PING ack (a.k.a., polling). Such implementations should take extra precautions to avoid issues due to latency added by outbound buffers, such as limiting the outbound buffer size and using a larger keepalive timeout. Implementations must not measure from the last keepalive PING sent, to avoid triggering abuse detection on servers that are configured with settings that exactly match the behavior of the client.
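The timing rules of the basic design can be sketched as a small state tracker. This is a sketch under stated assumptions, not any implementation's actual code: the class `KeepaliveTracker` and all of its method names are hypothetical, the clock is injectable so the logic can be exercised without waiting, and "greater than keepalive time" is treated as "at least keepalive time".

```python
import time


class KeepaliveTracker:
    """Decides when to send a keepalive PING (hypothetical helper)."""

    def __init__(self, keepalive_time_s, keepalive_timeout_s,
                 clock=time.monotonic):
        self.keepalive_time_s = keepalive_time_s
        self.keepalive_timeout_s = keepalive_timeout_s
        self._clock = clock
        self._last_read = clock()   # measured from the last byte read
        self._ping_sent_at = None   # None means no PING is outstanding
        self.active_streams = 0

    def on_bytes_read(self):
        # Any inbound bytes show the connection is alive.
        self._last_read = self._clock()
        self._ping_sent_at = None

    def on_stream_created(self):
        # The timer is deliberately NOT reset here, so a stream opened
        # after a long idle period triggers a PING right away.
        self.active_streams += 1
        return self.should_send_ping()

    def on_stream_closed(self):
        self.active_streams -= 1

    def should_send_ping(self):
        # Basic design: no keepalive without outstanding streams.
        if self.active_streams == 0 or self._ping_sent_at is not None:
            return False
        return self._clock() - self._last_read >= self.keepalive_time_s

    def on_ping_sent(self):
        self._ping_sent_at = self._clock()

    def is_dead(self):
        # Dead if no bytes arrived within the timeout after the PING.
        return (self._ping_sent_at is not None and
                self._clock() - self._ping_sent_at >= self.keepalive_timeout_s)
```

A transport would call `should_send_ping()` from a timer and `is_dead()` after sending a PING; note that the keepalive time is never measured from `on_ping_sent()`.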
As an optional optimization, when keepalive timeout is exceeded, don't kill the connection. Instead, start a new connection. If the new connection becomes ready and the old connection still hasn't received any bytes, then kill the old connection. If the old connection wins the race, then kill the new connection mid-startup.
The keepalive time and keepalive timeout are expected to be application-configurable options, with at least second precision. Applications should ensure keepalive timeout is at least multiple times the round-trip time to allow for lost packets and TCP retransmits. It may also need to be higher to account for long garbage collector pauses. Since networks are not static, implementations are permitted to adjust the timeout based on network latency.
When a client receives a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings", it should log the occurrence at a log level that is enabled by default and double the configured KEEPALIVE_TIME used for new connections on that channel.
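The client reaction to this GOAWAY can be sketched as follows. The function name `keepalive_time_after_goaway` is hypothetical; `0xB` is the HTTP/2 error code value for ENHANCE_YOUR_CALM, and WARNING is used because it is enabled by default in Python's `logging` module.

```python
import logging

ENHANCE_YOUR_CALM = 0xB             # HTTP/2 error code
TOO_MANY_PINGS = b"too_many_pings"  # ASCII debug data


def keepalive_time_after_goaway(error_code, debug_data, keepalive_time_s):
    """Return the KEEPALIVE_TIME to use for new connections on the channel."""
    if error_code == ENHANCE_YOUR_CALM and debug_data == TOO_MANY_PINGS:
        doubled = keepalive_time_s * 2
        logging.warning(
            "Server sent GOAWAY too_many_pings; raising keepalive time "
            "from %ss to %ss", keepalive_time_s, doubled)
        return doubled
    return keepalive_time_s
```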
Service failures are more frequent than TCP failures, so detection delay needs to be decreased in order to reduce the number of RPCs affected by undiscovered failures.
Thus we make these changes from the basic keepalive:
- ~~Disables keepalive for HTTP/2 connections with no outstanding streams, and~~
- ~~Suggests~~ Restricts clients to avoid configuring their keepalive below ~~one minute~~ ten seconds
That is, "keepalive" may continue even without any outstanding streams and the minimum is reduced to 10 seconds but also enforced on client-side.
Clients will have three channel settings for configuration:
- KEEPALIVE_TIME, defaulting to infinite. If smaller than ten seconds, ten seconds will be used instead.
- KEEPALIVE_TIMEOUT, defaulting to 20 seconds
- KEEPALIVE_WITHOUT_CALLS, defaulting to false
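The three client settings and the ten-second floor can be modelled with a small sketch. The class `ClientKeepaliveSettings`, its field names, and the method `effective_keepalive_time_s` are hypothetical and do not correspond to any gRPC language API.

```python
import math
from dataclasses import dataclass

MIN_KEEPALIVE_TIME_S = 10.0  # client-side enforced minimum


@dataclass(frozen=True)
class ClientKeepaliveSettings:
    keepalive_time_s: float = math.inf     # KEEPALIVE_TIME (off by default)
    keepalive_timeout_s: float = 20.0      # KEEPALIVE_TIMEOUT
    keepalive_without_calls: bool = False  # KEEPALIVE_WITHOUT_CALLS

    def effective_keepalive_time_s(self) -> float:
        # Values below ten seconds are silently raised to ten seconds.
        return max(self.keepalive_time_s, MIN_KEEPALIVE_TIME_S)


print(ClientKeepaliveSettings(keepalive_time_s=1).effective_keepalive_time_s())  # 10.0
```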
Servers need to respond to misbehaving clients by sending GOAWAY with error code ENHANCE_YOUR_CALM and additional debug data of ASCII "too_many_pings" followed by immediately closing the connection. Immediately closing the connection fails any in-progress RPCs which increases the chance of the client author detecting the misconfiguration.
Servers will have two settings for enforcement:
- PERMIT_KEEPALIVE_TIME, defaulting to 5 minutes
- PERMIT_KEEPALIVE_WITHOUT_CALLS, defaulting to false
PINGs have other uses than keepalive, like measuring latency and network bandwidth, thus servers should permit receiving a PING after sending HEADERS and DATA frames and generally support some "fuzziness" to their usage.
Suggested algorithm:
- Initialize MAX_PING_STRIKES = 2; ping_strikes = 0
- Initialize last_valid_ping_time = epoch
  - The clock used need not be precise; one second is enough precision, although millisecond precision would be encouraged as future-proofing. But it should be accurate for measuring durations; a monotonic clock should be used
- When a PING frame is received:
  - If active_streams == 0 && PERMIT_KEEPALIVE_WITHOUT_CALLS == false, verify last_valid_ping_time + 2 hours <= now()
  - Otherwise verify last_valid_ping_time + PERMIT_KEEPALIVE_TIME <= now()
  - If either verification failed, ping_strikes++
    - If ping_strikes > MAX_PING_STRIKES, client is misbehaving
  - Otherwise last_valid_ping_time = now()
- When a HEADERS or DATA frame is sent, set last_valid_ping_time = epoch; ping_strikes = 0
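The suggested algorithm can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the class `PingEnforcer` and its method names are hypothetical, the clock is injectable for testing, and `None` stands in for "epoch" because a monotonic clock has no meaningful zero point.

```python
import time

TWO_HOURS_S = 2 * 60 * 60
MAX_PING_STRIKES = 2


class PingEnforcer:
    """Per-connection server-side PING policing (hypothetical helper)."""

    def __init__(self, permit_keepalive_time_s=300.0,
                 permit_keepalive_without_calls=False,
                 clock=time.monotonic):
        self.permit_keepalive_time_s = permit_keepalive_time_s
        self.permit_keepalive_without_calls = permit_keepalive_without_calls
        self._clock = clock
        self._last_valid_ping_time = None  # None plays the role of "epoch"
        self.ping_strikes = 0
        self.active_streams = 0

    def on_ping_received(self) -> bool:
        """Return True if the client is now considered misbehaving."""
        if self.active_streams == 0 and not self.permit_keepalive_without_calls:
            min_interval = TWO_HOURS_S
        else:
            min_interval = self.permit_keepalive_time_s
        now = self._clock()
        last = self._last_valid_ping_time
        if last is not None and last + min_interval > now:
            # Verification failed: the PING arrived too soon.
            self.ping_strikes += 1
            return self.ping_strikes > MAX_PING_STRIKES
        self._last_valid_ping_time = now
        return False

    def on_headers_or_data_sent(self):
        # Sending HEADERS or DATA earns the client a fresh allowance.
        self._last_valid_ping_time = None
        self.ping_strikes = 0
```

When `on_ping_received()` returns True, the server would send the GOAWAY described above and close the connection.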
The "2 hours" restricts the number of PINGs to that of an implementation equivalent to TCP Keep-Alive, whose interval is specified to default to no less than two hours.
To allow changing clients to be more aggressive in the future, server responses should include the Server header (analogous to User-Agent) that includes its version. (TODO: replace with SETTINGS-based version)
TCP keepalive is hard to configure in Java and Go. Enabling is easy, but one hour is far too infrequent to be useful; an application-level keepalive seems beneficial for configuration.
TCP keepalive is active even if there are no open streams. This wastes a substantial amount of battery on mobile; an application-level keepalive seems beneficial for optimization.
Application-level keepalive implies HTTP/2 PING.
Supporting health checking unfortunately complicates the design noticeably. Without health checking it would be feasible to ignore server-side enforcement. But it is needed by enough users (even temporarily) that supporting it appears worthwhile.
There are no known generally available methods usable to gRPC for detecting abusive usage of TCP keepalive.
- C. TODO get current progress; close, if not done. Being done by @y-zeng
- Java. Basic Keepalive has been available in OkHttp transport since 1.0 by @zsurocking in grpc/grpc-java#1992. Full spec completed in OkHttp and Netty since grpc-java v1.3.0 or grpc/grpc-java@393ebf7c
- Go. TODO get current progress; close, if not done. Being done by @MakMukhi