Add timeout to Client RPC calls #11500
Conversation
Timeout can be reached when running many tests concurrently.
Log error only when failure can no longer be retried.
When an RPC with a blocking query is retried, the deadline should include the max query time, not just the remaining time in the initial query. As a result, a query may block up to twice as long as requested if any retries occur.
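To make that worst case concrete, here is a toy illustration (the numbers and the 2x bound are for illustration only, not taken from the PR's code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A blocking query asks to wait up to 10 minutes. If the first
	// attempt fails just before its deadline and the retry re-applies
	// the full query window (rather than only the remaining time), the
	// caller can block for up to ~2x the requested wait.
	maxQueryTime := 10 * time.Minute
	firstAttempt := maxQueryTime // times out at its deadline, then retried
	retryAttempt := maxQueryTime // deadline reset from the full window
	fmt.Println(firstAttempt + retryAttempt) // 20m0s worst case
}
```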
Thank you for working on this problem! Unfortunately I think this lack of read timeouts is going to be a very challenging thing to solve. I looked to see if we have an issue that describes the problem, but I haven't found one. I have not yet looked at this PR in much detail, but from a quick glance I think there are a number of challenges in fixing this problem because we have many paths for client connections. I believe the current one is Yamux, so we'd need support for timing out a multiplexed connection. There are a few older PRs open on the yamux repo that aim to add that support, but I'm not sure what state those are in. Adding something like this will require a lot of testing. I would still like to take the time to write up a better issue, but that will take some more time to research.
@dnephin thanks for taking a look!
Yes, the implementation refactors/reuses the logic previously added in #10299 / #8978, which already takes this into account. For blocking queries, the timeout is derived from the provided query options (see consul/agent/structs/structs.go, lines 303 to 317 at d6e2cbb).
Yamux already supports a global timeout on a multiplexed connection through its keepalive-ping feature (which immediately fails all in-flight RPCs going over the connection). More relevant here, Yamux also already supports timing out an individual stream (see lines 593 to 597 at d6e2cbb).
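For reference, a minimal sketch of a per-stream read deadline with github.com/hashicorp/yamux (the session setup, deadline value, and buffer size are placeholder choices, not the PR's code):

```go
import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// readWithDeadline opens one logical stream over an existing connection
// and applies a read deadline to that stream only.
func readWithDeadline(conn net.Conn) ([]byte, error) {
	session, err := yamux.Client(conn, yamux.DefaultConfig())
	if err != nil {
		return nil, err
	}
	stream, err := session.OpenStream()
	if err != nil {
		return nil, err
	}
	// A Read blocking past this deadline fails with a timeout error,
	// without tearing down the whole multiplexed session.
	if err := stream.SetReadDeadline(time.Now().Add(30 * time.Second)); err != nil {
		return nil, err
	}
	buf := make([]byte, 1024)
	n, err := stream.Read(buf)
	return buf[:n], err
}
```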
I didn't notice anything else that would be needed for this to work.
If it's helpful, see the links above.
After running this patch in production for a while, I observed two issues:
@wjordan thanks for working on this! I'll need to dedicate some time to look into all of the links you've shared. It is quite possible that my understanding is stale and maybe we can improve the timeout support. I'm hoping to look into this in more detail in the next couple of weeks.
Thank you for this PR and apologies for the delay in review. This PR touches critical code paths, so we'll try to get input from more members of our team.

In the meantime, I've left a couple of questions I'd like to see clarified.
agent/pool/pool.go (outdated):

```go
// TimeoutConn wraps net.Conn with a default timeout.
// readTimeout and writeTimeout apply to the subsequent operation.
```
Is this comment accurate? The `Read` and `Write` implementations look like `ReadTimeout` and `WriteTimeout` are applied first, zeroed out, and then `DefaultTimeout` is applied on subsequent operations.
The comment is intentional but could perhaps be worded more clearly. What I meant was that setting a read/write timeout only affects the very next operation, and then the timeout reverts to the default after that.
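A minimal sketch of those semantics, with field names inferred from the positional initializer shown later in this thread (treat the exact shape as an assumption):

```go
import (
	"net"
	"time"
)

// TimeoutConn wraps a net.Conn so that a non-zero ReadTimeout applies
// to the very next Read only, after which reads revert to DefaultTimeout.
type TimeoutConn struct {
	net.Conn
	DefaultTimeout time.Duration
	ReadTimeout    time.Duration // one-shot override for the next Read
	WriteTimeout   time.Duration // unused in the final PR; Write passes through
}

func (c *TimeoutConn) Read(b []byte) (int, error) {
	timeout := c.DefaultTimeout
	if c.ReadTimeout != 0 {
		timeout, c.ReadTimeout = c.ReadTimeout, 0 // consume the one-shot value
	}
	if timeout > 0 {
		if err := c.Conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(b)
}
```

With this shape, arming `ReadTimeout` before an RPC affects only the first response read; every later read on the stream falls back to `DefaultTimeout`.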
```diff
@@ -177,7 +177,7 @@ func testServerConfig(t *testing.T) (string, *Config) {
 	// TODO (slackpad) - We should be able to run all tests w/o this, but it
 	// looks like several depend on it.
-	config.RPCHoldTimeout = 5 * time.Second
+	config.RPCHoldTimeout = 10 * time.Second
```
Is this necessary for tests to pass?
Yes, I don't recall exactly (it's been a few months), but according to the commit message I left in cb4de4d, the 5-second `RPCHoldTimeout` was causing some timeouts when running many tests concurrently.
```go
// The default timeout for stream reads/writes
Timeout time.Duration
```
Could you help me understand the difference between `ConnPool.Timeout` vs. the timeout args that have been added to all the `rpc` methods?

I'm wondering if it's possible to achieve the same behaviour without adding an arg (which is defaulted to 0 in most cases) to an already long list of arguments to `rpc`.
`ConnPool.Timeout` ('The default timeout for stream reads/writes') is set from `RPCHoldTimeout` and represents the default timeout for any stream read/write when an explicit timeout for a specific read/write call is not provided. `RPCHoldTimeout` is an existing duration the server is tuned to expect round-trips to complete within, so it seemed like a reasonable timeout to use for most reads/writes going through this connection pool.

The `ConnPool.rpc` method sets a specific `timeout` for the first stream read after the query is sent, because the expected wait time for a (potentially blocking) query response can be much longer than the default round-trip timeout.
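Reusing the `TimeoutConn` sketch above, that flow might look roughly like this (the function name and shape are assumptions, not the PR's actual code):

```go
// callWithFirstReadTimeout arms a one-shot read timeout sized for a
// (potentially blocking) query, then performs the call; only the first
// response Read gets the extended deadline.
func callWithFirstReadTimeout(tc *TimeoutConn, queryTimeout time.Duration, call func() error) error {
	tc.ReadTimeout = queryTimeout // next Read only; later reads use DefaultTimeout
	return call()                 // encode the request, then block on the first Read
}
```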
> I'm wondering if it's possible to achieve the same behaviour without adding an arg (which is defaulted to 0 in most cases) to an already long list of arguments to rpc.

Thanks for the suggestion. Yes, dropping the `timeout` argument by moving the RPC-timeout calculation from `Client.RPC` to `ConnPool.rpc` requires `MaxQueryTime` and `DefaultQueryTime` to be passed to `ConnPool`, but I think it does end up a bit cleaner that way.
agent/pool/pool.go (outdated):

```diff
@@ -78,12 +119,14 @@ func (c *Conn) getClient() (*StreamClient, error) {
 		return nil, err
 	}

+	timeoutStream := &TimeoutConn{stream, c.pool.Timeout, 0, 0}
```
It would be more readable to explicitly define struct fields here.
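For example, with field names inferred from the positional initializer (an assumption about the actual struct definition):

```go
timeoutStream := &TimeoutConn{
	Conn:           stream,
	DefaultTimeout: c.pool.Timeout,
	ReadTimeout:    0, // zero-valued fields could simply be omitted
	WriteTimeout:   0,
}
```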
Do we ever set `WriteTimeout` anywhere else in this PR? Wondering if it's valuable to keep read/write timeouts separate (my naive assumption would be that the upper bound on size of reads and writes wouldn't vary enough to benefit from tuning them separately).
`WriteTimeout` isn't used anywhere, so I can remove `WriteTimeout` and just make `TimeoutConn.Write` pass through to `c.Conn.Write`.

The read timeout needs to be kept separate from writes, though, because the RPC timeout is read-specific: it applies to the first stream `Read` following the RPC request.
Other changes based on feedback
@kisunji thanks for the detailed review! Sorry for taking a while to respond. I've pushed a couple of commits that reduce the size of the PR diff based on your feedback and should hopefully address the helpful questions you raised.
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions. Co-authored-by: kisunji <[email protected]>
* Add timeout to Client RPC calls (#11500)

  Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

  Co-authored-by: kisunji <[email protected]>

* Use the maximum jitter when calculating the timeout

  The timeout should include the maximum possible jitter, since the server randomly adds a jitter to its own timeout. If the server's timeout is less than the client's timeout, the client will return an i/o deadline reached error.

  Before:

  ```
  time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
  rpc error making call: i/o deadline reached

  real	10m11.469s
  user	0m0.018s
  sys	0m0.023s
  ```

  After:

  ```
  time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
  [...]

  real	10m35.835s
  user	0m0.021s
  sys	0m0.021s
  ```

Co-authored-by: Will Jordan <[email protected]>
Co-authored-by: kisunji <[email protected]>
Co-authored-by: James Hartig <[email protected]>
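The resulting client-side deadline can be sketched as follows; the jitter fraction of 16 is an assumption about the server's blocking-query code, so treat the constant as illustrative:

```go
// clientRPCTimeout sizes the client's read deadline for a blocking
// query: the requested wait, plus the maximum jitter the server may
// add (assumed here to be maxQueryTime/16), plus hold-time padding.
func clientRPCTimeout(maxQueryTime, rpcHoldTimeout time.Duration) time.Duration {
	const jitterFraction = 16
	return maxQueryTime + maxQueryTime/jitterFraction + rpcHoldTimeout
}
```

For the `wait=600s` query above, that allows roughly 637.5s plus the hold timeout, consistent with the ~10m36s run completing instead of hitting the i/o deadline.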
Fixes #8504. Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.
See further discussion of this issue in #6616 (comment).

This PR reuses the timeout values calculated by the retry-timeout logic in #10299 / #8978, adding an extra `Timeout` function to the `RPCInfo` interface that returns the timeout duration, to supplement the `HasTimedOut` function that returns a boolean.
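As a sketch of the described interface change (the parameter lists are assumptions modeled on the thread's discussion of hold, max-query, and default-query times):

```go
import "time"

// RPCInfo is implemented by RPC request types; only the two methods
// relevant to this PR are shown.
type RPCInfo interface {
	// HasTimedOut reports whether the request, started at `start`,
	// has exceeded its allowed window (existing behavior).
	HasTimedOut(start time.Time, rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) (bool, error)

	// Timeout returns the duration the client should allow for the
	// call, complementing the boolean HasTimedOut (added by this PR).
	Timeout(rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) time.Duration
}
```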