
client: Implement gRFC A6: configurable client-side retry support #2111

Merged
merged 20 commits · Jun 27, 2018

Conversation

dfawley
Member

@dfawley dfawley commented May 31, 2018


@dfawley dfawley added the Type: Feature New features or improvements in behavior label May 31, 2018
@dfawley dfawley added this to the 1.13 Release milestone May 31, 2018
@dfawley dfawley requested a review from menghanl May 31, 2018 17:17
@menghanl
Contributor

menghanl commented Jun 8, 2018

Still looking at main retry logic in stream.go.


Reviewed 6 of 10 files at r1, 1 of 2 files at r2.
Review status: 7 of 10 files reviewed at latest revision, all discussions resolved.


clientconn.go, line 445 at r2 (raw file):

}

// WithEnableRetry returns a DialOption that enables retries if the service

How about moving this to an environment variable instead?
The advantage would be that no code change will be needed when we enable retry in the future.


clientconn.go, line 1054 at r2 (raw file):

	if sc.RetryThrottling != nil {
		cc.retryMu.Lock()
		cc.retryTokens = sc.RetryThrottling.MaxTokens

There are things similar to this done in handleServiceConfig.

Move those to this function, too? Or add TODO?


clientconn.go, line 1633 at r2 (raw file):

}

// throttleRetry subtracts a retry token from the pool and returns whether a

Make a util struct for retryTokens and the following functions?


service_config.go, line 100 at r2 (raw file):

	// If token_count is less than or equal to maxTokens / 2, then RPCs will not
	// be retried and hedged RPCs will not be sent.
	RetryThrottling *RetryThrottlingPolicy

unexport this field. The only reason to export fields in this struct is to use WithServiceConfig, but we don't want users to use that.


service_config.go, line 106 at r2 (raw file):

// service config here:
// https://github.com/grpc/proposal/blob/master/A6-client-retries.md#integration-with-service-config
type RetryPolicy struct {

This type doesn't need to be exported, either. Unless we want to provide another way to config retry other than service config.


stream.go, line 433 at r2 (raw file):

	// TODO(retry): Move down if the spec changes to not check code before
	// considering this a failure.
	pushback := 0

Not sure if this is what the TODO is talking about.

This variable is not used until line 463. Move this down after the ifs?


stream.go, line 439 at r2 (raw file):

		var e error
		if pushback, e = strconv.Atoi(sps[0]); e != nil || pushback < 0 {
			grpclog.Infof("Server retry pushback specified to abort (%q).", sps[0])

This is inaccurate if e != nil.


stream.go, line 468 at r2 (raw file):

		max := float64(rp.MaxBackoff)
		cur := float64(rp.InitialBackoff)
		for i := 0; i < cs.numRetriesSincePushback; i++ {

To avoid the for loop, how about comparing
cs.numRetriesSincePushback with
log_multiplier(max / init)?

The latter value can be calculated every time service config gets updated.
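The bound suggested here can be precomputed whenever the service config updates, as in this sketch; once the attempt count reaches it, the backoff is pinned at max and no exponentiation (or loop) is needed. The function name is illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// maxExponent computes log_multiplier(max/initial): the number of multiplier
// steps after which exponential backoff saturates at the max backoff.
func maxExponent(initial, max, multiplier float64) float64 {
	return math.Log(max/initial) / math.Log(multiplier)
}

func main() {
	// initial=0.1s, max=1s, multiplier=2: backoff saturates after ~3.32 doublings.
	fmt.Println(maxExponent(0.1, 1, 2))
}
```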


Comments from Reviewable

@dfawley dfawley force-pushed the attempt_retry_2 branch 2 times, most recently from 515f892 to 9bd5580 Compare June 12, 2018 19:35
@dfawley dfawley closed this Jun 12, 2018
@dfawley dfawley reopened this Jun 12, 2018
@dfawley
Member Author

dfawley commented Jun 12, 2018

Review status: 7 of 10 files reviewed, 8 unresolved discussions (waiting on @menghanl)


clientconn.go, line 445 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

How about moving this to an environment variable instead?
The advantage would be that no code change will be needed when we enable retry in the future.

Done.


clientconn.go, line 1054 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

There are things similar to this done in handleServiceConfig.

Move those to this function, too? Or add TODO?

Deleted setSC after unexporting retry configuration fields from service config structs.


clientconn.go, line 1633 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

Make a util struct for retryTokens and the following functions?

Done.


service_config.go, line 100 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

unexport this field. The only reason to export fields in this struct is to use WithServiceConfig, but we don't want users to use that.

Done.


service_config.go, line 106 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

This type doesn't need to be exported, either. Unless we want to provide another way to config retry other than service config.

Done.


stream.go, line 433 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

Not sure if this is what the TODO is talking about.

This variable is not used until line 463. Move this down after the ifs?

I meant to write "not check pushback" instead. It is important to check the server pushback value before calling throttleRetry() (which has side-effects), but there was talk of changing the spec to call throttleRetry() first instead. Comment updated.


stream.go, line 439 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

This is inaccurate if e != nil.

If e != nil, that means a non-numeric response was provided, which the spec says is a valid way for the server to indicate retry should not be performed.


stream.go, line 468 at r2 (raw file):

Previously, menghanl (Menghan Li) wrote…

To avoid the for loop, how about comparing
cs.numRetriesSincePushback with
\log_{multipler}\frac{max}{init} math link

The latter value can be calculated every time service config gets updated.

I removed the loop and replaced it by math.Pow(), but left out the optimization that avoids the need to call Pow() when we've hit the max backoff, since it should be quite rare and a minor win.



@dfawley dfawley force-pushed the attempt_retry_2 branch 2 times, most recently from 974b67e to e06ec2e Compare June 12, 2018 20:34
@menghanl menghanl modified the milestones: 1.13 Release, 1.14 Release Jun 20, 2018
@dfawley
Member Author

dfawley commented Jun 20, 2018

This should be ready to review again, thanks!

@dfawley dfawley force-pushed the attempt_retry_2 branch 2 times, most recently from 991340f to 3ce0654 Compare June 22, 2018 15:49
@menghanl
Contributor

Reviewed 5 of 7 files at r3, 1 of 3 files at r4, 1 of 2 files at r5, 7 of 9 files at r7, 1 of 1 files at r8, 1 of 1 files at r9.
Review status: 14 of 16 files reviewed, 5 unresolved discussions (waiting on @dfawley and @menghanl)


stream.go, line 329 at r9 (raw file):

}

type fatalErr struct {

Delete this struct.


stream.go, line 339 at r9 (raw file):

	if err != nil {
		if a.done != nil {
			a.done(balancer.DoneInfo{Err: err})

newStream() could be retried, so it shouldn't call done

done is already called by finish.


stream.go, line 427 at r9 (raw file):

	}

	// Wait for the current attempt to complete so trailers have arrived.

This comment is inappropriate. The wait happens in the TrailersOnly call.


stream.go, line 444 at r9 (raw file):

	hasPushback := false
	if cs.attempt.s != nil {
		if to, toErr := cs.attempt.s.TrailersOnly(); toErr != nil {

TrailersOnly also returns true when there's a connection error. But the stream is not necessarily trailers only.


stream.go, line 581 at r9 (raw file):

	// RecvMsg() returned a non-nil error before calling this function is valid.
	// We would have retried earlier if necessary.
	return cs.attempt.s.Trailer()

Could s be nil?



@dfawley
Member Author

dfawley commented Jun 27, 2018

Review status: 14 of 16 files reviewed, 5 unresolved discussions (waiting on @menghanl and @dfawley)


stream.go, line 329 at r9 (raw file):

Previously, menghanl (Menghan Li) wrote…

Delete this struct.

Done.


stream.go, line 339 at r9 (raw file):

Previously, menghanl (Menghan Li) wrote…

newStream() could be retried, so it shouldn't call done

done is already called by finish.

This actually was necessary, because I wasn't calling finish on the attempt in retry, which is another bug. Done.


stream.go, line 427 at r9 (raw file):

Previously, menghanl (Menghan Li) wrote…

This comment is inappropriate. The wait happens in the TrailersOnly call.

Done.


stream.go, line 444 at r9 (raw file):

Previously, menghanl (Menghan Li) wrote…

TrailersOnly also returns true when there's a connection error. But the stream is not necessarily trailers only.

I know, but names are hard. NoHeaders?


stream.go, line 581 at r9 (raw file):

Previously, menghanl (Menghan Li) wrote…

Could s be nil?

Yes, if the latest newStream attempt failed. Done.



@menghanl
Contributor

Reviewed 6 of 8 files at r10.
Review status: 13 of 15 files reviewed, 2 unresolved discussions (waiting on @dfawley and @menghanl)


stream.go, line 266 at r10 (raw file):

	// finish() will affect stats and tracing.
	// TODO(dfawley): move to newAttempt when per-attempt stats are implemented.
	cs.attempt.statsHandler = sh

Move statsHandler and traceInfo as parameters for newAttemptLocked()


stream.go, line 560 at r10 (raw file):

	// We would have retried earlier if necessary.
	if cs.attempt.s == nil {
		return nil

Should this return the previous received trailer?



@dfawley
Member Author

dfawley commented Jun 27, 2018

Review status: 12 of 15 files reviewed, 2 unresolved discussions (waiting on @menghanl)


stream.go, line 266 at r10 (raw file):

Previously, menghanl (Menghan Li) wrote…

Move statsHandler and traceInfo as parameters for newAttemptLocked()

Done.


stream.go, line 560 at r10 (raw file):

Previously, menghanl (Menghan Li) wrote…

Should this return the previous received trailer?

No...but I added a commitAttempt here since it will prevent races when users don't use the API correctly. And updated the comment.



@menghanl
Contributor

Reviewed 1 of 7 files at r3, 1 of 8 files at r10, 1 of 1 files at r11.
Review status: :shipit: complete! all files reviewed, all discussions resolved



@dfawley
Member Author

dfawley commented Jun 27, 2018

😌

@annismckenzie
Copy link

@dfawley I'm having a hard time understanding how this is implemented and supposed to be used. Setting the env variable will turn retry support on? Or does that only work when using a Service Config which is also not documented anywhere? I love gRPC but the docs stop right after the basic tutorial. 🙈 (as a sidenote: we've been using gRPC in production for over two years now so I'm not a first-time user and have read the source to some degree; the stab at the docs also isn't meant to be condescending) Any pointers would be appreciated.

@dfawley
Member Author

dfawley commented Aug 21, 2018

@annismckenzie

Setting the env variable will turn retry support on? Or does that only work when using a Service Config which is also not documented anywhere?

Yes, you will need to do both (set the environment variable and enable retries in the service config). Service configs are documented generally here, and in Go are set via the resolver (by calling ClientConn.NewServiceConfig). Our default DNS resolver will provide a service config if one exists in your TXT records (more info in the SC-in-DNS gRFC). Note that for backward compatibility reasons, the DNS resolver is not the default resolver in grpc-go, so you need to use grpc.Dial("dns:///<target>") to use it if you decide to go that route vs. making a custom resolver. The retry settings of the service config are documented in the retry gRFC. Let us know if anything isn't clear or you need more specific help.

the stab at the docs also isn't meant to be condescending

It's totally fair. We know our documentation is not ideal. Any specific suggestions you have about places where we can improve them and how to do so (e.g. what's better: godoc or our repo's Documentation directory or grpc.io) would be appreciated.

@lock lock bot locked as resolved and limited conversation to collaborators Feb 17, 2019