Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/pkg/balancer uses experimental gRPC APIs #499

Closed
dfawley opened this issue Aug 14, 2023 · 15 comments · Fixed by #580
Closed

go/pkg/balancer uses experimental gRPC APIs #499

dfawley opened this issue Aug 14, 2023 · 15 comments · Fixed by #580
Assignees

Comments

@dfawley
Copy link

dfawley commented Aug 14, 2023

These APIs are labeled as experimental, and they will be changed in the near-ish future. Details are in grpc/grpc-go#6472. Note that any package that uses gRPC's experimental APIs should itself be considered experimental, as a change to them would break compilation for the entire package that uses them. So we do NOT recommend the use of these features in any other libraries or applications that aren't tolerant to breakages like this.

https://github.com/googleapis/google-cloud-go may have another solution to the same type of problem, which is that it creates multiple channels and load balances between them. See https://github.com/googleapis/google-api-go-client/blob/caea95689f82049822552cd649765335123831e0/transport/grpc/pool.go#L30 for details.

@mrahs
Copy link
Collaborator

mrahs commented Sep 9, 2023

Thanks Doug. This has been working just fine so far and I'm worried about regressions if I change the implementation. I'm considering copying the relevant code over under an "internal" directory here to remove the direct dependency on the experimental API.

Would that be sufficient to remove the direct dependency on the experimental API? Do you see any licensing issues?
CL: mrahs@4cafa22

@mrahs mrahs self-assigned this Sep 9, 2023
mrahs added a commit to mrahs/remote-apis-sdks that referenced this issue Sep 9, 2023
Since this has been working just fine so far, I opted to copying the relevant
code over to avoid regressions from a change of implementation.
Fixes bazelbuild#499.
@dfawley
Copy link
Author

dfawley commented Sep 13, 2023

@mrahs,

I'm considering copying the relevant code over under an "internal" directory here to remove the direct dependency on the experimental API.

Maybe I'm misunderstanding, but I don't know how this could actually work unless you fork all of grpc. You're depending on grpc's behavior as much as the APIs themselves, so just copying the APIs won't let you keep using the functionality in grpc that is behind those APIs in the client implementation.

Worst case, I can help to just update the deprecated things to the newer versions. You'd still be using a currently-experimental API, which isn't great but is no worse than the current state at least.

Are you actually running into GFE max streams limits in practice? If so you might want to adopt something like I linked in the original comment, which seems to be working for the google cloud client APIs. That avoids using any experimental APIs of gRPC's. You may even be able to use the package directly, since it's not in internal. @codyoss might be able to help with this.

Feel free to put something on my calendar any time you see an opening to discuss if desired.

@mrahs
Copy link
Collaborator

mrahs commented Sep 15, 2023

Since the grpc-go version is pinned here, I was considering the simple case of not wanting this module to depend on the upstream one. My reading of the issue is that this module will break (either at compile or runtime) when grpc-go is updated to a version where the experimental APIs have changed (in behaviour or interface). Otherwise, it should continue to work.

I'll look into addressing this properly as you suggested.

@dfawley
Copy link
Author

dfawley commented Sep 15, 2023

Since the grpc-go version is pinned here

If you are a binary and not a library, then this will work for you in theory until you need to update grpc-go, or until you need to update a dependency that updates grpc-go. But in that case, no change is needed here at all.

However, since we both import to google3, this will be making my life more challenging, as we'll need to get you updated before gRPC-Go can delete the old APIs.

@mrahs
Copy link
Collaborator

mrahs commented Sep 15, 2023

Right.. this forces SDK users to use the same version until it's updated. I also forgot about the google3 copy.. The urgency of this make more sense now.

mrahs added a commit to mrahs/remote-apis-sdks that referenced this issue Oct 2, 2023
Since this has been working just fine so far, I opted to copying the relevant
code over to avoid regressions from a change of implementation.
Fixes bazelbuild#499.
@mrahs
Copy link
Collaborator

mrahs commented Oct 19, 2023

@codyoss I'm wondering how bad connections are handled in the pool implementation in cloud libraries: https://github.com/googleapis/google-api-go-client/blob/5e4c19e2fded180e57163208fcf264d26166a99e/transport/grpc/pool.go#L30
Is the client expected to close the entire pool if one connection fails?

IIUC, the experimental balancer package defines an API for managing the lifecycle of a connection: https://github.com/grpc/grpc-go/blob/e88e8498c6df39703d1992bc4b677e7ef477145d/balancer/balancer.go#L384 which allows regenerating a connection without affecting others.

@dfawley
Copy link
Author

dfawley commented Oct 20, 2023

Is the client expected to close the entire pool if one connection fails?

If a ClientConn has a lost connection, it would re-establish the connection on its own. But I'm not sure if the connection pool would skip it for dispatching RPCs - without something like that, some RPCs might error or be queued, even if there are other connections that could be used successfully instead.

the experimental balancer package defines an API for

The main goal is to avoid using that outside of gRPC until the API is stable, though.

@mrahs
Copy link
Collaborator

mrahs commented Oct 20, 2023

Thanks Doug. I think that answers my question about cloud libraries, but I'm still not sure how best to exclude connections in bad state from the pool. That balancer API helps keep track of good SubCons which the picker can choose from.

I'm considering using a similar approach to that in cloud libraries, which is by overriding Invoke and NewStream to delegate to a dispatcher before forwarding the call to the underlying ClientConn. However, I'm not yet sure how to exclude connections in a bad state.

@dfawley
Copy link
Author

dfawley commented Oct 23, 2023

I'm still not sure how best to exclude connections in bad state from the pool

We do have an API to show the connectivity state of the channel as a whole. But I wonder if you can just ignore that potential problem and just re-use what the client libraries have directly? You'll always have to deal with retrying failed RPCs anyway, right? Like if you have only one channel and an RPC on it errors.

@mrahs
Copy link
Collaborator

mrahs commented Oct 23, 2023

The method I found GetState is also experimental: https://github.com/grpc/grpc-go/blob/e88e8498c6df39703d1992bc4b677e7ef477145d/clientconn.go#L738
Is there a stable alternative?

Without reacting to connection status changes there is potential to increase retry rates. I don't know how bad might that be in the wild, but I'm considering my options to mitigate that. My best bet so far is to spread retries across the pool to minimize the chance of a request getting the same bad connection. Round-robin seems like the simplest policy that achieves that, but I don't know who is relying on the current affinity-based-and-least-busy policy here and I'd rather not find out by pulling the plug on them.

@dfawley
Copy link
Author

dfawley commented Oct 25, 2023

The method I found GetState is also experimental: grpc/grpc-go@e88e849/clientconn.go#L738

Ah, right. I don't think we particularly like that API since it's lossy (i.e. you can miss a temporary state), which isn't important for most use cases but also isn't really ideal. I suspect that one would be difficult to ever remove or change, unlike the balancer APIs, and there are no current plans to do so.

Without reacting to connection status changes there is potential to increase retry rates. I don't know how bad might that be in the wild

I honestly don't think it would be any worse than using a single channel with the current LB policy you are using.

RPC failures will occur for RPCs currently in flight if a backend goes down. At that point, new RPCs to that channel will not fail. The channel will go into CONNECTING, so they would be queued while the new connection is established. If no backend can be reached, then the RPCs queued on it will fail eventually, but most likely all the other channels will be down as well.

I don't know who is relying on the current affinity-based-and-least-busy policy here and I'd rather not find out by pulling the plug on them.

Interesting. Who uses the gRPC clients created with this LB policy? I had assumed it was your code. I thought you just copied this out of https://github.com/GoogleCloudPlatform/grpc-gcp-go for the connection pooling aspect, but for your own use, and because of the max concurrent streams issue you were running into on GFE.

@mrahs
Copy link
Collaborator

mrahs commented Apr 19, 2024

It turns out the code was actually copied from the gcp repo. I can test changes with bazelbuild/reclient, but I'm not sure about potential other clients. I'll simplify the policy to get rid of experimental API usage. I'll also try and make it pluggable such that if someone else really wants a more sophisticated policy they can plug their own (unless that turns into a deep rabbit hole).

@dfawley
Copy link
Author

dfawley commented Apr 19, 2024

All custom LB policies use experimental code, as the balancer package itself is experimental. :)

I still recommend starting with what I mentioned in the initial comment, and let me know if that solution isn't working for any reason, and we could go from there. I'm happy to meet and discuss details in person if it helps.

@mrahs
Copy link
Collaborator

mrahs commented Apr 24, 2024

Posted #554 as an initial step. Will post another one to remove the dependent code once the new balancer is fully rolled out.

I'm not 100% sure about using the grpc.CientConnInterface based on its godocs. I also had to resort to type assertion to close the underlying grpc.ClientConn. Any tips?

@dfawley
Copy link
Author

dfawley commented Apr 24, 2024

I don't think I would use that interface instead of a *grpc.ClientConn with type assertions in order to close.

Instead, I'd define my own interface like:

type grpcClientConn interface {
	grpc.ClientConnInterface
	Close() error
}

Now you know (and can verify in code) that *grpc.ClientConn implements it, as does the RRBalancer:

var _ grpcClientConn = (*grpc.ClientConn)(nil)
var _ grpcClientConn = (*balancer.RoundRobinConnPool)(nil)

Also, prefer returning concrete types over interfaces whenever possible: i.e. export RoundRobinConnPool and return it instead of grpcClientConn.

mrahs added a commit to mrahs/remote-apis-sdks that referenced this issue Jun 3, 2024
The custom balancer depends on experimental APIs from grpc-go.
It has since been replaced with a simple round robin balancer
that has been working well in production.

Fixes bazelbuild#499
mrahs added a commit to mrahs/remote-apis-sdks that referenced this issue Jun 3, 2024
The custom balancer depends on experimental APIs from grpc-go.
It has since been replaced with a simple round robin balancer
that has been working well in production.

Fixes bazelbuild#499
mrahs added a commit to mrahs/remote-apis-sdks that referenced this issue Jun 3, 2024
The custom balancer depends on experimental APIs from grpc-go.
It has since been replaced with a simple round robin balancer
that has been working well in production.

Fixes bazelbuild#499
@mrahs mrahs closed this as completed in #580 Jun 3, 2024
mrahs added a commit that referenced this issue Jun 3, 2024
The custom balancer depends on experimental APIs from grpc-go.
It has since been replaced with a simple round robin balancer
that has been working well in production.

Fixes #499
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants