Added ExportOne unary RPC call to agent trace service #202
Conversation
@@ -45,6 +45,9 @@ service TraceService {
   // For performance reasons, it is recommended to keep this RPC
   // alive for the entire life of the application.
   rpc Export(stream ExportTraceServiceRequest) returns (stream ExportTraceServiceResponse) {}
+
+  // ExportOne is used to export a single span batch as a unary rpc
+  rpc ExportOne(ExportTraceServiceRequest) returns (ExportTraceServiceResponse) {}
How about ExportUnary?
Instead of doing this, what if we just do single operations on top of the bidirectional stream? We plan to incorporate this feedback in OpenTelemetry, but it would be good if we don't have to implement a new protocol here.
Unary also gives us strong ack guarantees that the current implementation of streaming does not have. I guess we could technically do that, but it would involve a lot of bookkeeping on the client (and I suppose a bit on the server), as it would need to keep track of passed/failed requests. I think we'd end up re-implementing gRPC's unary calls at the app level. I'm not sure it would be worth managing all that at this level when gRPC can already generate all of that code for us very reliably.
@bogdandrutu any comments on this? We have been running this in production for a while now and it has been successful with no visible downsides.
Based on my understanding, unary RPCs do not help with load balancing. If the connection (channel) is not reset on every RPC, then load balancing will not happen, so the whole point of this PR is unclear to me (at least from the PR description). I do understand that unary may help with ACKs when executed synchronously (async has the same problem), but that problem can also be solved with the bidirectional RPC by waiting for the response instead of streaming the next message (which I am not necessarily suggesting we do, but I want to clarify why we are doing it this way). I do understand that you ran this in production, but it is unclear to me whether the bidirectional RPC has a real problem or was used in the wrong way. I would really appreciate it if we spend some time understanding why we are changing this rather than jumping to "this works for me, let's do it". It would be very valuable to understand why streaming did not work for you. It would also help if you already have a client/server implementation that shows what you did there.
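For illustration, here is a minimal sketch of that "wait for the response" approach over the existing bidirectional Export RPC. It assumes the generated Go bindings from opencensus-proto (import path and method names may differ) and assumes the server sends exactly one response per received message; responses carry no correlation id, so the only link between a request and its ack is ordering.

```go
package main

import (
	"context"

	agenttracepb "github.com/census-instrumentation/opencensus-proto/gen-go/agent/trace/v1"
	"google.golang.org/grpc"
)

// exportWithAcks sends one batch at a time on the bidirectional stream and
// blocks until the server responds before sending the next batch.
func exportWithAcks(ctx context.Context, conn *grpc.ClientConn, batches []*agenttracepb.ExportTraceServiceRequest) error {
	stream, err := agenttracepb.NewTraceServiceClient(conn).Export(ctx)
	if err != nil {
		return err
	}
	for _, batch := range batches {
		if err := stream.Send(batch); err != nil {
			return err // stream is broken; the caller decides which batches to retry
		}
		// Treat the next response as the ack for the batch just sent.
		if _, err := stream.Recv(); err != nil {
			return err
		}
	}
	return stream.CloseSend()
}
```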
@bogdandrutu that is right, though client-side load balancing is supported by many clients and will cycle through different channels for requests. Unless, of course, the reporter works against a specific channel, in which case an intermediate LB would be needed. But I would think the same could be done for a bidi stream: you could still cycle which channel's open stream is used for a particular batch.
Sorry for leaving out an important detail from the original comment. This is intended to solve load balancing only with the help of an L7 balancer sitting in front of the servers; it wouldn't work without the LB, as we are not complementing it with any special code for client-side balancing. We used an L7 load balancer (Ambassador/Envoy) with gRPC support in front of the server-side OC instances, and it was able to balance unary traffic out of the box without any special changes on the client or server. For streaming, it had to keep the channels open once established and let the client/server dictate how or when they wanted to re-balance. With streaming, we used to take advantage of the max connection age feature to close streams after a certain time so load would auto-balance periodically.
We also experimented with special load-balancing logic on the client and the server, and the prototypes did work, but we felt that having the option to export spans as unary RPCs, combined with an L7 proxy that supports HTTP/2 and gRPC out of the box, was much simpler overall. There were no special code changes in OC other than adding handlers for the unary RPC, which share most of the code with the existing streaming handler.
These two patches are the only other changes we made. These docs show a somewhat similar setup where unary RPCs load-balance out of the box: https://cloud.google.com/solutions/using-istio-for-internal-load-balancing-of-grpc-services
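For reference, a minimal sketch of the max-connection-age approach mentioned above, using gRPC-Go's keepalive server parameters; the durations are illustrative, not recommendations.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server that closes connections after a fixed age,
// forcing clients to reconnect so an external balancer can redistribute them.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute,  // periodically recycle long-lived streams
		MaxConnectionAgeGrace: 30 * time.Second, // give in-flight RPCs time to finish
	}))
}
```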
One note regarding the bi-directional stream with max age: it has the downside that the termination of the connection is perceived by the client as an error that requires a retry, which makes it harder to differentiate real failures from the ones expected due to age.
To be fair, all of this can be worked around to get near-perfect periodic re-balancing with streaming, but depending on how sophisticated the support needs to be, it could complicate the exporter and/or receiver code considerably.
We’ve been facing load balancing issues with long-lived bi-directional gRPC streams. We’ve tried a number of solutions, including setting a maximum age on connections, terminating connections intelligently under high imbalance, and switching to unary gRPC communication. While the other solutions work as well, they are considerably more complex than switching to unary communication and require some amount of tuning to work properly under different workloads. Unary communication, on the other hand, works out of the box for most scenarios without the need to tweak settings or set up any additional components. Unary communication adds a bit of latency, but we think that cost is worth paying to get out-of-the-box load balancing.
We propose to add an optional unary trace exporter to the OC collector and agent while keeping streaming communication as the default option. This should be a very low-risk change, as it does not change anything for existing or new users. Users can opt in to using unary communication instead of streaming if they want. This is also orthogonal to streaming communication, as features can still be added or additional components shipped to help LB in streaming mode.
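As a rough sketch of what the opt-in unary path could look like on the exporter side, assuming the proto change above generates an ExportOne method on the Go TraceServiceClient (method and package names may differ in the final bindings):

```go
package main

import (
	"context"

	commonpb "github.com/census-instrumentation/opencensus-proto/gen-go/agent/common/v1"
	agenttracepb "github.com/census-instrumentation/opencensus-proto/gen-go/agent/trace/v1"
	tracepb "github.com/census-instrumentation/opencensus-proto/gen-go/trace/v1"
	"google.golang.org/grpc"
)

// exportOneBatch sends a single span batch as an ordinary unary RPC, so an L7
// proxy can route each call independently of any previous one.
func exportOneBatch(ctx context.Context, conn *grpc.ClientConn, node *commonpb.Node, spans []*tracepb.Span) error {
	client := agenttracepb.NewTraceServiceClient(conn)
	// Unlike the stream, every unary request has to carry the node identity,
	// since there is no long-lived stream to establish it once.
	_, err := client.ExportOne(ctx, &agenttracepb.ExportTraceServiceRequest{
		Node:  node,
		Spans: spans,
	})
	return err
}
```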
Implications
The biggest implication of using unary communication is that node and process information will always have to be attached to the span batches or spans. This is already true for collector-to-collector and agent-to-collector communication, although we might add some additional checks. We do not plan on adding unary support to ocagent when it is used as an exporter by the Go OC instrumentation library. Another downside is a bit of additional latency, but it is an acceptable price to pay in most scenarios, as overall throughput and resource requirements are not really affected.
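One possible shape for the "additional checks" mentioned above is sketched below: a unary handler that rejects batches without node information. The handler and type names follow the proto change but are otherwise hypothetical.

```go
package main

import (
	"context"

	agenttracepb "github.com/census-instrumentation/opencensus-proto/gen-go/agent/trace/v1"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type traceServer struct{}

// ExportOne validates that the batch carries node information before handing
// it to the same processing pipeline the streaming handler uses.
func (s *traceServer) ExportOne(ctx context.Context, req *agenttracepb.ExportTraceServiceRequest) (*agenttracepb.ExportTraceServiceResponse, error) {
	if req.GetNode() == nil {
		// A stream can establish the node once on its first message; a unary
		// request has no such context, so the node must be present inline.
		return nil, status.Error(codes.InvalidArgument, "node must be set on every unary export request")
	}
	// ... hand off spans to the shared processing pipeline here ...
	return &agenttracepb.ExportTraceServiceResponse{}, nil
}
```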
Backward compatibility
The change is 100% backward compatible and does not break any existing workloads or change any defaults. By default, OC will continue to use the streaming RPC.
Load balancing
In the example below, four collectors were ingesting ~100k spans per second but the load was not evenly distributed with gRPC streams. The load automatically balanced almost perfectly once the source collector was re-configured to use unary communication instead of streaming. After the change, load distribution changed to ~25k spans/sec per collector.