
Discussion of Consistent Routing by TraceID #1724

Closed
joe-elliott opened this issue Sep 2, 2020 · 16 comments · Fixed by open-telemetry/opentelemetry-collector-contrib#1542
Labels
enhancement New feature or request

Comments

@joe-elliott
Contributor

Need

Creating this issue to continue a discussion from the OTel Agent/Collector meeting. We need a custom exporter with the following attributes. Our need is specifically for an OTLP exporter, but this could be done in a way that is exporter-agnostic.

  • Routes to a set of backends based on trace ID, such that spans with the same trace ID always end up on the same backend.
  • The set of backends will be dynamic, and the exporter must be able to handle additions and deletions.
  • The routing decision is made with a consistent hash ring, to reduce shuffling when backends are added or removed.

It appears that there is some work in related areas, such as batching by trace ID and routing to specific exporters. Our needs, however, are different, and I would like to discuss the feasibility of including them in the OTel Collector.

Proposal

We have no specific proposal at the moment. These needs are fairly complex, and individual pieces might need their own proposals. This issue exists to discuss whether the OTel Collector maintainers are willing to go down this path and, if so, the next steps to accomplish it.

cc @annanay25 @jpkrohling

@chris-smith-zocdoc
Contributor

A desire to keep routing concerns out of the collector was expressed in this PR.

As an alternative proposal, I think it would be useful to have a mechanism (processor) that allows the collector to add an HTTP header to the HTTP calls that the exporters make. This would allow an external system (like Envoy) to perform the consistent hashing and route each span to a consistent destination.

OTel Collector (adds hashed traceid header) -> Envoy -> OTel Collector (tail sampler)

I have a working prototype using this design, but I don't think it works with all exporters currently (it works for OTLP in my testing). The questionable piece is metadata.AppendToOutgoingContext:

// Imports assumed for this snippet (collector APIs circa September 2020):
import (
    "context"
    "strconv"

    tracepb "github.com/census-instrumentation/opencensus-proto/gen-go/trace/v1"
    "go.opentelemetry.io/collector/consumer/consumerdata"
    "google.golang.org/grpc/metadata"
)

// Bucket constants taken from the probabilistic sampler processor.
const (
    numHashBuckets     uint32 = 0x4000
    bitMaskHashBuckets        = numHashBuckets - 1
)

func (proc *traceheaderprocessor) ConsumeTraceData(ctx context.Context, td consumerdata.TraceData) error {
    // Group incoming spans by a hash bucket derived from their trace ID.
    // hash is the same helper the probabilistic sampler processor uses.
    spansByBucket := make(map[uint32][]*tracepb.Span)
    for _, span := range td.Spans {
        bucket := hash(span.TraceId, 22) & bitMaskHashBuckets
        spansByBucket[bucket] = append(spansByBucket[bucket], span)
    }

    // Forward each bucket as its own TraceData, with the bucket attached to
    // the outgoing gRPC metadata so an external load balancer (e.g. Envoy)
    // can hash on the OTel-TraceBucket header.
    for bucket, spans := range spansByBucket {
        newTraceData := consumerdata.TraceData{
            Node:         td.Node,
            Resource:     td.Resource,
            SourceFormat: td.SourceFormat,
            Spans:        spans,
        }

        bucketStr := strconv.FormatUint(uint64(bucket), 10)
        grpcCtx := metadata.AppendToOutgoingContext(ctx, "OTel-TraceBucket", bucketStr)

        if err := proc.nextConsumer.ConsumeTraceData(grpcCtx, newTraceData); err != nil {
            return err
        }
    }

    return nil
}

While this will allow consistent routing to work, there are still open questions about how to safely update the fleet of tail samplers without data loss or suboptimal sampling decisions.

@jpkrohling
Member

Glad to see this! As mentioned in the call yesterday, I've had this on my radar for some time now, and my initial idea was very close to what @annanay25 implemented before, except that it would be an exporter instead of a processor, and it wouldn't redirect spans to peers but rather to backends. This way, there could be 20 backend instances and only one load balancer.

     +-----------------------+
     |       Client          |
     | (jaeger, otlp, ...)   |
     |                       |
     +----------+------------+
                |
                |
                |
     +----------v------------+
     |     Load balancer     |
     +-----------------------+
     |                       |
     |     groupbytrace      |
     |                       |
     +-----------------------+
     |                       |
     |      loadbalance      |
     |                       |
     +-----+----------+------+
           |          |
+----------v----+ +---v-----------+
|               | |               |
|   otelcol     | |  otelcol      |
|   backend #1  | |  backend #n   |
|               | |               |
+---------------+ +---------------+

The idea is that an otelcol instance would act alone(*) as a TraceID-aware loadbalancer, getting the list of possible backends from a service discovery mechanism (DNS at first).

Using a consistent hashing mechanism to determine the backend was also on my mind, but there's one problem that even consistent hashing can't solve: when the ring changes, about 1/n of the keyspace gets shuffled, which means that spans that previously reached one backend might now reach a different one. This is one of the reasons for the groupbytrace processor: once a trace is released by that processor, the chances that the trace is complete are high, so we atomically flush the entire trace to the destination.

The groupbytrace processor does come with a memory hit, though, which might be a problem for high-throughput scenarios. For scenarios with minimal changes to the backends, the groupbytrace could simply be left out of the load balancer.

About the backends: I don't think it's feasible to make their receivers configurable. I would require the backend's receiver to be OTLP, so that we can use gRPC between the load balancer and the backend. This would offload a bunch of concerns to the gRPC client (retry, backoff, ...).

And finally: note that if your use case is tail-based sampling, you don't need the load balancer once you have the groupbytrace processor: you can have the groupbytrace in a fronting collector, and each backend can then make the sampling decision right away, as it would already have a complete trace to decide on.

(*) The tricky part is that the load balancer would be seen as a critical piece of infrastructure and, as such, needs to be highly available itself. With that, synchronization of the ring becomes another problem entirely (leader election, a consistent view of the data source, ...). This is something I was expecting to deal with in a future iteration, once we learn more about how well everything else works.
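
For reference, here is a minimal, self-contained sketch of the consistent hashing idea described above. The names (hashRing, endpointFor, the FNV hash, the backend addresses) are invented for illustration and this is not collector code: each backend is hashed onto a ring of positions, a trace ID is routed to the first position at or after its own hash, and adding or removing one of n backends therefore only remaps roughly 1/n of the trace IDs.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// ringEntry is one position on the ring, owned by a backend endpoint.
type ringEntry struct {
    pos      uint32
    endpoint string
}

// hashRing is a minimal consistent hash ring: endpoints are hashed onto a
// circle of uint32 positions, and a trace ID is routed to the owner of the
// first position at or after the trace ID's own hash.
type hashRing struct {
    entries []ringEntry // sorted by pos
}

func hash32(b []byte) uint32 {
    h := fnv.New32a()
    h.Write(b)
    return h.Sum32()
}

// newHashRing places every endpoint on the ring several times ("virtual
// nodes") so the keyspace is spread more evenly.
func newHashRing(endpoints []string, virtualNodes int) *hashRing {
    r := &hashRing{}
    for _, ep := range endpoints {
        for i := 0; i < virtualNodes; i++ {
            pos := hash32([]byte(fmt.Sprintf("%s-%d", ep, i)))
            r.entries = append(r.entries, ringEntry{pos: pos, endpoint: ep})
        }
    }
    sort.Slice(r.entries, func(i, j int) bool { return r.entries[i].pos < r.entries[j].pos })
    return r
}

// endpointFor returns the backend owning the given trace ID.
func (r *hashRing) endpointFor(traceID []byte) string {
    pos := hash32(traceID)
    i := sort.Search(len(r.entries), func(i int) bool { return r.entries[i].pos >= pos })
    if i == len(r.entries) { // wrap around the circle
        i = 0
    }
    return r.entries[i].endpoint
}

func main() {
    ring := newHashRing([]string{"backend-1:4317", "backend-2:4317", "backend-3:4317"}, 100)
    traceID := []byte("4bf92f3577b34da6a3ce929d0e0e4736") // hypothetical trace ID
    fmt.Println(ring.endpointFor(traceID))                // same trace ID -> same backend
}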

@bogdandrutu
Member

This is how I see this: an LBExporter that can start multiple instances of a "delegated" exporter and route requests based on some rules:

     +---------------------+
     |       Client        |
     | (jaeger, otlp, ...) |
     +----------+----------+
                |
                |
     +----------v----------+
     |     groupbytrace    |
     |       processor     |
     +---------------------+
                |
                |
  +-------------v---------------+
  |        loadbalance          |
  |          exporter           |
  |                             |
  | exporter #1 ... exporter #N |
  |                             |
  +-----------------------------+

This way we don't have to change the current pipeline design.
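
As a purely illustrative sketch of that shape (invented names and a tiny stand-in traceExporter interface, not the real collector APIs): the pipeline is wired to a single lb exporter, which owns one delegated exporter per backend and picks exactly one of them per trace.

package main

import (
    "context"
    "fmt"
)

// trace is a stand-in for a batch of spans sharing one trace ID.
type trace struct {
    traceID []byte
    // spans elided
}

// traceExporter stands in for the collector's exporter interface.
type traceExporter interface {
    ConsumeTrace(ctx context.Context, t trace) error
    Shutdown(ctx context.Context) error
}

// lbExporter is the single exporter the pipeline is wired to. Internally it
// owns one delegated exporter per backend endpoint and routes every trace to
// exactly one of them, so the pipeline design itself does not change.
type lbExporter struct {
    exporters map[string]traceExporter    // endpoint -> delegated exporter ("exporter #1 ... #N")
    pick      func(traceID []byte) string // routing rule, e.g. a consistent hash ring
}

func (e *lbExporter) ConsumeTrace(ctx context.Context, t trace) error {
    endpoint := e.pick(t.traceID)
    exp, ok := e.exporters[endpoint]
    if !ok {
        return fmt.Errorf("no delegated exporter for endpoint %q", endpoint)
    }
    return exp.ConsumeTrace(ctx, t)
}

func (e *lbExporter) Shutdown(ctx context.Context) error {
    for _, exp := range e.exporters {
        if err := exp.Shutdown(ctx); err != nil {
            return err
        }
    }
    return nil
}

// printExporter is a demo delegate that only prints where a trace would go.
type printExporter struct{ endpoint string }

func (p *printExporter) ConsumeTrace(_ context.Context, t trace) error {
    fmt.Printf("%x -> %s\n", t.traceID, p.endpoint)
    return nil
}

func (p *printExporter) Shutdown(context.Context) error { return nil }

func main() {
    lb := &lbExporter{
        exporters: map[string]traceExporter{
            "backend-1:4317": &printExporter{"backend-1:4317"},
            "backend-2:4317": &printExporter{"backend-2:4317"},
        },
        // Trivial routing rule for the demo: parity of the first trace ID byte.
        pick: func(traceID []byte) string {
            if traceID[0]%2 == 0 {
                return "backend-1:4317"
            }
            return "backend-2:4317"
        },
    }
    _ = lb.ConsumeTrace(context.Background(), trace{traceID: []byte{0x01, 0x02}})
}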

@jpkrohling
Member

jpkrohling commented Sep 4, 2020

@bogdandrutu how would exporter #1 know which host to send the data to?

@joe-elliott
Contributor Author

joe-elliott commented Sep 4, 2020

Our needs are exactly in line with what @jpkrohling is describing: dynamic discovery of backends and maintaining a hash ring for minimal disruption during scaling events. I hadn't considered, but really like, the idea of relying on the groupbytrace processor to reduce fractured traces when scaling events do occur.

synchronization of the ring becomes another problem entirely

Agreed, this needs some attention. We are seeing success using both gossip and external key-value stores to maintain a ring, depending on the durability requirements.

@bogdandrutu
Member

bogdandrutu commented Sep 7, 2020

@jpkrohling "exporter # 1" is an instance of an exporter (otlp, etc.) created by the lb exporter. The lb exporter is a composed exporter which does the handling of the config changes and creates / deletes instances of real exporters as well as routes requests to the specific instance.

So in the pipeline you connect only the lb exporter, which then routes to the right instance, "exporter #X", which is just a simple exporter like today.

@jpkrohling
Member

The lb exporter is a composed exporter that handles config changes, creates/deletes instances of real exporters, and routes requests to the specific instance.

When the list of backend collectors changes, what happens? I'm trying to wrap my head around your proposal.

@bogdandrutu
Member

It depends on how we want to implement the "configuration watch":

  1. If the LB is configured to retrieve this configuration from an HTTP endpoint, for example, then the overall collector config/pipeline does not change, so the LB composite exporter should shut down old instances and start new instances based on the config.
  2. If the LB has a static config with all the endpoints, then somehow the collector config needs to be updated. If that is the case, we should rely on the collector mechanism to watch for config changes, which will shut down all the components and restart new components with the new config.

I think you're trying to implement the first option, but I gave you both options to make sure we are on the same page.

@jpkrohling
Member

In a first implementation, the list of available backends should come from DNS ("A" records), as is usual for this kind of workload, especially when considering cloud-native platforms like Kubernetes. This is also what gRPC client-side load balancing does. A stretch goal would be to also support reading it from etcd, which provides a few benefits covering the intricacies of distributed systems.
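
As a concrete illustration of the DNS option, a resolver could look roughly like this. This is a sketch using only the standard library, with a hypothetical headless-service hostname and port; it is not the eventual resolver implementation.

package main

import (
    "context"
    "fmt"
    "net"
    "sort"
    "time"
)

// resolveBackends looks up the A/AAAA records for the given hostname (for
// example, a Kubernetes headless service) and returns one backend endpoint
// per resolved address.
func resolveBackends(ctx context.Context, hostname string, port int) ([]string, error) {
    addrs, err := net.DefaultResolver.LookupIPAddr(ctx, hostname)
    if err != nil {
        return nil, err
    }
    backends := make([]string, 0, len(addrs))
    for _, a := range addrs {
        backends = append(backends, fmt.Sprintf("%s:%d", a.IP.String(), port))
    }
    sort.Strings(backends) // stable order makes change detection easier
    return backends, nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    backends, err := resolveBackends(ctx, "otelcol-backends.observability.svc.cluster.local", 4317)
    if err != nil {
        fmt.Println("resolution failed:", err)
        return
    }
    fmt.Println(backends)
}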

then the overall collector config/pipeline does not change, so the LB composite exporter should shut down old instances and start new instances based on the config

This is what I'm not getting. Did you mean that the whole collector's configuration would come from an HTTP endpoint, or just the list of backends? How would the collector's configuration look, in your opinion?

@bogdandrutu
Member

exporters:
  lb:
    config_endpoint: dns_address
    exporter_type: "otlp" # we can support OTLP/Jaeger/etc.
    exporter_config:
      - some specific config to otlp except endpoint

The config_endpoint returns something like trace_id_range -> endpoint. When a config is received, the lb exporter will create one otlp exporter instance per endpoint in the returned config and will route based on the trace_id_range.
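
To illustrate the routing step only (invented names, and assuming the config endpoint hands back explicit, non-overlapping ranges rather than the exporter computing a ring itself), picking the endpoint for a trace ID could be a simple binary search over the configured ranges:

package main

import (
    "encoding/binary"
    "fmt"
    "sort"
)

// rangeAssignment is one entry of the configuration returned by the
// hypothetical config_endpoint: every trace ID whose leading 8 bytes fall in
// [start, end] is owned by endpoint.
type rangeAssignment struct {
    start, end uint64
    endpoint   string
}

type rangeRouter struct {
    ranges []rangeAssignment // sorted by start, non-overlapping
}

func newRangeRouter(ranges []rangeAssignment) *rangeRouter {
    sorted := append([]rangeAssignment(nil), ranges...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })
    return &rangeRouter{ranges: sorted}
}

// endpointFor maps a 16-byte trace ID to its owning endpoint by comparing the
// ID's leading 8 bytes against the configured ranges.
func (r *rangeRouter) endpointFor(traceID []byte) (string, error) {
    key := binary.BigEndian.Uint64(traceID[:8])
    // Find the first range starting after key, then step back one.
    i := sort.Search(len(r.ranges), func(i int) bool { return r.ranges[i].start > key })
    if i == 0 {
        return "", fmt.Errorf("trace ID below first configured range")
    }
    ra := r.ranges[i-1]
    if key > ra.end {
        return "", fmt.Errorf("trace ID not covered by any range")
    }
    return ra.endpoint, nil
}

func main() {
    router := newRangeRouter([]rangeAssignment{
        {start: 0x0000000000000000, end: 0x7fffffffffffffff, endpoint: "backend-1:4317"},
        {start: 0x8000000000000000, end: 0xffffffffffffffff, endpoint: "backend-2:4317"},
    })
    traceID := make([]byte, 16)
    traceID[0] = 0x9a // falls into the second range
    ep, err := router.endpointFor(traceID)
    fmt.Println(ep, err)
}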

@jpkrohling
Member

That means there's another process, somewhere, calculating and rebalancing which backends receive each trace ID range? We'd still have the leader election/data sync problem that we have with DNS... A "batteries included" approach would probably make users' lives easier, in my opinion.

If I understood your proposal correctly, the load balancer exporter would have to know details about the underlying exporters, as the "endpoint" is configured differently depending on the exporter. See the Kafka exporter, for example.

I think we should really do the easy solution, at least for the first implementation:

  1. assume there's only one lb instance, so that we don't have any need to do leader election/data sync between instances. Once we hear from users and see how this looks in production, we can implement a solution based on the real problems we see
  2. export only to OTLP, with the assumption that the second-level backends are OpenTelemetry-compatible backends
  3. get the list of backends from DNS and rebuild the ring whenever that list changes

Once we have a solution out there, let's see what people need. If they need to recalculate the ring by themselves, we can add support for the endpoint configuration.
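
A rough sketch of point 3 under those assumptions, with invented names and placeholder types instead of the real collector/exporter interfaces: whenever the resolver reports a new backend list, start exporters for new endpoints, shut down exporters for endpoints that disappeared, and rebuild the ring from the new list.

package main

import (
    "context"
    "sort"
)

// traceExporter and exporterFactory are placeholders for the real OTLP
// exporter type and factory, which are elided to keep the sketch short.
type traceExporter interface{ Shutdown(ctx context.Context) error }

type exporterFactory func(endpoint string) (traceExporter, error)

// hashRing here is only a stub; a real implementation would be a consistent
// hash ring like the one sketched earlier in this thread.
type hashRing struct{ endpoints []string }

func newHashRing(endpoints []string) *hashRing {
    sorted := append([]string(nil), endpoints...)
    sort.Strings(sorted)
    return &hashRing{endpoints: sorted}
}

type loadBalancer struct {
    newExporter exporterFactory
    exporters   map[string]traceExporter // endpoint -> delegated exporter
    ring        *hashRing
}

// onBackendsChanged is called by the resolver (DNS, static, ...) with the
// freshly resolved endpoint list.
func (lb *loadBalancer) onBackendsChanged(ctx context.Context, endpoints []string) error {
    seen := make(map[string]bool, len(endpoints))

    // New endpoints: create a delegated exporter for each.
    for _, ep := range endpoints {
        seen[ep] = true
        if _, exists := lb.exporters[ep]; !exists {
            exp, err := lb.newExporter(ep)
            if err != nil {
                return err
            }
            lb.exporters[ep] = exp
        }
    }

    // Removed endpoints: shut their exporters down and forget them.
    for ep, exp := range lb.exporters {
        if !seen[ep] {
            if err := exp.Shutdown(ctx); err != nil {
                return err
            }
            delete(lb.exporters, ep)
        }
    }

    // Replace the ring with one built from the new list (locking elided).
    lb.ring = newHashRing(endpoints)
    return nil
}

type noopExporter struct{}

func (noopExporter) Shutdown(context.Context) error { return nil }

func main() {
    lb := &loadBalancer{
        newExporter: func(endpoint string) (traceExporter, error) { return noopExporter{}, nil },
        exporters:   map[string]traceExporter{},
    }
    _ = lb.onBackendsChanged(context.Background(), []string{"backend-1:4317", "backend-2:4317"})
}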

@jpkrohling
Member

@joe-elliott, @bogdandrutu would you be open to reviewing a proposal as a PR with the ideas that I have in mind? It might be easier to discuss by looking at concrete code.

@jpkrohling
Member

jpkrohling commented Oct 7, 2020

I'm still working on this one, but I have some code to show already: https://github.com/jpkrohling/opentelemetry-collector-contrib/tree/jpkrohling/1724-loadbalancing-processor

Each commit in this branch is a step in the direction of the full solution, and I believe it's easier to check them individually.

I'm still missing the actual DNS resolver logic and a full end-to-end test, but I think it already has the shape of what we discussed here.

The code is pretty much complete for a first draft, but I have yet to run some e2e tests.

@jpkrohling
Member

Opened open-telemetry/opentelemetry-collector-contrib#1231 with the proposal.

@jpkrohling
Member

@bogdandrutu, @joe-elliott, how can we move forward with this? I realize that the PR/branch with the proposal looks scary, but the core is this:

During the bootstrap, a list of servers is obtained via a "resolver" (DNS or static):
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L104-L106

Whenever there's a backend change (new/removed), the ring is replaced:
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L111-L128

When a new backend is detected, a new exporter is created and added to a map[endpoint]component.TraceExporter:
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L134-L152

When a new batch arrives, the backend for that trace ID is selected and the batch is sent:
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L181-L207

@joe-elliott
Contributor Author

I don't know what the bar for merging to contrib is, but I'm confident that @jpkrohling has something worth merging if we're talking about it. Grafana is swamped right now preparing for an upcoming conference, but once things settle, if this is in contrib, we will experiment with it and give feedback.

@annanay25

tigrannajaryan pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Nov 10, 2020

* Added the backend resolver
* Added the metrics definitions

**Link to tracking Issue:** Partially solves open-telemetry/opentelemetry-collector#1724, next step after #1349

**Testing:** unit tests

**Documentation:** godoc

Signed-off-by: Juraci Paixão Kröhling <[email protected]>
@andrewhsu added the enhancement (New feature or request) label on Jan 6, 2021
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021: Bump google.golang.org/protobuf from 1.25.0 to 1.26.0 in /exporters/otlp (open-telemetry#1724)