
Discussion of Consistent Routing by TraceID #1724

Closed
joe-elliott opened this issue Sep 2, 2020 · 16 comments · Fixed by open-telemetry/opentelemetry-collector-contrib#1542
Labels
enhancement New feature or request

Comments

@joe-elliott
Contributor

Need

Creating this issue to continue a discussion from the OTel Agent/Collector meeting. We need a custom exporter with the following attributes. Our need is specifically for an OTLP exporter, but this could be done in a way that is exporter-agnostic.

  • Routes to a set of backends based on trace ID, such that spans with the same trace ID always end up on the same backend.
  • The set of backends will be dynamic, and the exporter must be able to handle additions and deletions.
  • The routing decision is made with a consistent hash ring, to reduce shuffling when backends are added or removed.

It appears that there is some work in related areas, such as batching by trace ID and routing to specific exporters. Our needs, however, are different, and I would like to discuss the feasibility of including them in the OTel Collector.

Proposal

We have no specific proposal at the moment. These needs are fairly complex, and individual pieces might need their own proposals. This issue exists to discuss whether the OTel Collector maintainers are willing to go down this path and, if so, the next steps to accomplish it.

cc @annanay25 @jpkrohling

@chris-smith-zocdoc
Contributor

A desire to keep routing concerns out of the collector was expressed in this PR.

As an alternative proposal, I think it would be useful to have a mechanism (processor) that allows the collector to add an HTTP header to the HTTP calls that the exporters make. This would allow an external system (like Envoy) to perform the consistent hashing and route each span to a consistent destination.

OTel Collector (adds hashed traceid header) -> Envoy -> OTel Collector (tail sampler)

I have a working prototype using this design, but I don't think it works with all exporters currently (it works for OTLP in my testing). The questionable piece is metadata.AppendToOutgoingContext:

// Imports assumed for this snippet (collector APIs circa September 2020):
import (
    "context"
    "strconv"

    tracepb "github.com/census-instrumentation/opencensus-proto/gen-go/trace/v1"
    "go.opentelemetry.io/collector/consumer/consumerdata"
    "google.golang.org/grpc/metadata"
)

// Bucket constants taken from the probabilistic sampler processor.
const (
    numHashBuckets     uint32 = 0x4000
    bitMaskHashBuckets        = numHashBuckets - 1
)

func (proc *traceheaderprocessor) ConsumeTraceData(ctx context.Context, td consumerdata.TraceData) error {
    // Group incoming spans by a hash bucket derived from their trace ID.
    // hash is the same helper the probabilistic sampler processor uses.
    spansByBucket := make(map[uint32][]*tracepb.Span)
    for _, span := range td.Spans {
        bucket := hash(span.TraceId, 22) & bitMaskHashBuckets
        spansByBucket[bucket] = append(spansByBucket[bucket], span)
    }

    // Forward each bucket as its own TraceData, with the bucket attached to
    // the outgoing gRPC metadata so an external load balancer (e.g. Envoy)
    // can hash on the OTel-TraceBucket header.
    for bucket, spans := range spansByBucket {
        newTraceData := consumerdata.TraceData{
            Node:         td.Node,
            Resource:     td.Resource,
            SourceFormat: td.SourceFormat,
            Spans:        spans,
        }

        bucketStr := strconv.FormatUint(uint64(bucket), 10)
        grpcCtx := metadata.AppendToOutgoingContext(ctx, "OTel-TraceBucket", bucketStr)

        if err := proc.nextConsumer.ConsumeTraceData(grpcCtx, newTraceData); err != nil {
            return err
        }
    }

    return nil
}

While this will allow consistent routing to work, there are still open questions about how to safely update the fleet of tail samplers without data loss or suboptimal sampling decisions.

@jpkrohling
Member

Glad to see this! As mentioned in the call yesterday, I've had this on my radar for some time now, and my initial idea was very close to what @annanay25 implemented before, except that it would be an exporter instead of a processor, and it wouldn't redirect spans to peers but rather to backends. This way, there could be 20 backend instances and only one load balancer.

     +-----------------------+
     |       Client          |
     | (jaeger, otlp, ...)   |
     |                       |
     +----------+------------+
                |
                |
                |
     +----------v------------+
     |     Load balancer     |
     +-----------------------+
     |                       |
     |     groupbytrace      |
     |                       |
     +-----------------------+
     |                       |
     |      loadbalance      |
     |                       |
     +-----+----------+------+
           |          |
+----------v----+ +---v-----------+
|               | |               |
|   otelcol     | |  otelcol      |
|   backend #1  | |  backend #n   |
|               | |               |
+---------------+ +---------------+

The idea is that an otelcol instance would act alone(*) as a TraceID-aware loadbalancer, getting the list of possible backends from a service discovery mechanism (DNS at first).

Using a consistent hashing mechanism to determine the backend was also on my mind, but there's one problem that even consistent hashing can't solve: when the ring changes, about 1/n of the keyspace gets shuffled, which means that spans that previously reached one backend might now reach a different one. This is one of the reasons for the groupbytrace processor: once a trace is released by that processor, the chances that the trace is complete are high, so we atomically flush the entire trace to the destination.

The groupbytrace processor does come with a memory hit, though, which might be a problem for high-throughput scenarios. For scenarios with minimal changes to the backends, the groupbytrace could simply be left out of the load balancer.

About the backends: I don't think it's feasible to make their receivers configurable. I would require the backend's receiver to be OTLP, so that we can use gRPC between the load balancer and the backend. This would offload a bunch of concerns to the gRPC client (retry, backoff, ...).

And finally: note that if your use case is tail-based sampling, you don't need the load balancer once you have the groupbytrace processor: you can have the groupbytrace in a fronting collector, and each backend can then make the sampling decision right away, as it would already have a complete trace to decide on.

(*) The tricky part is that the load balancer would be seen as a critical piece of infrastructure and, as such, needs to be highly available itself. With that, synchronization of the ring becomes another problem entirely (leader election, a consistent view of the data source, ...). This is something I was expecting to deal with in a future iteration, once we learn more about how well everything else works.
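
For reference, here is a minimal, self-contained sketch of the consistent hashing idea described above. The names (hashRing, endpointFor, the FNV hash, the backend addresses) are invented for illustration and this is not collector code: each backend is hashed onto a ring of positions, a trace ID is routed to the first position at or after its own hash, and adding or removing one of n backends therefore only remaps roughly 1/n of the trace IDs.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// ringEntry is one position on the ring, owned by a backend endpoint.
type ringEntry struct {
    pos      uint32
    endpoint string
}

// hashRing is a minimal consistent hash ring: endpoints are hashed onto a
// circle of uint32 positions, and a trace ID is routed to the owner of the
// first position at or after the trace ID's own hash.
type hashRing struct {
    entries []ringEntry // sorted by pos
}

func hash32(b []byte) uint32 {
    h := fnv.New32a()
    h.Write(b)
    return h.Sum32()
}

// newHashRing places every endpoint on the ring several times ("virtual
// nodes") so the keyspace is spread more evenly.
func newHashRing(endpoints []string, virtualNodes int) *hashRing {
    r := &hashRing{}
    for _, ep := range endpoints {
        for i := 0; i < virtualNodes; i++ {
            pos := hash32([]byte(fmt.Sprintf("%s-%d", ep, i)))
            r.entries = append(r.entries, ringEntry{pos: pos, endpoint: ep})
        }
    }
    sort.Slice(r.entries, func(i, j int) bool { return r.entries[i].pos < r.entries[j].pos })
    return r
}

// endpointFor returns the backend owning the given trace ID.
func (r *hashRing) endpointFor(traceID []byte) string {
    pos := hash32(traceID)
    i := sort.Search(len(r.entries), func(i int) bool { return r.entries[i].pos >= pos })
    if i == len(r.entries) { // wrap around the circle
        i = 0
    }
    return r.entries[i].endpoint
}

func main() {
    ring := newHashRing([]string{"backend-1:4317", "backend-2:4317", "backend-3:4317"}, 100)
    traceID := []byte("4bf92f3577b34da6a3ce929d0e0e4736") // hypothetical trace ID
    fmt.Println(ring.endpointFor(traceID))                // same trace ID -> same backend
}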

@bogdandrutu
Member

This is how I see this: an LBExporter that can start multiple instances of a "delegated" exporter and route requests based on some rules:

     +---------------------+
     |       Client        |
     | (jaeger, otlp, ...) |
     +----------+----------+
                |
                |
     +----------v----------+
     |     groupbytrace    |
     |       processor     |
     +---------------------+
                |
                |
  +-------------v---------------+
  |        loadbalance          |
  |          exporter           |
  |                             |
  | exporter #1 ... exporter #N |
  |                             |
  +-----------------------------+

This way we don't have to change the current pipeline design.
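
As a purely illustrative sketch of that shape (invented names and a tiny stand-in traceExporter interface, not the real collector APIs): the pipeline is wired to a single lb exporter, which owns one delegated exporter per backend and picks exactly one of them per trace.

package main

import (
    "context"
    "fmt"
)

// trace is a stand-in for a batch of spans sharing one trace ID.
type trace struct {
    traceID []byte
    // spans elided
}

// traceExporter stands in for the collector's exporter interface.
type traceExporter interface {
    ConsumeTrace(ctx context.Context, t trace) error
    Shutdown(ctx context.Context) error
}

// lbExporter is the single exporter the pipeline is wired to. Internally it
// owns one delegated exporter per backend endpoint and routes every trace to
// exactly one of them, so the pipeline design itself does not change.
type lbExporter struct {
    exporters map[string]traceExporter    // endpoint -> delegated exporter ("exporter #1 ... #N")
    pick      func(traceID []byte) string // routing rule, e.g. a consistent hash ring
}

func (e *lbExporter) ConsumeTrace(ctx context.Context, t trace) error {
    endpoint := e.pick(t.traceID)
    exp, ok := e.exporters[endpoint]
    if !ok {
        return fmt.Errorf("no delegated exporter for endpoint %q", endpoint)
    }
    return exp.ConsumeTrace(ctx, t)
}

func (e *lbExporter) Shutdown(ctx context.Context) error {
    for _, exp := range e.exporters {
        if err := exp.Shutdown(ctx); err != nil {
            return err
        }
    }
    return nil
}

// printExporter is a demo delegate that only prints where a trace would go.
type printExporter struct{ endpoint string }

func (p *printExporter) ConsumeTrace(_ context.Context, t trace) error {
    fmt.Printf("%x -> %s\n", t.traceID, p.endpoint)
    return nil
}

func (p *printExporter) Shutdown(context.Context) error { return nil }

func main() {
    lb := &lbExporter{
        exporters: map[string]traceExporter{
            "backend-1:4317": &printExporter{"backend-1:4317"},
            "backend-2:4317": &printExporter{"backend-2:4317"},
        },
        // Trivial routing rule for the demo: parity of the first trace ID byte.
        pick: func(traceID []byte) string {
            if traceID[0]%2 == 0 {
                return "backend-1:4317"
            }
            return "backend-2:4317"
        },
    }
    _ = lb.ConsumeTrace(context.Background(), trace{traceID: []byte{0x01, 0x02}})
}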

@jpkrohling
Member

jpkrohling commented Sep 4, 2020

@bogdandrutu how would exporter #1 know which host to send the data to?

@joe-elliott
Contributor Author

joe-elliott commented Sep 4, 2020

Our needs are exactly in line with what @jpkrohling is describing: dynamic discovery of backends and maintaining a hash ring for minimal disruption during scaling events. I hadn't considered, but really like, the idea of relying on the groupbytrace processor to reduce fractured traces when scaling events do occur.

synchronization of the ring becomes another problem entirely

Agreed, this needs some attention. We are seeing success using both gossip and external key-value stores to maintain a ring, depending on the durability requirements.

@bogdandrutu
Member

bogdandrutu commented Sep 7, 2020

@jpkrohling "exporter # 1" is an instance of an exporter (otlp, etc.) created by the lb exporter. The lb exporter is a composed exporter which does the handling of the config changes and creates / deletes instances of real exporters as well as routes requests to the specific instance.

So in the pipeline you connect only the lb exporter, which then routes to the right instance, "exporter #X", which is just a simple exporter like today.

@jpkrohling
Member

The lb exporter is a composed exporter that handles config changes, creates/deletes instances of real exporters, and routes requests to the specific instance.

When the list of backend collectors changes, what happens? I'm trying to wrap my head around your proposal.

@bogdandrutu
Member

It depends on how we want to implement the "configuration watch":

  1. If the LB is configured to retrieve this configuration from an HTTP endpoint, for example, then the overall collector config/pipeline does not change, so the LB composite exporter should shut down old instances and start new instances based on the config.
  2. If the LB has a static config with all the endpoints, then somehow the collector config needs to be updated. If that is the case, we should rely on the collector mechanism to watch for config changes, which will shut down all the components and restart new components with the new config.

I think you're trying to implement the first option, but I gave you both options to make sure we are on the same page.

@jpkrohling
Member

In a first implementation, the list of available backends should come from DNS ("A" records), as is usual for this kind of workload, especially when considering cloud-native platforms like Kubernetes. This is also what gRPC client-side load balancing does. A stretch goal would be to also support reading it from etcd, which provides a few benefits covering the intricacies of distributed systems.
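
As a concrete illustration of the DNS option, a resolver could look roughly like this. This is a sketch using only the standard library, with a hypothetical headless-service hostname and port; it is not the eventual resolver implementation.

package main

import (
    "context"
    "fmt"
    "net"
    "sort"
    "time"
)

// resolveBackends looks up the A/AAAA records for the given hostname (for
// example, a Kubernetes headless service) and returns one backend endpoint
// per resolved address.
func resolveBackends(ctx context.Context, hostname string, port int) ([]string, error) {
    addrs, err := net.DefaultResolver.LookupIPAddr(ctx, hostname)
    if err != nil {
        return nil, err
    }
    backends := make([]string, 0, len(addrs))
    for _, a := range addrs {
        backends = append(backends, fmt.Sprintf("%s:%d", a.IP.String(), port))
    }
    sort.Strings(backends) // stable order makes change detection easier
    return backends, nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    backends, err := resolveBackends(ctx, "otelcol-backends.observability.svc.cluster.local", 4317)
    if err != nil {
        fmt.Println("resolution failed:", err)
        return
    }
    fmt.Println(backends)
}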

then the overall collector config/pipeline does not change, so the LB composite exporter should shut down old instances and start new instances based on the config

This is what I'm not getting. Did you mean that the whole collector's configuration would come from an HTTP endpoint, or just the list of backends? How would the collector's configuration look, in your opinion?

@bogdandrutu
Member

exporters:
  lb:
    config_endpoint: dns_address
    exporter_type: "otlp" # we can support OTLP/Jaeger/etc.
    exporter_config:
      - some specific config to otlp except endpoint

The config_endpoint returns something like trace_id_range -> endpoint. When a config is received, the lb exporter will create one otlp exporter instance per endpoint in the returned config and will route based on the trace_id_range.
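
To illustrate the routing step only (invented names, and assuming the config endpoint hands back explicit, non-overlapping ranges rather than the exporter computing a ring itself), picking the endpoint for a trace ID could be a simple binary search over the configured ranges:

package main

import (
    "encoding/binary"
    "fmt"
    "sort"
)

// rangeAssignment is one entry of the configuration returned by the
// hypothetical config_endpoint: every trace ID whose leading 8 bytes fall in
// [start, end] is owned by endpoint.
type rangeAssignment struct {
    start, end uint64
    endpoint   string
}

type rangeRouter struct {
    ranges []rangeAssignment // sorted by start, non-overlapping
}

func newRangeRouter(ranges []rangeAssignment) *rangeRouter {
    sorted := append([]rangeAssignment(nil), ranges...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })
    return &rangeRouter{ranges: sorted}
}

// endpointFor maps a 16-byte trace ID to its owning endpoint by comparing the
// ID's leading 8 bytes against the configured ranges.
func (r *rangeRouter) endpointFor(traceID []byte) (string, error) {
    key := binary.BigEndian.Uint64(traceID[:8])
    // Find the first range starting after key, then step back one.
    i := sort.Search(len(r.ranges), func(i int) bool { return r.ranges[i].start > key })
    if i == 0 {
        return "", fmt.Errorf("trace ID below first configured range")
    }
    ra := r.ranges[i-1]
    if key > ra.end {
        return "", fmt.Errorf("trace ID not covered by any range")
    }
    return ra.endpoint, nil
}

func main() {
    router := newRangeRouter([]rangeAssignment{
        {start: 0x0000000000000000, end: 0x7fffffffffffffff, endpoint: "backend-1:4317"},
        {start: 0x8000000000000000, end: 0xffffffffffffffff, endpoint: "backend-2:4317"},
    })
    traceID := make([]byte, 16)
    traceID[0] = 0x9a // falls into the second range
    ep, err := router.endpointFor(traceID)
    fmt.Println(ep, err)
}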

@jpkrohling
Member

That means there's another process, somewhere, calculating and rebalancing which backends receive each trace ID range? We'd still have the leader election/data sync problem that we have with DNS... A "batteries included" approach would probably make users' lives easier, in my opinion.

If I understood your proposal correctly, the load balancer exporter would have to know details about the underlying exporters, as the "endpoint" is configured differently depending on the exporter. See the Kafka exporter, for example.

I think we should really do the easy solution, at least for the first implementation:

  1. assume there's only one lb instance, so that we don't have any need to do leader election/data sync between instances. Once we hear from users and see how this looks in production, we can implement a solution based on the real problems we see
  2. export only to OTLP, with the assumption that the second-level backends are OpenTelemetry-compatible backends
  3. get the list of backends from DNS and rebuild the ring whenever that list changes

Once we have a solution out there, let's see what people need. If they need to recalculate the ring by themselves, we can add support for the endpoint configuration.
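
A rough sketch of point 3 under those assumptions, with invented names and placeholder types instead of the real collector/exporter interfaces: whenever the resolver reports a new backend list, start exporters for new endpoints, shut down exporters for endpoints that disappeared, and rebuild the ring from the new list.

package main

import (
    "context"
    "sort"
)

// traceExporter and exporterFactory are placeholders for the real OTLP
// exporter type and factory, which are elided to keep the sketch short.
type traceExporter interface{ Shutdown(ctx context.Context) error }

type exporterFactory func(endpoint string) (traceExporter, error)

// hashRing here is only a stub; a real implementation would be a consistent
// hash ring like the one sketched earlier in this thread.
type hashRing struct{ endpoints []string }

func newHashRing(endpoints []string) *hashRing {
    sorted := append([]string(nil), endpoints...)
    sort.Strings(sorted)
    return &hashRing{endpoints: sorted}
}

type loadBalancer struct {
    newExporter exporterFactory
    exporters   map[string]traceExporter // endpoint -> delegated exporter
    ring        *hashRing
}

// onBackendsChanged is called by the resolver (DNS, static, ...) with the
// freshly resolved endpoint list.
func (lb *loadBalancer) onBackendsChanged(ctx context.Context, endpoints []string) error {
    seen := make(map[string]bool, len(endpoints))

    // New endpoints: create a delegated exporter for each.
    for _, ep := range endpoints {
        seen[ep] = true
        if _, exists := lb.exporters[ep]; !exists {
            exp, err := lb.newExporter(ep)
            if err != nil {
                return err
            }
            lb.exporters[ep] = exp
        }
    }

    // Removed endpoints: shut their exporters down and forget them.
    for ep, exp := range lb.exporters {
        if !seen[ep] {
            if err := exp.Shutdown(ctx); err != nil {
                return err
            }
            delete(lb.exporters, ep)
        }
    }

    // Replace the ring with one built from the new list (locking elided).
    lb.ring = newHashRing(endpoints)
    return nil
}

type noopExporter struct{}

func (noopExporter) Shutdown(context.Context) error { return nil }

func main() {
    lb := &loadBalancer{
        newExporter: func(endpoint string) (traceExporter, error) { return noopExporter{}, nil },
        exporters:   map[string]traceExporter{},
    }
    _ = lb.onBackendsChanged(context.Background(), []string{"backend-1:4317", "backend-2:4317"})
}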

@jpkrohling
Member

@joe-elliott, @bogdandrutu would you be open to reviewing a proposal as a PR with the ideas that I have in mind? It might be easier to discuss by looking at concrete code.

@jpkrohling
Member

jpkrohling commented Oct 7, 2020

I'm still working on this one, but I have some code to show already: https://github.com/jpkrohling/opentelemetry-collector-contrib/tree/jpkrohling/1724-loadbalancing-processor

Each commit in this branch is a step in the direction of the full solution, and I believe it's easier to check them individually.

I'm still missing the actual DNS resolver logic and a full end-to-end test, but I think it already has the shape of what we discussed here.

The code is pretty much complete for a first draft, but I have yet to run some e2e tests.

@jpkrohling
Member

Opened open-telemetry/opentelemetry-collector-contrib#1231 with the proposal.

@jpkrohling
Member

@bogdandrutu, @joe-elliott, how can we move forward with this? I realize that the PR/branch with the proposal looks scary, but the core is this:

During the bootstrap, a list of servers is obtained via a "resolver" (DNS or static):
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L104-L106

Whenever there's a backend change (new/removed), the ring is replaced:
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L111-L128

When a new backend is detected, a new exporter is created and added to a map[endpoint]component.TraceExporter:
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L134-L152

When a new batch arrives, the backend for that trace ID is selected and the batch is sent:
https://github.com/jpkrohling/opentelemetry-collector-contrib/blob/61f3be21926c980ed7eb82b6ab030aa606a5e573/exporter/loadbalancingexporter/exporter.go#L181-L207

@joe-elliott
Contributor Author

I don't know what the bar for merging to contrib is, but I'm confident that @jpkrohling has something worth merging if we're talking about it. Grafana is swamped right now preparing for an upcoming conference, but once things settle, if this is in contrib, we will experiment with it and give feedback.

@annanay25

tigrannajaryan pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Nov 10, 2020

* Added the backend resolver
* Added the metrics definitions

**Link to tracking Issue:** Partially solves open-telemetry/opentelemetry-collector#1724, next step after #1349

**Testing:** unit tests

**Documentation:** godoc

Signed-off-by: Juraci Paixão Kröhling <[email protected]>
@andrewhsu added the enhancement (New feature or request) label on Jan 6, 2021
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021: Bump google.golang.org/protobuf from 1.25.0 to 1.26.0 in /exporters/otlp (open-telemetry#1724)