[processor/tailsamplingprocessor] config allows duplicate policy names resulting in metrics collision #26726

jmsnll · 2023-09-18T10:50:26Z

Component(s)

processor/tailsampling

What happened?

Description

The name of each sampling policy is attached as an attribute to the count_traces_sampled metric, which records how many traces each policy is responsible for sampling.

If a user provides two policies with identical names, a collision occurs: the counts for the two policies are combined and reported under a single policy attribute value.

Steps to Reproduce

Given any two or more policies with identical names:

    policies:
      - name: policy-a
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: policy-a
        type: probabilistic
        probabilistic:
          sampling_percentage: 25

The count_traces_sampled metric will report a combined count of both policies:

otelcol_processor_tail_sampling_count_traces_sampled{policy="policy-a",sampled="false"} 249
otelcol_processor_tail_sampling_count_traces_sampled{policy="policy-a",sampled="true"} 151
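To illustrate why the two counts merge, here is a simplified model of a counter keyed by the same two attributes that appear in the output above. This is only a sketch: `seriesKey` and `countDecisions` are hypothetical names, and the processor's actual metrics code is not shown in this issue.

```go
package main

import "fmt"

// seriesKey mirrors the two attributes on the metric (policy, sampled).
// Hypothetical type, used only for this illustration.
type seriesKey struct {
	policy  string
	sampled bool
}

// countDecisions tallies decisions per (policy name, sampled) pair.
// A policy is identified only by its name, so two policies that share
// a name also share a counter.
func countDecisions(names []string, decisions [][]bool) map[seriesKey]int {
	counts := make(map[seriesKey]int)
	for i, name := range names {
		for _, sampled := range decisions[i] {
			counts[seriesKey{policy: name, sampled: sampled}]++
		}
	}
	return counts
}

func main() {
	names := []string{"policy-a", "policy-a"}
	decisions := [][]bool{
		{true, false, false}, // decisions from the status_code policy
		{true, true, false},  // decisions from the probabilistic policy
	}
	counts := countDecisions(names, decisions)
	// Only two series exist, even though two policies were evaluated.
	fmt.Println(len(counts))
	fmt.Println(counts[seriesKey{policy: "policy-a", sampled: true}])
	fmt.Println(counts[seriesKey{policy: "policy-a", sampled: false}])
}
```

With distinct names the same input would yield four series, one pair per policy.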

Suggested fix

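// Validate rejects configurations in which two top-level policies share a name.
func (c *Config) Validate() error {
	policyNames := make(map[string]struct{}, len(c.PolicyCfgs))

	// Check that each policy is uniquely named to ensure the accuracy of the `count_traces_sampled` metric.
	// Sub-policies contained within `and` or `composite` policies do not need to be checked.
	for _, p := range c.PolicyCfgs {
		if _, exists := policyNames[p.Name]; exists {
			// Include the offending name so the user can find it in the config.
			return fmt.Errorf("sampling policies must have unique names: %q is used more than once", p.Name)
		}
		policyNames[p.Name] = struct{}{}
	}

	return nil
}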
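To try the check against the configuration from the reproduction steps, here is a self-contained sketch. `PolicyCfg` and `Config` are pared-down stand-ins for the processor's configuration types (only the fields needed here), and the error wording is an assumption rather than what was eventually merged.

```go
package main

import "fmt"

// PolicyCfg and Config are simplified stand-ins for the processor's
// config types; only the fields needed for the check are included.
type PolicyCfg struct {
	Name string
	Type string
}

type Config struct {
	PolicyCfgs []PolicyCfg
}

// Validate returns an error as soon as a policy name is seen twice.
func (c *Config) Validate() error {
	seen := make(map[string]struct{}, len(c.PolicyCfgs))
	for _, p := range c.PolicyCfgs {
		if _, exists := seen[p.Name]; exists {
			return fmt.Errorf("sampling policies must have unique names: %q is used more than once", p.Name)
		}
		seen[p.Name] = struct{}{}
	}
	return nil
}

func main() {
	// Mirrors the two policies from the reproduction steps.
	dup := &Config{PolicyCfgs: []PolicyCfg{
		{Name: "policy-a", Type: "status_code"},
		{Name: "policy-a", Type: "probabilistic"},
	}}
	fmt.Println("duplicate names:", dup.Validate())

	uniq := &Config{PolicyCfgs: []PolicyCfg{
		{Name: "policy-a", Type: "status_code"},
		{Name: "policy-b", Type: "probabilistic"},
	}}
	fmt.Println("unique names:", uniq.Validate())
}
```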

Expected Result

Config validation checks that each sampling policy's name is unique and, if a duplicate is found, returns an error that prevents the collector from starting.

Actual Result

The collector starts despite the duplicate policy names.

Collector version

v0.85.0

Environment information

Environment

OS: macOS Ventura 13.5.2 (22G91)
Compiler (if manually compiled): go version go1.21.1 darwin/arm64

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: policy-a
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: policy-a
        type: probabilistic
        probabilistic:
          sampling_percentage: 25
exporters:
  logging:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [logging]

Log output

No response

Additional context

No response


github-actions · 2023-09-18T10:50:46Z

Pinging code owners:

processor/tailsampling: @jpkrohling

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jmsnll · 2023-09-18T11:08:33Z

Although the sub-policies in composite policies aren't used for the count_traces_sampled metric, I'm realising the names still do need to be unique due to how rate allocations are assigned.
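A rough sketch of what extending the check to composite sub-policies might look like follows. The types are hypothetical stand-ins, not the processor's real structs; the point is only that rate allocations refer to sub-policies by name, so a repeated name makes the reference ambiguous.

```go
package main

import "fmt"

// SubPolicy, RateAllocation and Composite are hypothetical stand-ins
// for the composite policy configuration.
type SubPolicy struct {
	Name string
}

type RateAllocation struct {
	Policy  string // name of the sub-policy this allocation applies to
	Percent int64
}

type Composite struct {
	SubPolicies     []SubPolicy
	RateAllocations []RateAllocation
}

// validateComposite rejects a composite policy whose sub-policies share
// a name, because an allocation could then match more than one of them.
func validateComposite(c Composite) error {
	seen := make(map[string]struct{}, len(c.SubPolicies))
	for _, sp := range c.SubPolicies {
		if _, dup := seen[sp.Name]; dup {
			return fmt.Errorf("composite sub-policy name %q is not unique", sp.Name)
		}
		seen[sp.Name] = struct{}{}
	}
	return nil
}

func main() {
	c := Composite{
		SubPolicies: []SubPolicy{{Name: "errors"}, {Name: "errors"}},
		// It is unclear which "errors" sub-policy this allocation targets.
		RateAllocations: []RateAllocation{{Policy: "errors", Percent: 50}},
	}
	fmt.Println(validateComposite(c))
}
```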

bryan-aguilar · 2023-09-18T16:09:33Z

Enforcing policy names to be unique sounds like a reasonable solution to this. I think we would need to gradually roll this out with a feature gate because it could be a breaking change for existing configurations.
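One way to picture a gradual rollout is sketched below, with a plain boolean standing in for the feature gate. `checkPolicyNames`, `findDuplicate`, and the `enforceUnique` flag are hypothetical; a real implementation would presumably register a gate through the collector's featuregate package instead.

```go
package main

import (
	"errors"
	"fmt"
)

// findDuplicate returns the first policy name that appears more than once.
func findDuplicate(names []string) (string, bool) {
	seen := make(map[string]struct{}, len(names))
	for _, n := range names {
		if _, ok := seen[n]; ok {
			return n, true
		}
		seen[n] = struct{}{}
	}
	return "", false
}

// checkPolicyNames returns only a warning while enforcement is off, so
// existing configurations keep working, and an error once it is on.
// enforceUnique is a stand-in for a feature gate (assumption).
func checkPolicyNames(names []string, enforceUnique bool) (warning string, err error) {
	dup, found := findDuplicate(names)
	if !found {
		return "", nil
	}
	msg := fmt.Sprintf("sampling policy name %q is used more than once", dup)
	if enforceUnique {
		return "", errors.New(msg)
	}
	return msg + "; this may become a configuration error once enforcement is enabled", nil
}

func main() {
	names := []string{"policy-a", "policy-a"}

	warning, err := checkPolicyNames(names, false)
	fmt.Println("gate off:", warning, err)

	warning, err = checkPolicyNames(names, true)
	fmt.Println("gate on:", warning, err)
}
```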

bryan-aguilar · 2023-09-18T16:11:05Z

@jpkrohling what do you think? Could there be another option that is less impactful to existing configurations?

jmsnll · 2023-09-18T17:05:13Z

At the time, the only alternative solution I could think of was to append a suffix to duplicate policy names (determined by the policy's position in the list) when the configuration is initially loaded into memory. So for the following:

processors:
  tail_sampling:
    policies:
      - name: colliding-policy
        ...
      - name: rate-limit
        ...
      - name: error-span-filter
        ...
      - name: colliding-policy
        ...

You could choose from either of these approaches:

Auto-incrementing: The first occurrence of colliding-policy would become colliding-policy-1, the second colliding-policy-2, and so on.
Policy Index: Alternatively, you could use the policy's index instead of tracking a specific count for each name. For example, colliding-policy-0 for the first occurrence and colliding-policy-3 for the second.

Existing configurations would still work, although with a reduced emphasis on correctness. One potential drawback I can see to this approach is that if users were to add yet another policy with the same name to an existing configuration, the exported names of policies could potentially change depending on where they position the new policy in the list. But I'd hope at that stage they would fix the issue of having duplicate names.
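The two renaming schemes could be sketched roughly as follows. `renameAutoIncrement` and `renameByIndex` are hypothetical helper names, and the choice to leave unique names untouched is an assumption based on the example above.

```go
package main

import "fmt"

// countNames returns how many times each name occurs in the list.
func countNames(names []string) map[string]int {
	total := make(map[string]int, len(names))
	for _, n := range names {
		total[n]++
	}
	return total
}

// renameAutoIncrement suffixes each occurrence of a duplicated name with
// a per-name counter starting at 1. Unique names are left unchanged.
func renameAutoIncrement(names []string) []string {
	total := countNames(names)
	seen := make(map[string]int, len(names))
	out := make([]string, len(names))
	for i, n := range names {
		if total[n] < 2 {
			out[i] = n
			continue
		}
		seen[n]++
		out[i] = fmt.Sprintf("%s-%d", n, seen[n])
	}
	return out
}

// renameByIndex suffixes each occurrence of a duplicated name with the
// policy's zero-based position in the list.
func renameByIndex(names []string) []string {
	total := countNames(names)
	out := make([]string, len(names))
	for i, n := range names {
		if total[n] < 2 {
			out[i] = n
			continue
		}
		out[i] = fmt.Sprintf("%s-%d", n, i)
	}
	return out
}

func main() {
	names := []string{"colliding-policy", "rate-limit", "error-span-filter", "colliding-policy"}
	fmt.Println(renameAutoIncrement(names))
	fmt.Println(renameByIndex(names))
}
```

Neither helper guards against a generated name clashing with a name the user already chose (for example, an existing policy literally called colliding-policy-1), so a real implementation would need to handle that case as well.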

jmsnll · 2023-09-20T08:50:09Z

Marking as closed as covered by #27016

jmsnll added bug Something isn't working needs triage New item requiring triage labels Sep 18, 2023

github-actions bot added the processor/tailsampling Tail sampling processor label Sep 18, 2023

bryan-aguilar added good first issue Good for newcomers priority:p2 Medium and removed needs triage New item requiring triage labels Sep 18, 2023

github-actions bot mentioned this issue Sep 19, 2023

Weekly Report: 2023-09-12 - 2023-09-19 kevinslin/opentelemetry-collector-contrib#26

Open

jmsnll closed this as completed Sep 20, 2023
