Prevent setting a probabilistic decision maker for OTLP spans sampled by the Error Sampler #33586

keisku · 2025-01-30T15:42:15Z

What does this PR do?

When the Error Sampler keeps a trace, do not set either _dd.p.dm:-4 or _dd.p.dm:-9, so that ingestion_reason:probabilistic is not applied to OTLP error spans.

Motivation

OTLP error spans currently always get ingestion_reason:probabilistic. They should get ingestion_reason:error when the Error Sampler is the sampler that keeps them.

Describe how you validated your changes

For testing, I sent OTLP error spans from a Python app using the OTel SDK every 10 milliseconds.

After this PR, the ingestion reason changes from ingestion_reason:probabilistic to ingestion_reason:error.

I tested two patterns:

Pattern 1: Set DD_OTLP_CONFIG_TRACES_PROBABILISTIC_SAMPLER_SAMPLING_PERCENTAGE=1 so that the Error Sampler is exercised easily.
Pattern 2: Set DD_APM_PROBABILISTIC_SAMPLER_ENABLED=true and DD_APM_PROBABILISTIC_SAMPLER_PERCENTAGE=0.

docker-compose.yaml and Python app

services:
  agent:
    container_name: agent
    # Before
    # image: datadog/agent:7.62.0
    # After
    image: datadog/agent-dev:keisku-apms-14685-error-sampler-py3@sha256:2f7302ccb7c92484f0b57a590793eb85bc02e59b3029560709624cafbc664247
    volumes:
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/cloud/data/instance-id:/var/lib/cloud/data/instance-id:ro
    pid: host
    environment:
      - DD_API_KEY
      - DD_HOSTNAME_FILE=/var/lib/cloud/data/instance-id
      - DD_ENV=docker-keisuke-ubuntu
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
      - DD_OTLP_CONFIG_DEBUG_VERBOSITY=detailed
      # Pattern 1
      - DD_OTLP_CONFIG_TRACES_PROBABILISTIC_SAMPLER_SAMPLING_PERCENTAGE=1
      # Pattern 2
      # - DD_APM_PROBABILISTIC_SAMPLER_ENABLED=true
      # - DD_APM_PROBABILISTIC_SAMPLER_PERCENTAGE=0
  error-generator-py:
    container_name: error-generator-py
    build:
      context: ./python
      dockerfile: Dockerfile
    restart: always
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://agent:4317
      - OTEL_SERVICE_NAME=python-opentelemetry-error-generator
      - ERROR_GENERATOR_INTERVAL=10

FROM python:3.13-slim
RUN pip install opentelemetry-api opentelemetry-exporter-otlp opentelemetry-sdk
COPY --chmod=755 <<app.py /src/
import os
import time

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import Status, StatusCode

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

otlp_exporter = OTLPSpanExporter(insecure=True)

span_processor = BatchSpanProcessor(otlp_exporter)

trace.get_tracer_provider().add_span_processor(span_processor)

def generate_error():
    with tracer.start_as_current_span("generate_error") as span:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(Exception("This is an intentional error"))

interval = int(os.getenv("ERROR_GENERATOR_INTERVAL", 1000))

while True:
    generate_error()
    time.sleep(interval / 1000)
app.py
CMD ["python", "/src/app.py"]

span in JSON before PR

{
  "trace": {
    "root_id": "7981140922446959578",
    "spans": {
      "7981140922446959578": {
        "trace_id": "a78dd7365a74854ba7d4dd9c8fdef5ec",
        "span_id": "7981140922446959578",
        "parent_id": "0",
        "start": 1738296239.08292,
        "end": 1738296239.08297,
        "duration": 0.000050126,
        "error": 1,
        "status": "error",
        "type": "custom",
        "service": "python-opentelemetry-error-generator",
        "name": "main.internal",
        "resource": "generate_error",
        "resource_hash": "745adb92501bc84b",
        "meta": {
          "_dd.agent_hostname": "i-09df1e4674d40f561",
          "_dd.agent_rare_sampler.enabled": "false",
          "_dd.agent_version": "7.62.0",
          "_dd.error_tracking.fingerprints.stable.materials": "[{\"name\":\"SERVICE\",\"location\":{\"source\":\"EVENT\",\"path\":\"service\",\"ranges\":[[0,36]]}},{\"name\":\"ERROR_TYPE\",\"location\":{\"source\":\"EVENT\",\"path\":\"meta.error.type\",\"ranges\":[[0,9]]}},{\"name\":\"ERROR_MESSAGE\",\"location\":{\"source\":\"EVENT\",\"path\":\"meta.error.message\",\"ranges\":[[0,4],[5,7],[8,10],[11,22],[23,28]]}}]",
          "_dd.error_tracking.fingerprints.stable.source": "datadog",
          "_dd.error_tracking.fingerprints.stable.value": "88E7626C53C8576C39EE63B27FE21CAF",
          "_dd.error_tracking.fingerprints.stable.version": "10",
          "_dd.filter.id": "7LSJkDrRRue_7dNexVP2hw",
          "_dd.filter.type": "spans-errors-sampling-processor",
          "_dd.hostname": "i-09df1e4674d40f561",
          "_dd.issue.muted": "false",
          "_dd.issue.state": "OPEN",
          "_dd.language": "python",
          "_dd.p.dm": "-9",
          "_dd.p.ftid": "a78dd7365a74854ba7d4dd9c8fdef5ec",
          "_dd.p.tid": "a78dd7365a74854b",
          "_dd.span_events.has_exception": "true",
          "_dd.tracer_version": "otlp-1.29.0",
          "ddtags": "ingestion_reason:probabilistic",
          "error.fingerprint": "v10.88E7626C53C8576C39EE63B27FE21CAF",
          "error.message": "This is an intentional error",
          "error.stack": "Exception: This is an intentional error\n",
          "error.type": "Exception",
          "events": "[{\"time_unix_nano\":1738296239082965988,\"name\":\"exception\",\"attributes\":{\"exception.type\":\"Exception\",\"exception.message\":\"This is an intentional error\",\"exception.stacktrace\":\"Exception: This is an intentional error\\n\",\"exception.escaped\":\"False\"}}]",
          "issue.first_seen_version": "",
          "issue.id": "8e9acffe-ded3-11ef-81c0-da7ad0900002",
          "language": "python",
          "otel.library.name": "__main__",
          "otel.status_code": "Error",
          "otel.trace_id": "a78dd7365a74854ba7d4dd9c8fdef5ec",
          "span.kind": "internal",
          "telemetry.sdk.language": "python",
          "telemetry.sdk.name": "opentelemetry",
          "telemetry.sdk.version": "1.29.0"
        },
        "metrics": {
          "_dd.agent_errors_sampler.target_tps": 10,
          "_dd.agent_priority_sampler.target_tps": 10,
          "_dd.otlp_sr": 0.01,
          "_dd1.sr.esusr": 0.01,
          "_dd1.sr.esusr_trace": 0,
          "_sampling_priority_rate_v1": 0.101626016260163,
          "_sampling_priority_v1": 1,
          "_top_level": 1,
          "_trace_root": 1,
          "issue.age": 77683222,
          "issue.first_seen": 1738218559999
        },
        "host_id": 29538221517,
        "host_groups": [],
        "hostname": "i-09df1e4674d40f561-7591",
        "env": "docker-keisuke-ubuntu",
        "metadata": {
          "sds_info": []
        },
        "span_events": [
          {
            "name": "exception",
            "time_unix_nano": 1.738296239082966e+18,
            "attributes": {
              "exception.escaped": "False",
              "exception.message": "This is an intentional error",
              "exception.stacktrace": "Exception: This is an intentional error\n",
              "exception.type": "Exception"
            }
          }
        ],
        "ingestion_reason": "probabilistic",
        "children_ids": []
      }
    }
  },
  "orphaned": [],
  "is_truncated": false,
  "is_summary": false
}

span in JSON after PR

{
  "trace": {
    "root_id": "15910729098197527177",
    "spans": {
      "15910729098197527177": {
        "trace_id": "26e30be60330e6686d5e1c99f0ebf847",
        "span_id": "15910729098197527177",
        "parent_id": "0",
        "start": 1738297238.90397,
        "end": 1738297238.904,
        "duration": 0.000034457,
        "error": 1,
        "status": "error",
        "type": "custom",
        "service": "python-opentelemetry-error-generator",
        "name": "main.internal",
        "resource": "generate_error",
        "resource_hash": "745adb92501bc84b",
        "meta": {
          "_dd.agent_hostname": "i-09df1e4674d40f561",
          "_dd.agent_rare_sampler.enabled": "false",
          "_dd.agent_version": "7.64.0-devel+git.149.8974346",
          "_dd.error_tracking.fingerprints.stable.materials": "[{\"name\":\"SERVICE\",\"location\":{\"source\":\"EVENT\",\"path\":\"service\",\"ranges\":[[0,36]]}},{\"name\":\"ERROR_TYPE\",\"location\":{\"source\":\"EVENT\",\"path\":\"meta.error.type\",\"ranges\":[[0,9]]}},{\"name\":\"ERROR_MESSAGE\",\"location\":{\"source\":\"EVENT\",\"path\":\"meta.error.message\",\"ranges\":[[0,4],[5,7],[8,10],[11,22],[23,28]]}}]",
          "_dd.error_tracking.fingerprints.stable.source": "datadog",
          "_dd.error_tracking.fingerprints.stable.value": "88E7626C53C8576C39EE63B27FE21CAF",
          "_dd.error_tracking.fingerprints.stable.version": "10",
          "_dd.filter.id": "7LSJkDrRRue_7dNexVP2hw",
          "_dd.filter.type": "spans-errors-sampling-processor",
          "_dd.hostname": "i-09df1e4674d40f561",
          "_dd.issue.muted": "false",
          "_dd.issue.state": "OPEN",
          "_dd.language": "python",
          "_dd.p.ftid": "26e30be60330e6686d5e1c99f0ebf847",
          "_dd.p.tid": "26e30be60330e668",
          "_dd.span_events.has_exception": "true",
          "_dd.tracer_version": "otlp-1.29.0",
          "ddtags": "ingestion_reason:error",
          "error.fingerprint": "v10.88E7626C53C8576C39EE63B27FE21CAF",
          "error.message": "This is an intentional error",
          "error.stack": "Exception: This is an intentional error\n",
          "error.type": "exception",
          "events": "[{\"time_unix_nano\":1738297238903998680,\"name\":\"exception\",\"attributes\":{\"exception.type\":\"Exception\",\"exception.message\":\"This is an intentional error\",\"exception.stacktrace\":\"Exception: This is an intentional error\\n\",\"exception.escaped\":\"False\"}}]",
          "issue.first_seen_version": "",
          "issue.id": "8e9acffe-ded3-11ef-81c0-da7ad0900002",
          "language": "python",
          "otel.library.name": "__main__",
          "otel.status_code": "Error",
          "otel.trace_id": "26e30be60330e6686d5e1c99f0ebf847",
          "span.kind": "internal",
          "telemetry.sdk.language": "python",
          "telemetry.sdk.name": "opentelemetry",
          "telemetry.sdk.version": "1.29.0"
        },
        "metrics": {
          "_dd.agent_errors_sampler.target_tps": 10,
          "_dd.agent_priority_sampler.target_tps": 10,
          "_dd.errors_sr": 0.0512820512820513,
          "_dd.otlp_sr": 0.01,
          "_dd1.sr.esusr": 0.01,
          "_dd1.sr.esusr_trace": 0,
          "_sampling_priority_v1": 0,
          "_top_level": 1,
          "_trace_root": 1,
          "issue.age": 78682339,
          "issue.first_seen": 1738218559999
        },
        "host_id": 29538221517,
        "host_groups": [],
        "hostname": "i-09df1e4674d40f561-7591",
        "env": "docker-keisuke-ubuntu",
        "metadata": {
          "sds_info": []
        },
        "span_events": [
          {
            "name": "exception",
            "time_unix_nano": 1.7382972389039987e+18,
            "attributes": {
              "exception.escaped": "False",
              "exception.message": "This is an intentional error",
              "exception.stacktrace": "Exception: This is an intentional error\n",
              "exception.type": "Exception"
            }
          }
        ],
        "ingestion_reason": "error",
        "children_ids": []
      }
    }
  },
  "orphaned": [],
  "is_truncated": false,
  "is_summary": false
}
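The difference between the two documents can be checked mechanically. The following is a minimal sketch, assuming a span exported as JSON in the same shape as the two samples above. `summarize_sampling` and `kept_by_error_sampler` are hypothetical helpers written for illustration and are not part of the Agent or of any Datadog tooling. `BEFORE` and `AFTER` are trimmed copies of the samples that keep only the sampling-related fields.

```python
def summarize_sampling(doc: dict) -> dict:
    """Extract the sampling-related fields of the root span.

    Assumes the layout shown above: doc["trace"]["root_id"] names the
    root span inside doc["trace"]["spans"].
    """
    trace = doc["trace"]
    root = trace["spans"][trace["root_id"]]
    return {
        # "_dd.p.dm" is present in the "before" sample and absent in the "after" sample.
        "decision_maker": root["meta"].get("_dd.p.dm"),
        "ingestion_reason": root.get("ingestion_reason"),
        # "_dd.errors_sr" appears only in the "after" sample.
        "errors_sample_rate": root["metrics"].get("_dd.errors_sr"),
    }


def kept_by_error_sampler(doc: dict) -> bool:
    """True when the span carries no decision maker and is tagged as an error ingestion."""
    summary = summarize_sampling(doc)
    return summary["decision_maker"] is None and summary["ingestion_reason"] == "error"


# Trimmed copies of the two samples above (sampling-related fields only).
BEFORE = {
    "trace": {
        "root_id": "7981140922446959578",
        "spans": {
            "7981140922446959578": {
                "meta": {"_dd.p.dm": "-9"},
                "metrics": {"_dd.otlp_sr": 0.01},
                "ingestion_reason": "probabilistic",
            }
        },
    }
}
AFTER = {
    "trace": {
        "root_id": "15910729098197527177",
        "spans": {
            "15910729098197527177": {
                "meta": {},
                "metrics": {"_dd.errors_sr": 0.0512820512820513, "_dd.otlp_sr": 0.01},
                "ingestion_reason": "error",
            }
        },
    }
}

print(kept_by_error_sampler(BEFORE), kept_by_error_sampler(AFTER))
```

A check like this only confirms what the exported document says; it does not prove which sampler ran inside the Agent.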

Possible Drawbacks / Trade-offs

Additional Notes

Pattern 1

datadog-agent/pkg/trace/agent/agent.go

Lines 690 to 700 in 91794f6

    
	if hasPriority {
		if a.PrioritySampler.Sample(now, pt.TraceChunk, pt.Root, pt.TracerEnv, pt.ClientDroppedP0sWeight) {
			return true, true
		}
	} else if a.NoPrioritySampler.Sample(now, pt.TraceChunk.Spans, pt.Root, pt.TracerEnv) {
		return true, true
	}

	if traceContainsError(pt.TraceChunk.Spans, false) {
		return a.ErrorsSampler.Sample(now, pt.TraceChunk.Spans, pt.Root, pt.TracerEnv), true
	}

Pattern 2

datadog-agent/pkg/trace/agent/agent.go

Lines 653 to 665 in 91794f6

    
	if a.conf.ProbabilisticSamplerEnabled {
		if rare {
			return true, true
		}
		if a.ProbabilisticSampler.Sample(pt.Root) {
			pt.TraceChunk.Tags[tagDecisionMaker] = probabilitySampling
			return true, true
		}
		if traceContainsError(pt.TraceChunk.Spans, false) {
			return a.ErrorsSampler.Sample(now, pt.TraceChunk.Spans, pt.Root, pt.TracerEnv), true
		}
		return false, true
	}
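To make the Pattern 2 branch easier to reason about, here is a small Python model of it. This is a sketch under stated assumptions, not the Agent's implementation: the samplers are replaced by plain callables, the second boolean that the Go code returns is dropped, and the constant values `"_dd.p.dm"` and `"-9"` are assumed from the span JSON shown earlier in this description rather than read from the Agent source. `Chunk` and `run_probabilistic_branch` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumed values, inferred from the "before" span JSON ("_dd.p.dm": "-9").
TAG_DECISION_MAKER = "_dd.p.dm"
PROBABILITY_SAMPLING = "-9"


@dataclass
class Chunk:
    """Hypothetical stand-in for a trace chunk: an error flag plus chunk-level tags."""
    has_error: bool
    tags: Dict[str, str] = field(default_factory=dict)


def run_probabilistic_branch(
    chunk: Chunk,
    rare: bool,
    probabilistic_keeps: Callable[[Chunk], bool],
    errors_keeps: Callable[[Chunk], bool],
) -> bool:
    """Model of the ProbabilisticSamplerEnabled branch quoted above."""
    if rare:
        return True
    if probabilistic_keeps(chunk):
        # Only this path writes the probabilistic decision maker.
        chunk.tags[TAG_DECISION_MAKER] = PROBABILITY_SAMPLING
        return True
    if chunk.has_error:
        # The Error Sampler decides here and no decision maker tag is written.
        return errors_keeps(chunk)
    return False


# Pattern 2: the probabilistic sampler at 0% never keeps, so an error trace
# falls through to the Error Sampler (stubbed here to always keep).
error_chunk = Chunk(has_error=True)
error_kept = run_probabilistic_branch(error_chunk, False, lambda c: False, lambda c: True)

# For contrast: a trace kept by the probabilistic sampler gets the tag.
prob_chunk = Chunk(has_error=True)
prob_kept = run_probabilistic_branch(prob_chunk, False, lambda c: True, lambda c: True)

print(error_kept, error_chunk.tags, prob_kept, prob_chunk.tags)
```

In this model, a trace kept by the Error Sampler leaves the branch without a decision maker tag, which is the outcome this PR aims for on the OTLP path as well.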

agent-platform-auto-pr · 2025-01-30T15:43:37Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv aws.create-vm --pipeline-id=54641452 --os-family=ubuntu

Note: This applies to commit 21d947b

agent-platform-auto-pr · 2025-01-30T15:45:32Z

Uncompressed package size comparison

Comparison with ancestor 91794f654cad0d5a4ab1b455dfe626f30d8cb890

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-x86_64-rpm	0.00MB	✅	891.92MB	891.92MB	0.50MB
datadog-agent-x86_64-suse	0.00MB	✅	891.92MB	891.92MB	0.50MB
datadog-agent-aarch64-rpm	0.00MB	✅	879.69MB	879.69MB	0.50MB
datadog-agent-amd64-deb	0.00MB	✅	882.18MB	882.18MB	0.50MB
datadog-agent-arm64-deb	0.00MB	✅	869.97MB	869.97MB	0.50MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	59.02MB	59.02MB	0.50MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	59.10MB	59.10MB	0.50MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	59.10MB	59.10MB	0.50MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	56.50MB	56.50MB	0.50MB
datadog-heroku-agent-amd64-deb	0.00MB	✅	461.46MB	461.46MB	0.50MB
datadog-iot-agent-amd64-deb	0.00MB	✅	93.81MB	93.81MB	0.50MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	93.88MB	93.88MB	0.50MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	93.88MB	93.88MB	0.50MB
datadog-iot-agent-arm64-deb	0.00MB	✅	89.87MB	89.87MB	0.50MB
datadog-iot-agent-aarch64-rpm	0.00MB	✅	89.94MB	89.94MB	0.50MB

Decision

✅ Passed

cit-pr-commenter · 2025-01-30T16:09:19Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 2534eb69-3617-4f39-859d-cccbc9f55010

Baseline: 91794f6
Comparison: 21d947b
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_logs	% cpu utilization	+2.00	[-1.10, +5.09]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+1.48	[+0.60, +2.37]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.33	[-0.45, +1.11]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	+0.30	[+0.24, +0.36]	1	Logs bounds checks dashboard
➖	file_to_blackhole_500ms_latency	egress throughput	+0.12	[-0.66, +0.89]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.01	[-0.87, +0.90]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.00	[-0.64, +0.64]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	-0.00	[-0.78, +0.78]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.29, +0.28]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.02, +0.01]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	-0.02	[-0.84, +0.80]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.02	[-0.74, +0.69]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	-0.16	[-0.64, +0.31]	1	Logs
➖	quality_gate_idle	memory utilization	-0.37	[-0.42, -0.33]	1	Logs bounds checks dashboard
➖	file_tree	memory utilization	-0.77	[-0.84, -0.71]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	-1.01	[-1.09, -0.94]	1	Logs

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
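The three criteria above can be written as a single predicate. This is a minimal sketch for illustration; `is_regression` is a hypothetical function, not the Regression Detector's code, and the sample rows are copied from the experiment table in this comment.

```python
def is_regression(
    delta_mean_pct: float,
    ci_low: float,
    ci_high: float,
    erratic: bool = False,
    tolerance: float = 5.0,
) -> bool:
    """Apply the three criteria: effect size, CI excluding zero, not erratic."""
    big_enough = abs(delta_mean_pct) >= tolerance
    ci_excludes_zero = not (ci_low <= 0.0 <= ci_high)
    return big_enough and ci_excludes_zero and not erratic


# Rows from the table above: (experiment, delta mean %, CI low, CI high).
rows = [
    ("quality_gate_logs", 2.00, -1.10, 5.09),
    ("uds_dogstatsd_to_api_cpu", 1.48, 0.60, 2.37),
    ("tcp_syslog_to_blackhole", -1.01, -1.09, -0.94),
]
flagged = [name for name, delta, low, high in rows if is_regression(delta, low, high)]
print(flagged)
```

Several rows have a confidence interval that excludes zero, but none reaches the 5.00% effect size, which is consistent with the "No significant changes detected" result.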

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.

songy23

Approval for OTel

releasenotes/notes/fix-ingestion-reason-for-otlp-spans-71a29c0ffb8f7ade.yaml

keisku · 2025-02-01T00:46:54Z

pkg/config/config_template.yaml

+      ## If `apm_config.probabilistic_sampler.enabled` is enabled, this config is ignored, `apm_config.probabilistic_sampler.enabled.sampling_percentage`
+      ## is used instead.


This behavior has existed before, but it has now been explicitly documented. cc @ajgajg1134

The ProbabilisticSampler is not affected by the SamplingPriority that results from the OTLPReceiver's probabilistic sampler decision.

datadog-agent/pkg/trace/agent/agent.go

Lines 653 to 665 in 91794f6

	if a.conf.ProbabilisticSamplerEnabled {
		if rare {
			return true, true
		}
		if a.ProbabilisticSampler.Sample(pt.Root) {
			pt.TraceChunk.Tags[tagDecisionMaker] = probabilitySampling
			return true, true
		}
		if traceContainsError(pt.TraceChunk.Spans, false) {
			return a.ErrorsSampler.Sample(now, pt.TraceChunk.Spans, pt.Root, pt.TracerEnv), true
		}
		return false, true
	}

datadog-agent/pkg/trace/sampler/probabilistic.go

Lines 113 to 168 in dcd0631

// Sample a trace given the chunk's root span, returns true if the trace should be kept
func (ps *ProbabilisticSampler) Sample(root *trace.Span) (sampled bool) {
	if !ps.enabled {
		return false
	}

	defer func() {
		ps.metrics.record(sampled, newMetricsKey(root.Service, "", nil))
	}()

	tid := make([]byte, 16)
	var err error
	if !ps.fullTraceIDMode {
		binary.BigEndian.PutUint64(tid, root.TraceID)
	} else {
		tid, err = get128BitTraceID(root)
	}
	if err != nil {
		log.Errorf("Unable to probabilistically sample, failed to determine 128-bit trace ID from incoming span: %v", err)
		return false
	}

	hasher := fnv.New32a()
	_, _ = hasher.Write(ps.hashSeed)
	_, _ = hasher.Write(tid)
	hash := hasher.Sum32()
	keep := hash&bitMaskHashBuckets < ps.scaledSamplingPercentage
	if keep {
		sampled = true
		setMetric(root, probRateKey, ps.samplingPercentage)
	}
	return
}

func get128BitTraceID(span *trace.Span) ([]byte, error) {
	// If it's an otel span the whole trace ID is in otel.trace
	if tid, ok := span.Meta["otel.trace_id"]; ok {
		bs, err := hex.DecodeString(tid)
		if err != nil {
			return nil, err
		}
		return bs, nil
	}
	tid := make([]byte, 16)
	binary.BigEndian.PutUint64(tid[8:], span.TraceID)
	// Get hex encoded upper bits for datadog spans
	// If no value is found we can use the default `0` value as that's what will have been propagated
	if upper, ok := span.Meta["_dd.p.tid"]; ok {
		u, err := strconv.ParseUint(upper, 16, 64)
		if err != nil {
			return nil, err
		}
		binary.BigEndian.PutUint64(tid[:8], u)
	}
	return tid, nil
}
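The reason the ProbabilisticSampler ignores any earlier sampling priority is visible in the code above: the decision depends only on a hash of the seed and the trace ID. The Python sketch below models that idea for the full trace ID mode. It is an illustration under assumptions, not a port: the quoted Go code does not show how `bitMaskHashBuckets` and `ps.scaledSamplingPercentage` are defined, so the bucket count `0x4000` and the scaling formula here are assumed values, and `trace_id_128` and `keep` are hypothetical names. The FNV-1a function uses the standard 32-bit offset basis and prime, which is the algorithm behind Go's `fnv.New32a`.

```python
import struct

# Assumed constants: not shown in the quoted Go code.
NUM_HASH_BUCKETS = 0x4000
BIT_MASK_HASH_BUCKETS = NUM_HASH_BUCKETS - 1


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def trace_id_128(trace_id_low64: int, meta: dict) -> bytes:
    """Model of get128BitTraceID: prefer otel.trace_id, else combine _dd.p.tid with the low 64 bits."""
    if "otel.trace_id" in meta:
        return bytes.fromhex(meta["otel.trace_id"])
    # A missing "_dd.p.tid" leaves the upper 64 bits at zero.
    upper = int(meta.get("_dd.p.tid", "0"), 16)
    return struct.pack(">QQ", upper, trace_id_low64)


def keep(tid: bytes, hash_seed: bytes, sampling_percentage: float) -> bool:
    """Keep the trace when its hash bucket falls below the scaled percentage."""
    scaled = int(sampling_percentage * NUM_HASH_BUCKETS / 100)
    # Writing the seed and then the trace ID to the hasher equals hashing their concatenation.
    bucket = fnv1a_32(hash_seed + tid) & BIT_MASK_HASH_BUCKETS
    return bucket < scaled


# Trace ID values taken from the "before" span JSON in the PR description.
otel_tid = trace_id_128(0, {"otel.trace_id": "a78dd7365a74854ba7d4dd9c8fdef5ec"})
dd_tid = trace_id_128(0xA7D4DD9C8FDEF5EC, {"_dd.p.tid": "a78dd7365a74854b"})
print(otel_tid == dd_tid, keep(otel_tid, b"", 0), keep(otel_tid, b"", 100))
```

Because the inputs are only the seed and the trace ID, the same trace gets the same decision every time, regardless of what the OTLP receiver decided earlier.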

For info, we now have the IsConfigured function in the config to know if a setting was set by the user or comes from the defaults.

keisku · 2025-02-04T22:32:12Z

/merge

dd-devflow · 2025-02-04T22:32:19Z

Devflow running: `/merge`

View all feedbacks in Devflow UI.

2025-02-04 22:32:19 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 27m.

2025-02-04 22:59:24 UTC ℹ️ MergeQueue: This merge request was merged

github-actions bot added short review PR is simple enough to be reviewed quickly team/opentelemetry OpenTelemetry team labels Jan 30, 2025

keisku changed the title ~~Error Sampler should work for OTLP spans~~ Improve Error sampler for OTLP spans Jan 30, 2025

keisku force-pushed the keisku/APMS-14685-error-sampler branch from a720121 to 6121815 Compare January 30, 2025 16:14

github-actions bot added medium review PR review might take time and removed short review PR is simple enough to be reviewed quickly labels Jan 30, 2025

keisku changed the title ~~Improve Error sampler for OTLP spans~~ Avoid setting ingestion_reason:probabilistic always for OTLP error spans Jan 30, 2025

keisku force-pushed the keisku/APMS-14685-error-sampler branch from 6121815 to ee1c2e6 Compare January 30, 2025 16:20

keisku changed the title ~~Avoid setting ingestion_reason:probabilistic always for OTLP error spans~~ Set ingestion_reason:error instead of ingestion_reason:probabilistic when an OTLP span is sampled by Error Sampler Jan 30, 2025

keisku force-pushed the keisku/APMS-14685-error-sampler branch 4 times, most recently from 9e464b4 to f60d478 Compare January 31, 2025 02:38

avoid setting ingestion_reason:probabilistic always for OTLP error spans

8974346

keisku force-pushed the keisku/APMS-14685-error-sampler branch from f60d478 to 8974346 Compare January 31, 2025 02:38

keisku added the qa/done QA done before merge and regressions are covered by tests label Jan 31, 2025

keisku changed the title ~~Set ingestion_reason:error instead of ingestion_reason:probabilistic when an OTLP span is sampled by Error Sampler~~ Prevent setting a probabilistic decision maker for OTLP spans sampled by the Error Sampler Jan 31, 2025

reno new

aa76f03

keisku marked this pull request as ready for review January 31, 2025 04:45

keisku requested review from a team as code owners January 31, 2025 04:45

keisku requested a review from mx-psi January 31, 2025 04:45

keisku added this to the 7.63.0 milestone Jan 31, 2025

songy23 approved these changes Jan 31, 2025

songy23 removed the request for review from mx-psi January 31, 2025 14:07

cswatt approved these changes Jan 31, 2025

releasenotes/notes/fix-ingestion-reason-for-otlp-spans-71a29c0ffb8f7ade.yaml (outdated, resolved)

keisku requested review from a team as code owners February 1, 2025 00:33

update docs and comments

21d947b

keisku force-pushed the keisku/APMS-14685-error-sampler branch from e2988ef to 21d947b Compare February 1, 2025 00:41

keisku commented Feb 1, 2025

ichinaski approved these changes Feb 3, 2025

hush-hush approved these changes Feb 4, 2025

songy23 approved these changes Feb 4, 2025

dd-mergequeue bot merged commit 305794c into main Feb 4, 2025
235 checks passed

dd-mergequeue bot deleted the keisku/APMS-14685-error-sampler branch February 4, 2025 22:59

github-actions bot modified the milestones: 7.63.0, 7.64.0 Feb 4, 2025
