Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/tailsampling] Late span age histogram should include sampled traces #37180

Merged
merged 7 commits into from
Jan 16, 2025

Conversation

portertech
Copy link
Contributor

@portertech portertech commented Jan 13, 2025

Currently, the late span age histogram only includes non-sampled traces. The histogram should also include sampled traces. This pull request includes anything with a decision time.

Note: The decision cache still wrecks it 😅

@portertech portertech requested review from jpkrohling and a team as code owners January 13, 2025 22:34
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jan 13, 2025
Signed-off-by: Sean Porter <[email protected]>
Copy link
Contributor

@fatsheep9146 fatsheep9146 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a unit test for this fix?

@portertech
Copy link
Contributor Author

Is it possible to add a unit test for this fix?

I'll give it a go 👍

@portertech
Copy link
Contributor Author

@fatsheep9146 I adjusted the existing histogram test to touch sampled traces 👍

I confirmed the test fails on main, only half of the late spans (not sampled) are counted:

?       github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/telemetry    [no test files]
--- FAIL: TestProcessorTailSamplingSamplingLateSpanAge (0.00s)
    processor_telemetry_test.go:519: [Metrics Data not equal: Histogram not equal: Histogram DataPoints not equal:
        missing expected values:
        metricdata.HistogramDataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:interface {}(nil)}}, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Time:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Count:0xa, Bounds:[]float64{0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000}, BucketCounts:[]uint64{0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, Min:metricdata.Extrema[int64]{value:0, valid:true}, Max:metricdata.Extrema[int64]{value:0, valid:true}, Sum:0, Exemplars:[]metricdata.Exemplar[int64](nil)}
        unexpected additional values:
        metricdata.HistogramDataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[0]attribute.KeyValue{}}}, StartTime:time.Date(2025, time.January, 13, 22, 55, 4, 84030000, time.Local), Time:time.Date(2025, time.January, 13, 22, 55, 4, 84311000, time.Local), Count:0x5, Bounds:[]float64{0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000}, BucketCounts:[]uint64{0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, Min:metricdata.Extrema[int64]{value:0, valid:true}, Max:metricdata.Extrema[int64]{value:0, valid:true}, Sum:0, Exemplars:[]metricdata.Exemplar[int64]{}}
        ]

@portertech
Copy link
Contributor Author

Looks like an unrelated lint error.

Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, but I really, really think the tail-sampling processor is due to a refactor. "Late span" used to refer to spans that arrived after a sampling decision had been made and dropped, so that people could see why traces are incomplete. With the decision caches, the whole idea of late spans need to be reconsidered. Is it still valuable to track it? Why? If so, it should be better documented.

@fatsheep9146 fatsheep9146 added the ready to merge Code review completed; ready to merge by maintainers label Jan 16, 2025
@portertech
Copy link
Contributor Author

This LGTM, but I really, really think the tail-sampling processor is due to a refactor. "Late span" used to refer to spans that arrived after a sampling decision had been made and dropped, so that people could see why traces are incomplete. With the decision caches, the whole idea of late spans need to be reconsidered. Is it still valuable to track it? Why? If so, it should be better documented.

I agree, it is due for a refactor. This PR, along with #37212 and #37189 are immediate improvements in preparation for such a refactor. Next, we can require the decision caches (breaking change) and optimize the current architecture (e.g. batcher and decision tick loop). It's a very small leap which will result in considerable resource savings. From there, we can gather enough data to direct more significant rework.

@songy23 songy23 merged commit be26a2a into open-telemetry:main Jan 16, 2025
175 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 16, 2025
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this pull request Jan 26, 2025
…ed traces (open-telemetry#37180)

Currently, the [late span age
histogram](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/981e9f5cc58149ea37c8909a4a6eb30faee96351/processor/tailsamplingprocessor/internal/metadata/generated_telemetry.go#L106-L110)
only includes non-sampled traces. The histogram should also include
sampled traces. This pull request includes anything with a decision
time.

_Note: The decision cache still wrecks it 😅_

---------

Signed-off-by: Sean Porter <[email protected]>
Co-authored-by: Juraci Paixão Kröhling <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
processor/tailsampling Tail sampling processor ready to merge Code review completed; ready to merge by maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants