Benchmark and document tail-based sampling performance #11346

axw · 2023-08-07T08:48:15Z

We have a good benchmarking setup for general apm-server ingest performance, but tail-based sampling is a bit of a blind spot. We have done this manually in the past, but we don't have a framework for repeatable testing of TBS.

Once we have established a performance baseline, we should extend the public documentation. This should include details about the disks used and which kinds of disks are recommended, expectations about disk and memory usage in relation to ingest rate and sampling rate, and specific guidance on setting the tail sampling storage limit. Documentation on TBS performance should probably follow on from #7842.

We will need #7845. Assuming we use apmbench, we will need to enable -rewrite-ids to ensure that trace.id and per-trace events are not repeated, since repetition would affect TBS.

Note to whoever works on this:

- We should look at how both disk reads and writes grow with both event ingest rate and sampling rate. Rate of writes is generally proportional to ingest rate, but rate of reads is expected to be proportional to the ingest * sampling rate.
- We should compare Badger v2 (in use at the time of writing this) vs. v4 performance (proposed).
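
The expected relationship between ingest rate, sampling rate, and disk IO can be written down as a small model to compare benchmark results against. This is only a sketch of the proportionality stated above: the function name and the `write_cost` and `read_cost` constants are hypothetical placeholders that a benchmark would have to measure, not values taken from apm-server.

```python
def expected_disk_rates(ingest_rate, sampling_rate, write_cost=1.0, read_cost=1.0):
    """Return (writes_per_sec, reads_per_sec) under the stated expectation.

    ingest_rate   -- events per second received by the apm-server
    sampling_rate -- fraction of traces that are sampled, between 0 and 1
    write_cost    -- hypothetical proportionality constant for writes
    read_cost     -- hypothetical proportionality constant for reads
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    # Every ingested event is written to local storage.
    writes = write_cost * ingest_rate
    # Only events that belong to sampled traces are read back.
    reads = read_cost * ingest_rate * sampling_rate
    return writes, reads


if __name__ == "__main__":
    for rate in (0.0, 0.25, 1.0):
        writes, reads = expected_disk_rates(10_000, rate)
        print(f"sampling_rate={rate}: writes={writes}, reads={reads}")
```

If measured writes change noticeably with the sampling rate, or measured reads do not scale with it, that is a sign that the expectation above does not hold for the configuration under test.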


carsonip · 2024-07-01T09:11:58Z

Adding links for posterity.

> We should look at how both disk reads and writes grow with both event ingest rate and sampling rate. Rate of writes is generally proportional to ingest rate, but rate of reads is expected to be proportional to the ingest * sampling rate.

To be more exact, in a setup with multiple apm-servers, the expectation is that the rate of writes is generally proportional to the local ingest rate, that is, the ingest rate of the apm-server under observation.

On the read side, the expectation is that the rate of reads is proportional to local ingest rate * sampling rate. However, without the fix in #13464, apm-server has a rate of reads proportional to global ingest rate * sampling rate, which makes disk IO and memory usage unscalable.
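
The difference between local and global scaling can be illustrated with a simplified model. This is a sketch under the proportionality assumptions in this comment, not apm-server code: the function name and the `scoped_to_local` flag are hypothetical, and the proportionality constant is taken to be 1.

```python
def read_rate_per_server(local_ingest_rates, sampling_rate, scoped_to_local=True):
    """Return the expected read rate of each apm-server in a cluster.

    local_ingest_rates -- events per second ingested by each server
    sampling_rate      -- fraction of traces that are sampled, between 0 and 1
    scoped_to_local    -- True models reads proportional to local ingest,
                          False models reads proportional to global ingest
    """
    if scoped_to_local:
        # Each server only reads back events for its own share of the traffic.
        return [rate * sampling_rate for rate in local_ingest_rates]
    # Each server reads as if it had ingested the traffic of the whole cluster.
    global_ingest = sum(local_ingest_rates)
    return [global_ingest * sampling_rate for _ in local_ingest_rates]


if __name__ == "__main__":
    rates = [1000, 1000, 2000]
    print("local scaling: ", read_rate_per_server(rates, 0.5))
    print("global scaling:", read_rate_per_server(rates, 0.5, scoped_to_local=False))
```

In this model, adding servers under global scaling increases the read rate of every existing server, so total reads across the cluster grow faster than total ingest. Under local scaling, total reads stay proportional to total ingest.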

> We should compare Badger v2 (in use at the time of writing this) vs. v4 performance (proposed)

Related to #11546

axw added docs performance labels Aug 7, 2023

carsonip mentioned this issue Aug 31, 2023

Document TBS common pitfalls #11544

Closed

carsonip mentioned this issue Nov 8, 2023

docs: Update performance guide #11969

Merged

4 tasks

This was referenced Dec 12, 2024

[meta] Tail-based sampling (TBS) improvements #14931

Open

Add option to enable TBS in benchmark terraform #14985

Merged

mergify bot mentioned this issue Dec 17, 2024

[8.x] Add option to enable TBS in benchmark terraform (backport #14985) #14987

Merged

2 tasks
