TBS: Explore replacing badger with pebble #15246

carsonip · 2025-01-14T19:02:07Z

Investigate whether it is feasible to replace badger with pebble so that we consolidate on one KV store dependency. Ensure there is no major performance regression, confirm that existing apm-server features are still supported, and identify blockers to adopting pebble for TBS.

carsonip · 2025-01-14T19:17:56Z

Benchmarks

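TLDR: After several rounds of optimization and profile analysis, pebble is on par with badger in the benchmark workflow. Throughput differences are not statistically significant (p > 0.05), while B/op is up about 34% and max_rss about 6.5%.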

Work done so far

Created a new branch [tbs-pebble-rebase], rebased on main; see draft PR: [WIP] TBS: Replace badger with pebble #15235
Used a forked pebble for max batch to reduce allocs
Tuned mem table size and flush threshold
Disabled level compression
Enabled table bloom filter

badger on apm-server main benchmark workflow

Running benchmarks...
Benchmark warmup time: 5m
Benchmark agents: 512
Benchmark event rate: 0/s
Benchmark count: 6
Benchmark duration: 2m
Benchmark run expression : BenchmarkAgentAll
BenchmarkAgentAll-512	     553	 260541248 ns/op	         0 error_responses/sec	      1382 errors/sec	     21758 events/sec	       907.0 gc_cycles	       707.0 max_goroutines	1261444136 max_heap_alloc	   4682987 max_heap_objects	1433145344 max_rss	        16.51 mean_available_indexers	     10562 metrics/sec	      8970 spans/sec	  35637795 tbs_lsm_size	 392156266 tbs_vlog_size	       844.4 txs/sec	448555310 B/op	 4411221 allocs/op
BenchmarkAgentAll-512	     801	 252201992 ns/op	         0 error_responses/sec	      1427 errors/sec	     22481 events/sec	      1379 gc_cycles	       704.0 max_goroutines	1260306400 max_heap_alloc	   4736696 max_heap_objects	1438785536 max_rss	        16.29 mean_available_indexers	     10915 metrics/sec	      9266 spans/sec	  45601504 tbs_lsm_size	 351887394 tbs_vlog_size	       872.3 txs/sec	453655783 B/op	 4417256 allocs/op
BenchmarkAgentAll-512	     561	 261476961 ns/op	         0 error_responses/sec	      1377 errors/sec	     21688 events/sec	       940.0 gc_cycles	       691.0 max_goroutines	1227936304 max_heap_alloc	   4860484 max_heap_objects	1467277312 max_rss	        16.65 mean_available_indexers	     10533 metrics/sec	      8938 spans/sec	  60765003 tbs_lsm_size	 306052159 tbs_vlog_size	       841.4 txs/sec	449569352 B/op	 4413293 allocs/op
BenchmarkAgentAll-512	     885	 205185009 ns/op	         0 error_responses/sec	      1755 errors/sec	     27635 events/sec	      1546 gc_cycles	       702.0 max_goroutines	1042784816 max_heap_alloc	   4403011 max_heap_objects	1194815488 max_rss	        14.37 mean_available_indexers	     13418 metrics/sec	     11390 spans/sec	  50891789 tbs_lsm_size	 332519163 tbs_vlog_size	      1072 txs/sec	456159205 B/op	 4414346 allocs/op
BenchmarkAgentAll-512	     559	 281483320 ns/op	         0 error_responses/sec	      1279 errors/sec	     20144 events/sec	       905.0 gc_cycles	       703.0 max_goroutines	1276355808 max_heap_alloc	   5081951 max_heap_objects	1498091520 max_rss	        16.95 mean_available_indexers	      9781 metrics/sec	      8302 spans/sec	  36830045 tbs_lsm_size	 266372657 tbs_vlog_size	       781.6 txs/sec	449191625 B/op	 4425107 allocs/op
BenchmarkAgentAll-512	     739	 223036149 ns/op	         0 error_responses/sec	      1614 errors/sec	     25421 events/sec	      1417 gc_cycles	       697.0 max_goroutines	 811746504 max_heap_alloc	   3604364 max_heap_objects	1018437632 max_rss	        15.44 mean_available_indexers	     12343 metrics/sec	     10478 spans/sec	  52188296 tbs_lsm_size	 268349381 tbs_vlog_size	       986.4 txs/sec	458476559 B/op	 4407548 allocs/op
make[1]: Leaving directory '/home/runner/work/apm-server/apm-server/testing/benchmark'

pebble benchmark workflow on commit 1c89735

BenchmarkAgentAll-512	     472	 298160472 ns/op	         0 error_responses/sec	      1207 errors/sec	     19997 events/sec	       601.0 gc_cycles	       757.0 max_goroutines	1388689704 max_heap_alloc	   4322788 max_heap_objects	1505554432 max_rss	        16.94 mean_available_indexers	      9231 metrics/sec	      8764 spans/sec	 155904548 tbs_lsm_size	         0 tbs_vlog_size	       794.9 txs/sec	604305729 B/op	 4400824 allocs/op
BenchmarkAgentAll-512	     499	 309514313 ns/op	         0 error_responses/sec	      1163 errors/sec	     19268 events/sec	       632.0 gc_cycles	       725.0 max_goroutines	1404890096 max_heap_alloc	   4534565 max_heap_objects	1553813504 max_rss	        17.15 mean_available_indexers	      8897 metrics/sec	      8442 spans/sec	 156954754 tbs_lsm_size	         0 tbs_vlog_size	       765.7 txs/sec	613979377 B/op	 4435160 allocs/op
BenchmarkAgentAll-512	     782	 244716316 ns/op	         0 error_responses/sec	      1471 errors/sec	     24376 events/sec	       975.0 gc_cycles	       750.0 max_goroutines	1486198504 max_heap_alloc	   4398963 max_heap_objects	1604104192 max_rss	        15.34 mean_available_indexers	     11259 metrics/sec	     10678 spans/sec	 156741043 tbs_lsm_size	         0 tbs_vlog_size	       968.5 txs/sec	609378909 B/op	 4418804 allocs/op
BenchmarkAgentAll-512	     502	 301330682 ns/op	         0 error_responses/sec	      1195 errors/sec	     19788 events/sec	       631.0 gc_cycles	       728.0 max_goroutines	1323958592 max_heap_alloc	   4258571 max_heap_objects	1499250688 max_rss	        17.13 mean_available_indexers	      9135 metrics/sec	      8672 spans/sec	 156855826 tbs_lsm_size	         0 tbs_vlog_size	       786.5 txs/sec	604224580 B/op	 4408866 allocs/op
BenchmarkAgentAll-512	     636	 238955232 ns/op	         0 error_responses/sec	      1507 errors/sec	     24958 events/sec	       814.0 gc_cycles	       742.0 max_goroutines	1526017728 max_heap_alloc	   4568769 max_heap_objects	1647812608 max_rss	        15.22 mean_available_indexers	     11524 metrics/sec	     10935 spans/sec	 157031321 tbs_lsm_size	         0 tbs_vlog_size	       991.8 txs/sec	604787049 B/op	 4404682 allocs/op
BenchmarkAgentAll-512	     504	 304077702 ns/op	         0 error_responses/sec	      1184 errors/sec	     19612 events/sec	       634.0 gc_cycles	       798.0 max_goroutines	1349592872 max_heap_alloc	   4744070 max_heap_objects	1482018816 max_rss	        17.18 mean_available_indexers	      9056 metrics/sec	      8593 spans/sec	 156933769 tbs_lsm_size	         0 tbs_vlog_size	       779.4 txs/sec	606208797 B/op	 4422900 allocs/op
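
For anyone who wants to post-process the raw lines above without benchstat, here is a minimal sketch of a parser. It assumes the usual Go benchmark layout (name, iteration count, then repeated value/unit pairs); `parseBenchLine` is a hypothetical helper, not something in apm-server.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBenchLine splits one line of Go benchmark output into its name,
// iteration count, and a unit -> value map. It assumes the layout
// "<name> <iterations> (<value> <unit>)...".
func parseBenchLine(line string) (string, int, map[string]float64, error) {
	fields := strings.Fields(line)
	// name + iterations + at least one value/unit pair, so an even count >= 4.
	if len(fields) < 4 || len(fields)%2 != 0 {
		return "", 0, nil, fmt.Errorf("unexpected field count %d", len(fields))
	}
	iters, err := strconv.Atoi(fields[1])
	if err != nil {
		return "", 0, nil, fmt.Errorf("iterations: %w", err)
	}
	metrics := make(map[string]float64)
	for i := 2; i < len(fields); i += 2 {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return "", 0, nil, fmt.Errorf("value %q: %w", fields[i], err)
		}
		metrics[fields[i+1]] = v
	}
	return fields[0], iters, metrics, nil
}

func main() {
	// First pebble sample from the output above, shortened to a few metrics.
	line := "BenchmarkAgentAll-512 472 298160472 ns/op 19997 events/sec 155904548 tbs_lsm_size 0 tbs_vlog_size 604305729 B/op"
	name, iters, m, err := parseBenchLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s iters=%d events/sec=%.0f tbs_lsm_size=%.0f\n",
		name, iters, m["events/sec"], m["tbs_lsm_size"])
}
```

The map is keyed by unit, so custom metrics such as tbs_lsm_size and tbs_vlog_size are handled the same way as ns/op.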

Benchstat

             │ badger-main.stat │     run-1122-1c89735.stat     │
             │      sec/op      │    sec/op     vs base         │
AgentAll-512       256.4m ± 20%   299.7m ± 20%  ~ (p=0.132 n=6)
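
As a sanity check on the p=0.132 above: benchstat compares samples with a Mann-Whitney U test. The sketch below recomputes the p-value from the twelve ns/op samples, assuming an exact two-sided test with no ties. The helper names are mine and this is not benchstat's actual code.

```go
package main

import "fmt"

// uStatistic counts pairs (x, y) with y < x. Ties add 0.5, although the
// exact p-value below is only valid when there are no ties.
func uStatistic(xs, ys []float64) float64 {
	u := 0.0
	for _, x := range xs {
		for _, y := range ys {
			switch {
			case y < x:
				u++
			case y == x:
				u += 0.5
			}
		}
	}
	return u
}

// countU returns the number of orderings of m x-samples and n y-samples
// whose U statistic equals u, via N(m,n,u) = N(m-1,n,u-n) + N(m,n-1,u).
func countU(m, n, u int, memo map[[3]int]float64) float64 {
	if u < 0 {
		return 0
	}
	if m == 0 || n == 0 {
		if u == 0 {
			return 1
		}
		return 0
	}
	key := [3]int{m, n, u}
	if v, ok := memo[key]; ok {
		return v
	}
	v := countU(m-1, n, u-n, memo) + countU(m, n-1, u, memo)
	memo[key] = v
	return v
}

// exactP returns the exact two-sided Mann-Whitney p-value (no ties assumed).
func exactP(xs, ys []float64) float64 {
	m, n := len(xs), len(ys)
	u := uStatistic(xs, ys)
	if alt := float64(m*n) - u; alt < u {
		u = alt // use the smaller of U and m*n-U
	}
	memo := make(map[[3]int]float64)
	var tail, total float64
	for k := 0; k <= m*n; k++ {
		c := countU(m, n, k, memo)
		total += c
		if float64(k) <= u {
			tail += c
		}
	}
	p := 2 * tail / total
	if p > 1 {
		p = 1
	}
	return p
}

func main() {
	// ns/op samples copied from the two benchmark runs above.
	badger := []float64{260541248, 252201992, 261476961, 205185009, 281483320, 223036149}
	pebble := []float64{298160472, 309514313, 244716316, 301330682, 238955232, 304077702}
	fmt.Printf("U=%.0f p=%.3f\n", uStatistic(badger, pebble), exactP(badger, pebble))
	// prints "U=8 p=0.132"
}
```

Only 8 of the 36 badger/pebble pairs have pebble faster, but with n=6 per side that is not enough to reject the null hypothesis at 0.05, hence the `~` in the table.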

             │  badger-main.stat   │         run-1122-1c89735.stat          │
             │ error_responses/sec │ error_responses/sec  vs base           │
AgentAll-512            0.000 ± 0%            0.000 ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

             │ badger-main.stat │     run-1122-1c89735.stat     │
             │    errors/sec    │  errors/sec   vs base         │
AgentAll-512       1.405k ± 25%   1.201k ± 25%  ~ (p=0.132 n=6)

             │ badger-main.stat │     run-1122-1c89735.stat     │
             │    events/sec    │  events/sec   vs base         │
AgentAll-512       22.12k ± 25%   19.89k ± 25%  ~ (p=0.132 n=6)

             │ badger-main.stat │       run-1122-1c89735.stat        │
             │    gc_cycles     │  gc_cycles   vs base               │
AgentAll-512       1159.5 ± 33%   633.0 ± 54%  -45.41% (p=0.015 n=6)

             │ badger-main.stat │        run-1122-1c89735.stat         │
             │  max_goroutines  │ max_goroutines  vs base              │
AgentAll-512         702.5 ± 2%       746.0 ± 7%  +6.19% (p=0.002 n=6)

             │ badger-main.stat │         run-1122-1c89735.stat         │
             │  max_heap_alloc  │ max_heap_alloc  vs base               │
AgentAll-512       1.244G ± 35%      1.397G ± 9%  +12.27% (p=0.002 n=6)

             │ badger-main.stat │       run-1122-1c89735.stat       │
             │ max_heap_objects │ max_heap_objects  vs base         │
AgentAll-512       4.710M ± 23%        4.467M ± 6%  ~ (p=0.310 n=6)

             │ badger-main.stat │       run-1122-1c89735.stat       │
             │     max_rss      │   max_rss    vs base              │
AgentAll-512       1.436G ± 29%   1.530G ± 8%  +6.53% (p=0.004 n=6)

             │    badger-main.stat     │          run-1122-1c89735.stat           │
             │ mean_available_indexers │ mean_available_indexers  vs base         │
AgentAll-512               16.40 ± 12%               17.04 ± 11%  ~ (p=0.310 n=6)

             │ badger-main.stat │     run-1122-1c89735.stat     │
             │   metrics/sec    │ metrics/sec   vs base         │
AgentAll-512      10.739k ± 25%   9.183k ± 25%  ~ (p=0.132 n=6)

             │ badger-main.stat │     run-1122-1c89735.stat     │
             │    spans/sec     │  spans/sec    vs base         │
AgentAll-512       9.118k ± 25%   8.718k ± 25%  ~ (p=0.589 n=6)

             │ badger-main.stat │        run-1122-1c89735.stat         │
             │   tbs_lsm_size   │ tbs_lsm_size  vs base                │
AgentAll-512       48.25M ± 26%   156.89M ± 1%  +225.19% (p=0.002 n=6)

             │ badger-main.stat │         run-1122-1c89735.stat         │
             │  tbs_vlog_size   │ tbs_vlog_size  vs base                │
AgentAll-512       319.3M ± 23%       0.0M ± 0%  -100.00% (p=0.002 n=6)

             │ badger-main.stat │    run-1122-1c89735.stat     │
             │     txs/sec      │   txs/sec    vs base         │
AgentAll-512        858.3 ± 25%   790.7 ± 25%  ~ (p=0.310 n=6)

             │ badger-main.stat │        run-1122-1c89735.stat        │
             │       B/op       │     B/op      vs base               │
AgentAll-512       430.7Mi ± 2%   577.4Mi ± 1%  +34.07% (p=0.002 n=6)

             │ badger-main.stat │    run-1122-1c89735.stat     │
             │    allocs/op     │  allocs/op   vs base         │
AgentAll-512        4.414M ± 0%   4.414M ± 0%  ~ (p=0.937 n=6)
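
The +225% on tbs_lsm_size looks alarming on its own, so here is a small sketch that recomputes it from the raw samples and also compares per-run lsm + vlog totals. It assumes benchstat's summary value is the sample median, and that tbs_lsm_size + tbs_vlog_size is a fair proxy for badger's total on-disk footprint; both are my assumptions, and `median` and `pctDelta` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle value of xs (mean of the two middle values
// for an even-sized sample) without modifying the input slice.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// pctDelta returns the change from base to next as a percentage of base.
func pctDelta(base, next float64) float64 {
	return (next - base) / base * 100
}

func main() {
	// Samples copied from the benchmark output above, in run order.
	badgerLSM := []float64{35637795, 45601504, 60765003, 50891789, 36830045, 52188296}
	badgerVlog := []float64{392156266, 351887394, 306052159, 332519163, 266372657, 268349381}
	pebbleLSM := []float64{155904548, 156954754, 156741043, 156855826, 157031321, 156933769}

	// Per-run badger total; pebble reports tbs_vlog_size = 0, so its total is the LSM size.
	badgerTotal := make([]float64, len(badgerLSM))
	for i := range badgerLSM {
		badgerTotal[i] = badgerLSM[i] + badgerVlog[i]
	}

	fmt.Printf("tbs_lsm_size: %+.2f%%\n", pctDelta(median(badgerLSM), median(pebbleLSM)))
	// prints "tbs_lsm_size: +225.19%"
	fmt.Printf("lsm + vlog:   %+.2f%%\n", pctDelta(median(badgerTotal), median(pebbleLSM)))
}
```

The first line matches the benchstat row. Under the stated assumption, the second line comes out at roughly -58%, i.e. the pebble LSM is smaller than badger's LSM and value log combined in these runs.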

carsonip added the enhancement label Jan 14, 2025

carsonip self-assigned this Jan 14, 2025

This was referenced Jan 14, 2025

[meta] Tail-based sampling (TBS) improvements #14931

Open

Update badger to latest version #11546

Open

carsonip changed the title ~~TBS: Explore using pebble to replace badger~~ TBS: Explore replacing badger with pebble Jan 15, 2025
