[PROF-10201] Reduce allocation profiling overhead by using coarse timestamps

**What does this PR do?**

This PR reduces the allocation profiling overhead (or: optimizes allocation profiling :D ) by using coarse timestamps on the `on_newobj_event` hot path.

**Motivation:**

When allocation profiling is enabled, the profiler gets called for almost every object allocated in the Ruby VM. This code path is thus extremely sensitive: the less work we do before returning control to Ruby, the less impact allocation profiling has on the application.

The dynamic sampling rate mechanism we employ takes the current timestamp as an input to decide if "enough time" has elapsed since it last readjusted itself. But "enough time" right now is *one second*, so we can get away with using `CLOCK_MONOTONIC_COARSE` on Linux, which is noticeably cheaper than the regular `CLOCK_MONOTONIC`.

**Additional Notes:**

Enabling the use of the coarse monotonic clock _sometimes_ on the discrete dynamic sampler required it to "spill" some of its guts out to the caller, so that the caller can correctly use the coarse clock on the hot path (a rough sketch of this shape is included at the end of this description).

**How to test the change?**

Here's my experiment comparing the three different clock sources I evaluated:

```c++
// Build with
// `g++ time_sources.cpp -o time_sources -lbenchmark -lpthread`
// where benchmark is <https://github.com/google/benchmark> aka
// `apt install libbenchmark1 libbenchmark-dev` on ubuntu/debian
#include <benchmark/benchmark.h>
#include <x86intrin.h> // For __rdtsc
#include <ctime>       // For clock_gettime

static void BM_RDTSC(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(__rdtsc());
  }
}

static void BM_ClockMonotonic(benchmark::State& state) {
  timespec ts;
  for (auto _ : state) {
    clock_gettime(CLOCK_MONOTONIC, &ts);
    benchmark::DoNotOptimize(ts);
  }
}

static void BM_ClockMonotonicCoarse(benchmark::State& state) {
  timespec ts;
  for (auto _ : state) {
    clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
    benchmark::DoNotOptimize(ts);
  }
}

BENCHMARK(BM_RDTSC);
BENCHMARK(BM_ClockMonotonic);
BENCHMARK(BM_ClockMonotonicCoarse);

BENCHMARK_MAIN();
```

Results on my machine:

```
./time_sources --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2024-07-19T10:48:20+01:00
Running ./time_sources
Run on (20 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.23, 1.30, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
-------------------------------------------------------------------------
BM_RDTSC_mean                          5.52 ns         5.52 ns           10
BM_RDTSC_median                        5.44 ns         5.44 ns           10
BM_RDTSC_stddev                       0.148 ns        0.147 ns           10
BM_RDTSC_cv                            2.67 %          2.67 %            10
BM_ClockMonotonic_mean                 15.8 ns         15.8 ns           10
BM_ClockMonotonic_median               15.4 ns         15.4 ns           10
BM_ClockMonotonic_stddev               1.07 ns         1.07 ns           10
BM_ClockMonotonic_cv                   6.77 %          6.77 %            10
BM_ClockMonotonicCoarse_mean           5.92 ns         5.92 ns           10
BM_ClockMonotonicCoarse_median         5.93 ns         5.93 ns           10
BM_ClockMonotonicCoarse_stddev        0.041 ns        0.041 ns           10
BM_ClockMonotonicCoarse_cv             0.68 %          0.68 %            10
```

and here's the result of running `benchmarks/profiler_allocation.rb` comparing master to this branch:

```
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)     1.431M i/100ms
Calculating -------------------------------------
Allocations (baseline)     14.370M (± 2.0%) i/s -    144.541M in  10.062635s
Warming up --------------------------------------
  Allocations (master)     1.014M i/100ms
Calculating -------------------------------------
  Allocations (master)     10.165M (± 1.0%) i/s -    102.390M in  10.074151s
Warming up --------------------------------------
  Allocations (coarse)     1.179M i/100ms
Calculating -------------------------------------
  Allocations (coarse)     11.495M (± 2.5%) i/s -    115.573M in  10.059971s

Comparison:
Allocations (baseline): 14369960.1 i/s
  Allocations (coarse): 11495418.2 i/s - 1.25x slower
  Allocations (master): 10164615.7 i/s - 1.41x slower
```

I've specifically used Ruby 2.7 for this comparison since this benchmark had a lot more variance (including in the baseline) on later Rubies, and I wanted to isolate the specific changes to this code path.
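To make the "spill its guts" idea from the Additional Notes a bit more concrete, here is a minimal sketch of the shape involved. All names (`sampler_state`, `coarse_monotonic_now_ns`, `needs_readjustment`) are hypothetical and not the actual profiler API; the point is just that the caller ends up holding enough sampler state to perform the cheap coarse-clock check itself on the hot path, falling back to `CLOCK_MONOTONIC` where the coarse clock doesn't exist.

```c++
// Minimal sketch (hypothetical names, not the profiler's actual API) of a
// caller-side check using the coarse clock, assuming the sampler only needs
// second-level granularity to decide when to readjust itself.
#include <ctime>
#include <cstdint>

struct sampler_state {
  // Readjustment window (one second here), far larger than the coarse
  // clock's resolution, which is what makes the coarse clock usable.
  int64_t adjustment_window_ns;
  // "Spilled out" to the caller: the next point in time at which the
  // (more expensive) readjustment logic needs to run.
  int64_t next_readjustment_ns;
};

static inline int64_t timespec_to_ns(const timespec& ts) {
  return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

// Cheapest monotonic timestamp we can get away with on the hot path.
static inline int64_t coarse_monotonic_now_ns() {
  timespec ts;
#ifdef CLOCK_MONOTONIC_COARSE
  clock_gettime(CLOCK_MONOTONIC_COARSE, &ts); // cheaper, millisecond-ish resolution
#else
  clock_gettime(CLOCK_MONOTONIC, &ts);        // portable fallback
#endif
  return timespec_to_ns(ts);
}

// Called on every allocation: only this comparison runs on the hot path;
// the full readjustment happens at most once per window.
static inline bool needs_readjustment(const sampler_state& sampler) {
  return coarse_monotonic_now_ns() >= sampler.next_readjustment_ns;
}
```

The design trade-off is the one described in the Motivation: the timestamp gets less precise, but the comparison only needs to detect that roughly a second has passed, so the extra coarseness doesn't matter.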
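One assumption worth double-checking on a given machine is that the coarse clock's resolution (typically the kernel tick, a few milliseconds) really is far below the one-second readjustment window. This standalone snippet (not part of this PR) prints the resolution of both clocks via `clock_getres`:

```c++
// Standalone check (not part of this PR): print the resolution of the regular
// and coarse monotonic clocks, to confirm the coarse clock is still much
// finer-grained than the one-second readjustment window discussed above.
#include <cstdio>
#include <ctime>

static void print_resolution(const char* name, clockid_t clock_id) {
  timespec res;
  if (clock_getres(clock_id, &res) == 0) {
    std::printf("%s resolution: %ld.%09ld s\n", name, static_cast<long>(res.tv_sec), res.tv_nsec);
  } else {
    std::printf("%s is not available on this system\n", name);
  }
}

int main() {
  print_resolution("CLOCK_MONOTONIC       ", CLOCK_MONOTONIC);
#ifdef CLOCK_MONOTONIC_COARSE
  print_resolution("CLOCK_MONOTONIC_COARSE", CLOCK_MONOTONIC_COARSE);
#endif
  return 0;
}
```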