Enable requesting flares with profiles via Remote Config #31398
Conversation
Go Package Import Differences (Baseline: 909942b)

Regression Detector Results
Metrics dashboard. Baseline: 909942b. Optimization Goals: ✅ No significant changes detected.
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.40 | [-0.38, +1.17] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.34 | [-0.44, +1.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.31 | [-0.55, +1.17] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.19 | [-2.87, +3.24] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.10 | [-0.37, +0.57] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.07 | [-0.64, +0.78] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.04 | [-0.03, +0.10] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.00 | [-0.85, +0.86] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.28, +0.29] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.02, +0.01] | 1 | Logs |
| ➖ | file_to_blackhole_300ms_latency | egress throughput | -0.02 | [-0.66, +0.62] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.05 | [-0.10, +0.01] | 1 | Logs, bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.06 | [-0.92, +0.80] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.10 | [-0.97, +0.77] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.23 | [-0.30, -0.16] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.72 | [-0.77, -0.67] | 1 | Logs, bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | |
| ✅ | quality_gate_logs | memory_usage | 10/10 | |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true (a compact restatement follows the list):
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
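Restating the three criteria compactly (this notation is ours, not the bot's), with \(\widehat{\Delta}\) denoting the estimated Δ mean %:

```latex
\text{flag as regression} \iff
  |\widehat{\Delta}| \ge 5\%
  \;\wedge\; 0 \notin \mathrm{CI}_{90\%}(\Delta)
  \;\wedge\; \text{experiment not marked ``erratic''}
```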
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: `inv aws.create-vm --pipeline-id=55285719 --os-family=ubuntu`
Note: This applies to commit dcc4d07
Left some non-blocking comments and questions, but AML files LGTM.
"runtime_mutex_profile_fraction": settings.NewRuntimeMutexProfileFraction(), | ||
"runtime_block_profile_rate": settings.NewRuntimeBlockProfileRate(), |
I may be missing something, but are you sure we need these runtime settings on the flare binary? Runtime settings implementations control the current process (ex), but I would think that we want to set these settings on the agent process, not the flare command process (even though we have local collection, I think for profiles we're only interested in the agent performance).
That's a good point - we need the runtime settings injection itself, but including the actual profile settings gives the mistaken impression that they're valid to use. Will provide a blank settings entity with a comment, and revisit the profile component to see if there's a better way to scaffold command-process-invoked flare fillers for future support.
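A minimal sketch of the two ideas in this thread, using assumed types and names rather than the agent's actual API: a runtime setting acts on whichever process calls it, and the flare command can therefore register an intentionally empty settings map.

```go
package main

import (
	"fmt"
	"runtime"
)

// RuntimeSetting is a hypothetical simplification of the agent's
// runtime-setting interface referenced in the snippet above.
type RuntimeSetting interface {
	Set(value int) error
}

// mutexProfileFraction illustrates why these settings belong on the agent
// process: Set only affects the process that calls it.
type mutexProfileFraction struct{}

func (mutexProfileFraction) Set(value int) error {
	runtime.SetMutexProfileFraction(value) // affects the *current* process only
	return nil
}

// flareCommandRuntimeSettings is the "blank settings entity" idea from the
// reply: the flare command still injects the runtime-settings component but
// deliberately registers nothing, since profile settings should be applied
// on the agent process rather than on the short-lived flare command process.
func flareCommandRuntimeSettings() map[string]RuntimeSetting {
	return map[string]RuntimeSetting{}
}

func main() {
	fmt.Println("registered settings:", len(flareCommandRuntimeSettings()))
}
```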
pkg/config/setup/config.go (Outdated)
@@ -357,7 +357,10 @@ func InitConfig(config pkgconfigmodel.Setup) {

	// flare configs
	config.BindEnvAndSetDefault("flare_provider_timeout", 10*time.Second)
	config.BindEnvAndSetDefault("flare.rc_profiling_runtime", 60*time.Second)
	config.BindEnvAndSetDefault("flare.rc_profiling_runtime", 30*time.Second)
Would it make sense to add these to config_template.yaml?
The last PR conversation with shared-components suggested that undocumented options were fine for these, but I will revisit now that they're actually being used.
Package size comparison
Comparison with ancestor. Diff per package.
Decision
Approved for Processes
Uncompressed package size comparison
Comparison with ancestor. Diff per package.
Decision
approval for OTel
// is to move the profiling component completely into a flare provider, the existing architecture
// expects an explicit and pre-emptive profiling run before the flare logic is properly called.
func (p profiler) ReadProfileData(seconds int, logFunc func(log string, params ...interface{}) error) (flaretypes.ProfileData, error) {
type agentProfileCollector func(service string) error
Question for my own learning: why create an internal private type? Same for pprofGetter.
The real answer is that I didn't want to significantly alter the ReadProfileData functionality during this change, but I also didn't feel any need to touch those types since:
- I could see someone saying they enhance the readability of the function, and
- they have no real impact on functionality.
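For illustration only, a self-contained sketch of how small private function types like these can keep the collection flow readable without changing behavior; the type names mirror the diff above, but the implementations and helper names here are hypothetical.

```go
package main

import "fmt"

// pprofGetter fetches a single pprof profile by path (e.g. "/heap").
type pprofGetter func(path string) ([]byte, error)

// agentProfileCollector collects every profile for one service (e.g. "core").
type agentProfileCollector func(service string) error

// collectAll reads as plain prose because the function types carry the intent.
func collectAll(services []string, collect agentProfileCollector) error {
	for _, svc := range services {
		if err := collect(svc); err != nil {
			return fmt.Errorf("collecting profiles for %s: %w", svc, err)
		}
	}
	return nil
}

func main() {
	// A stand-in getter; a real one would query the agent's pprof endpoint.
	var get pprofGetter = func(path string) ([]byte, error) {
		return []byte("profile data for " + path), nil
	}

	err := collectAll([]string{"core", "trace"}, func(service string) error {
		data, err := get("/heap")
		if err != nil {
			return err
		}
		fmt.Printf("%s: %d bytes\n", service, len(data))
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```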
Sorry for the delay on the review. So many open tabs on my side 🤣
Great work 🎉
Add a new flare provider that generates profile data. The provider defaults to disabled. Optionally enable the profiler provider in the RC agent task. Shift existing CLI flare generation to call the profiler directly.
…ection chain to preserve public interface
Refactor various testing structures to minimize their public exposure. Clean up fx modules to make dependencies more explicit.
Static quality checks ✅
Please find below the results from static quality gates.
/merge |
Devflow running:
What does this PR do?
Enables profiling collection for flares generated by way of Remote Config.
To support this, a new profiler component has been added as a flare filler. This component can be optionally disabled in flare runs when the FlareArgs.ProfileDuration value is set to 0, and will transiently set the `runtime_block_profile_rate` and `runtime_mutex_profile_fraction` runtime settings to align with FlareArgs.ProfileBlockingRate and FlareArgs.ProfileMutexFraction respectively.

The core logic of the profiler is an almost-exact mirror of the profiling functionality previously found in the flare CLI subcommand. To minimize shifting architecture, the flare CLI command will directly call this functionality exposed in the profiler rather than utilize the profiler's flareFiller logic.
The RC Agent Task, upon detecting a new `enable_profiling` boolean flag, will populate reasonable default profiling args and make them accessible to the profiler. These defaults can be changed via the undocumented config settings `flare.rc_...`
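Not the PR's actual implementation, but a minimal sketch of the transient-settings behavior described above, assuming hypothetical FlareArgs fields (ProfileDuration, ProfileBlockingRate, ProfileMutexFraction) and using the Go runtime APIs these settings ultimately map onto.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// FlareArgs is a hypothetical stand-in for the profiling arguments described
// above; the field names mirror the description, not the agent's exact types.
type FlareArgs struct {
	ProfileDuration      time.Duration
	ProfileBlockingRate  int
	ProfileMutexFraction int
}

// withTransientProfileSettings applies the blocking/mutex settings for the
// duration of the profiling run, then restores them afterwards.
func withTransientProfileSettings(args FlareArgs, collect func() error) error {
	if args.ProfileDuration == 0 {
		// Profiling is disabled for this flare run.
		return nil
	}

	prevMutex := runtime.SetMutexProfileFraction(args.ProfileMutexFraction)
	runtime.SetBlockProfileRate(args.ProfileBlockingRate)
	defer func() {
		runtime.SetMutexProfileFraction(prevMutex)
		runtime.SetBlockProfileRate(0) // stdlib has no getter, so reset to the default (disabled)
	}()

	return collect()
}

func main() {
	args := FlareArgs{
		ProfileDuration:      30 * time.Second,
		ProfileBlockingRate:  10000,
		ProfileMutexFraction: 100,
	}
	_ = withTransientProfileSettings(args, func() error {
		fmt.Println("collecting profiles for", args.ProfileDuration)
		return nil
	})
}
```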
Motivation
Describe how to test/QA your changes
Existing CLI flare generation saw some underlying architecture changes, but should be completely unchanged from a functional perspective. Running the `datadog-agent flare` CLI command with and without the various profiling options should generate the same output and the same flare contents as before.

Remote Config flare generation can be prompted by navigating to https://{{ddog_site}}/fleet, selecting the host to generate the flare, navigating to support -> send support ticket, and sending the support ticket with debug mode enabled. There is a somewhat decent automatic test spread confirming that the various permutations of RC task args will populate the flare filler with the correct values, but it is important to manually test the `datadog-agent flare -p 30` CLI command as well.

Possible Drawbacks / Trade-offs
Gathering profiling information in the flare can drastically increase generation times. In some system setups, profiling can currently take up to an additional 5 minutes on the default settings.
Additional Notes