Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable requesting flares with profiles via Remote Config #31398

Merged

Conversation

ddrthall
Copy link
Contributor

@ddrthall ddrthall commented Nov 22, 2024

What does this PR do?

Enables profiling collection for flares generated by way of Remote Config.

To support this, a new profiler component has been added as a flare filler. This component can be optionally disabled in flare runs when the FlareArgs.ProfileDuration value is set to 0, and will transiently set the runtime_block_profile_rate and runtime_mutex_profile_fraction runtime settings to align with FlareArgs.ProfileBlockingRate and FlareArgs.ProfileMutexFraction respectively.

The core logic of the profiler is an almost-exact mirror of the profiling functionality previously found in the flare cli subcommand. To minimize shifting architecture, the flare cli command will directly call this functionality exposed in the profiler rather than utilize the profiler's flareFiller logic.

The RC Agent Task, upon detecting a new enable_profiling boolean flag, will populate reasonable default profiling args and make them accessible to the profiler. These defaults can be changed via the undocumented config settings flare.rc_...

Motivation

Describe how to test/QA your changes

Existing CLI flare generation saw some underlying architecture changes, but should be completely unchanged from a functional perspective. Running the datadog-agent flare cli command with and without the various profiling options should generate the same output and the same flare contents as before.

Remote config flare generation can be prompted by navigating to https://{{ddog_site}}/fleet, selecting the host to generate the flare, navigating to support -> send support ticket, and sending the support ticket with debug mode enabled. There is a somewhat decent automatic test spread confirming that the various permutations of RC task args will populate the flare filler with the correct values, but it is important to test that

  1. The flare generation correctly identifies the agents to profile, such as active system-probe components or active process agents.
  2. Flare generation is within the flare provider timeout threshold. Providers that get timed out generate a log of the format "flare provider '%s' skipped after %s". Providers that finish successfully (whether or not the timeout message appeared) show debug logs of the format "flare provider '%s' completed in %s".
  3. RC flares generated with default settings are identical to flares created with the datadog-agent flare -p 30 cli command
  4. Flares are successfully uploaded to the support ticket requested, which will be visible in the RC UI.
  5. Log output during flare generation, both in the agent.log and in the flare_creation.log, is clear and helpful

Possible Drawbacks / Trade-offs

Gathering profiling information in the flare can drastically increase generation times. In some system setups, profiling can currently take up to an additional 5 minutes on the default settings.

Additional Notes

@github-actions github-actions bot added the [deprecated] team/agent-shared-components Deprecated. Use team/agent-configuration or team/agent-runtimes labels instead. label Nov 22, 2024
@ddrthall ddrthall added this to the 7.61.0 milestone Nov 22, 2024
@github-actions github-actions bot added the long review PR is complex, plan time to review it label Nov 22, 2024
Copy link

cit-pr-commenter bot commented Nov 22, 2024

Go Package Import Differences

Baseline: 909942b
Comparison: dcc4d07

binaryosarchchange
agentlinuxamd64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl
agentlinuxarm64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl
agentwindowsamd64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl
agentdarwinamd64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl
agentdarwinarm64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl
iot-agentlinuxamd64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl
iot-agentlinuxarm64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl
heroku-agentlinuxamd64
+3, -0
+github.com/DataDog/datadog-agent/comp/core/profiler/def
+github.com/DataDog/datadog-agent/comp/core/profiler/fx
+github.com/DataDog/datadog-agent/comp/core/profiler/impl

Copy link

cit-pr-commenter bot commented Nov 22, 2024

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 66de2492-c0b7-4b7e-980d-b5e9138c692e

Baseline: 909942b
Comparison: dcc4d07
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
file_to_blackhole_1000ms_latency egress throughput +0.40 [-0.38, +1.17] 1 Logs
file_to_blackhole_500ms_latency egress throughput +0.34 [-0.44, +1.12] 1 Logs
uds_dogstatsd_to_api_cpu % cpu utilization +0.31 [-0.55, +1.17] 1 Logs
quality_gate_logs % cpu utilization +0.19 [-2.87, +3.24] 1 Logs
file_to_blackhole_1000ms_latency_linear_load egress throughput +0.10 [-0.37, +0.57] 1 Logs
file_to_blackhole_100ms_latency egress throughput +0.07 [-0.64, +0.78] 1 Logs
file_tree memory utilization +0.04 [-0.03, +0.10] 1 Logs
file_to_blackhole_0ms_latency egress throughput +0.00 [-0.85, +0.86] 1 Logs
uds_dogstatsd_to_api ingress throughput +0.00 [-0.28, +0.29] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.02, +0.01] 1 Logs
file_to_blackhole_300ms_latency egress throughput -0.02 [-0.66, +0.62] 1 Logs
quality_gate_idle_all_features memory utilization -0.05 [-0.10, +0.01] 1 Logs bounds checks dashboard
file_to_blackhole_0ms_latency_http2 egress throughput -0.06 [-0.92, +0.80] 1 Logs
file_to_blackhole_0ms_latency_http1 egress throughput -0.10 [-0.97, +0.77] 1 Logs
tcp_syslog_to_blackhole ingress throughput -0.23 [-0.30, -0.16] 1 Logs
quality_gate_idle memory utilization -0.72 [-0.77, -0.67] 1 Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed links
file_to_blackhole_0ms_latency lost_bytes 10/10
file_to_blackhole_0ms_latency memory_usage 10/10
file_to_blackhole_0ms_latency_http1 lost_bytes 10/10
file_to_blackhole_0ms_latency_http1 memory_usage 10/10
file_to_blackhole_0ms_latency_http2 lost_bytes 10/10
file_to_blackhole_0ms_latency_http2 memory_usage 10/10
file_to_blackhole_1000ms_latency memory_usage 10/10
file_to_blackhole_1000ms_latency_linear_load memory_usage 10/10
file_to_blackhole_100ms_latency lost_bytes 10/10
file_to_blackhole_100ms_latency memory_usage 10/10
file_to_blackhole_300ms_latency lost_bytes 10/10
file_to_blackhole_300ms_latency memory_usage 10/10
file_to_blackhole_500ms_latency lost_bytes 10/10
file_to_blackhole_500ms_latency memory_usage 10/10
quality_gate_idle intake_connections 10/10 bounds checks dashboard
quality_gate_idle memory_usage 10/10 bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 bounds checks dashboard
quality_gate_logs intake_connections 10/10
quality_gate_logs lost_bytes 10/10
quality_gate_logs memory_usage 10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.

@agent-platform-auto-pr
Copy link
Contributor

agent-platform-auto-pr bot commented Nov 25, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv aws.create-vm --pipeline-id=55285719 --os-family=ubuntu

Note: This applies to commit dcc4d07

@ddrthall ddrthall force-pushed the ryan.hall/enable-requesting-flares-remote-config-2 branch 3 times, most recently from e09362b to f0b9197 Compare November 26, 2024 19:25
@ddrthall ddrthall modified the milestones: 7.61.0, 7.62.0 Dec 2, 2024
@ddrthall ddrthall force-pushed the ryan.hall/enable-requesting-flares-remote-config-2 branch from f0b9197 to 7357f5c Compare December 2, 2024 21:42
@ddrthall ddrthall marked this pull request as ready for review December 3, 2024 14:46
@ddrthall ddrthall requested review from a team as code owners December 3, 2024 14:46
@ddrthall ddrthall requested a review from a team December 3, 2024 14:46
@ddrthall ddrthall requested review from a team as code owners December 3, 2024 14:46
Copy link
Contributor

@vickenty vickenty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some non-blocking comments and questions, but AML files LGTM.

comp/core/flare/flare.go Outdated Show resolved Hide resolved
comp/core/flare/flare.go Outdated Show resolved Hide resolved
comp/core/flare/flare.go Outdated Show resolved Hide resolved
comp/core/flare/flare.go Outdated Show resolved Hide resolved
comp/core/flare/flare.go Outdated Show resolved Hide resolved
Comment on lines 134 to 135
"runtime_mutex_profile_fraction": settings.NewRuntimeMutexProfileFraction(),
"runtime_block_profile_rate": settings.NewRuntimeBlockProfileRate(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, but are you sure we need these runtime settings on the flare binary? Runtime settings implementations control the current process (ex), but I would think that we want to set these settings on the agent process, not the flare command process (even though we have local collection, I think for profiles we're only interested in the agent performance).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - we need the runtime settings injection itself but including the actual profile settings gives the mistaken impression that they're valid to utilize. Will provide a blank settings entity with a comment, and revisit the profile component to see if there's a better way to scaffold command process invoked flare fillers for future support.

@@ -357,7 +357,10 @@ func InitConfig(config pkgconfigmodel.Setup) {

// flare configs
config.BindEnvAndSetDefault("flare_provider_timeout", 10*time.Second)
config.BindEnvAndSetDefault("flare.rc_profiling_runtime", 60*time.Second)
config.BindEnvAndSetDefault("flare.rc_profiling_runtime", 30*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add these to config_template.yaml?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last pr conversation with shared-components suggested that undocumented options were fine for these, but I will revisit now that it's actually being used.

comp/core/profiler/impl/profiler.go Show resolved Hide resolved
@ddrthall ddrthall force-pushed the ryan.hall/enable-requesting-flares-remote-config-2 branch from 7357f5c to 7258533 Compare December 4, 2024 22:19
@ddrthall ddrthall force-pushed the ryan.hall/enable-requesting-flares-remote-config-2 branch from 7258533 to fbc19a0 Compare December 12, 2024 10:09
@agent-platform-auto-pr
Copy link
Contributor

Package size comparison

Comparison with ancestor c96e64710cdd2f6683e54fe07067b4273291eb98

Diff per package
package diff status size ancestor threshold
datadog-agent-amd64-deb 0.02MB ⚠️ 1266.01MB 1265.99MB 140.00MB
datadog-iot-agent-amd64-deb 0.02MB ⚠️ 113.29MB 113.27MB 10.00MB
datadog-dogstatsd-amd64-deb 0.00MB 78.50MB 78.50MB 10.00MB
datadog-heroku-agent-amd64-deb 0.04MB ⚠️ 526.77MB 526.73MB 70.00MB
datadog-agent-x86_64-rpm 0.02MB ⚠️ 1275.25MB 1275.22MB 140.00MB
datadog-agent-x86_64-suse 0.02MB ⚠️ 1275.25MB 1275.22MB 140.00MB
datadog-iot-agent-x86_64-rpm 0.02MB ⚠️ 113.36MB 113.34MB 10.00MB
datadog-iot-agent-x86_64-suse 0.02MB ⚠️ 113.36MB 113.33MB 10.00MB
datadog-dogstatsd-x86_64-rpm 0.00MB ⚠️ 78.58MB 78.58MB 10.00MB
datadog-dogstatsd-x86_64-suse 0.00MB ⚠️ 78.58MB 78.58MB 10.00MB
datadog-agent-arm64-deb 0.02MB ⚠️ 1001.00MB 1000.98MB 140.00MB
datadog-iot-agent-arm64-deb 0.03MB ⚠️ 108.77MB 108.75MB 10.00MB
datadog-dogstatsd-arm64-deb -0.00MB 55.73MB 55.73MB 10.00MB
datadog-agent-aarch64-rpm 0.02MB ⚠️ 1010.22MB 1010.20MB 140.00MB
datadog-iot-agent-aarch64-rpm 0.03MB ⚠️ 108.84MB 108.82MB 10.00MB

Decision

⚠️ Warning

Copy link
Member

@robertjli robertjli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for Processes

@ddrthall ddrthall force-pushed the ryan.hall/enable-requesting-flares-remote-config-2 branch from fbc19a0 to 162bfd9 Compare December 16, 2024 16:00
@ddrthall ddrthall requested a review from a team as a code owner December 16, 2024 16:00
@agent-platform-auto-pr
Copy link
Contributor

agent-platform-auto-pr bot commented Dec 16, 2024

Uncompressed package size comparison

Comparison with ancestor 909942b64fda2a3bdb7f79562115cde9e3cacd94

Diff per package
package diff status size ancestor threshold
datadog-heroku-agent-amd64-deb 0.04MB ⚠️ 445.94MB 445.90MB 0.50MB
datadog-iot-agent-x86_64-rpm 0.03MB ⚠️ 86.47MB 86.44MB 0.50MB
datadog-iot-agent-x86_64-suse 0.03MB ⚠️ 86.46MB 86.44MB 0.50MB
datadog-iot-agent-amd64-deb 0.02MB ⚠️ 86.40MB 86.37MB 0.50MB
datadog-iot-agent-aarch64-rpm 0.02MB ⚠️ 82.73MB 82.71MB 0.50MB
datadog-iot-agent-arm64-deb 0.02MB ⚠️ 82.66MB 82.64MB 0.50MB
datadog-agent-arm64-deb 0.02MB ⚠️ 861.59MB 861.58MB 0.50MB
datadog-agent-aarch64-rpm 0.02MB ⚠️ 871.31MB 871.30MB 0.50MB
datadog-agent-x86_64-rpm 0.02MB ⚠️ 883.46MB 883.44MB 0.50MB
datadog-agent-x86_64-suse 0.02MB ⚠️ 883.46MB 883.44MB 0.50MB
datadog-agent-amd64-deb 0.02MB ⚠️ 873.72MB 873.70MB 0.50MB
datadog-dogstatsd-arm64-deb 0.00MB 56.58MB 56.57MB 0.50MB
datadog-dogstatsd-amd64-deb 0.00MB 59.10MB 59.10MB 0.50MB
datadog-dogstatsd-x86_64-rpm -0.00MB 59.18MB 59.18MB 0.50MB
datadog-dogstatsd-x86_64-suse -0.00MB 59.18MB 59.18MB 0.50MB

Decision

⚠️ Warning

@ddrthall ddrthall removed this from the 7.62.0 milestone Jan 2, 2025
@ddrthall ddrthall requested a review from a team as a code owner January 30, 2025 20:00
@ddrthall ddrthall force-pushed the ryan.hall/enable-requesting-flares-remote-config-2 branch from 8b692c1 to 7c34188 Compare January 31, 2025 17:56
@ddrthall ddrthall requested review from a team as code owners January 31, 2025 17:56
Copy link
Member

@songy23 songy23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approval for OTel

@songy23 songy23 removed the request for review from dinooliva January 31, 2025 18:09
// is to move the profiling component completely into a flare provider, the existing architecture
// expects an explicit and pre-emptive profiling run before the flare logic is properly called.
func (p profiler) ReadProfileData(seconds int, logFunc func(log string, params ...interface{}) error) (flaretypes.ProfileData, error) {
type agentProfileCollector func(service string) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for my own learning. Why create an internal private type? Same for the pprofGetter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real answer is because I didn't want to significantly alter the ReadProfileData functionality during this change, but I also didn't feel any need to touch those types since

  1. I could see someone saying they enhanced the readability of the function and
  2. they have no real impact on functionality.

Copy link
Member

@GustavoCaso GustavoCaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay on the review. So many open tabs on my side 🤣
Great work 🎉

Add a new flare provider that generates profile data. Provider is defaulted to disabled.

Optionally enable profiler provider in the RC agent task

Shift existing cli flare generation to call the profiler directly
Refactor various testing structures to minimize their public exposure
Clean up fx modules to make dependencies more explicit
@ddrthall ddrthall force-pushed the ryan.hall/enable-requesting-flares-remote-config-2 branch from 9bccad9 to dcc4d07 Compare February 7, 2025 18:44
@ddrthall ddrthall requested a review from a team as a code owner February 7, 2025 18:44
@agent-platform-auto-pr
Copy link
Contributor

Static quality checks ✅

Please find below the results from static quality gates

Info

Result Quality gate On disk size On disk size limit On wire size On wire size limit
static_quality_gate_agent_deb_amd64 845.07MiB 858.45MiB 203.58MiB 214.3MiB
static_quality_gate_docker_agent_amd64 929.38MiB 942.69MiB 310.73MiB 321.56MiB

@ddrthall ddrthall added the qa/done QA done before merge and regressions are covered by tests label Feb 12, 2025
@ddrthall
Copy link
Contributor Author

/merge

@dd-devflow
Copy link

dd-devflow bot commented Feb 12, 2025

Devflow running: /merge

View all feedbacks in Devflow UI.


2025-02-12 00:52:08 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 30m.


2025-02-12 01:32:20 UTC ℹ️ MergeQueue: This merge request was merged

@dd-mergequeue dd-mergequeue bot merged commit e007f2d into main Feb 12, 2025
245 of 246 checks passed
@dd-mergequeue dd-mergequeue bot deleted the ryan.hall/enable-requesting-flares-remote-config-2 branch February 12, 2025 01:32
@github-actions github-actions bot modified the milestones: 7.63.0, 7.64.0 Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[deprecated] team/agent-shared-components Deprecated. Use team/agent-configuration or team/agent-runtimes labels instead. long review PR is complex, plan time to review it qa/done QA done before merge and regressions are covered by tests team/agent-configuration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants