fix(datadog_logs sink): abort serialization and split batch when payload is too large #19189
Conversation
/ci-run-regression
Datadog Report: Branch report ✅
Regression Detector Results

Run ID: a4be049c-136a-49c3-945a-872c828d59fe

Explanation

A regression test is an integrated performance test. Because a target's optimization goal performance in each experiment will vary somewhat each time it is run, we can only estimate mean differences in optimization goal relative to the baseline target. We express these differences as a percentage change relative to the baseline target, denoted "Δ mean %". These estimates are made to a precision that balances accuracy and cost control. We represent this precision as a 90.00% confidence interval denoted "Δ mean % CI": there is a 90.00% chance that the true value of "Δ mean %" is in that interval. We decide whether a change in performance is a "regression" -- a change worth investigating further -- if both of the following criteria are true:

- Its 90.00% confidence interval "Δ mean % CI" does not contain zero.
- Its estimated |Δ mean %| ≥ 5.00%.

The table below, if present, lists those experiments that have experienced a statistically significant change in mean optimization goal performance between baseline and comparison SHAs with 90.00% confidence OR have been detected as newly erratic. Negative values of "Δ mean %" mean that the baseline is faster, whereas positive values mean that the comparison is faster. Results that do not exhibit more than a ±5.00% change in their mean optimization goal are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%: none in this run, so the abbreviated table is omitted.
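Reading the description above as a formula (a paraphrase of this report's wording, not the detector's own documentation), the reported change for each experiment is the mean optimization-goal value of the comparison target expressed relative to the baseline target:

$$
\Delta\,\text{mean}\,\% \;=\; \frac{\bar{x}_{\text{comparison}} - \bar{x}_{\text{baseline}}}{\bar{x}_{\text{baseline}}} \times 100
$$

where $\bar{x}$ is the mean of the optimization goal (e.g., ingress throughput) measured for each variant, and "Δ mean % CI" is the 90.00% confidence interval around that estimate.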
Interesting that there was a significant improvement here.
There is a small optimization here, where we use the already-calculated estimated JSON size of the batch to size the buffer we're about to serialize into. Previously we started with an empty buffer that had to grow (and reallocate) as events were serialized into it.
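Concretely, the idea amounts to something like the following (a minimal sketch, not the actual sink code; `estimated_json_size` stands in for the batch's already-computed estimated JSON size):

```rust
use serde::Serialize;

/// Serialize a batch of events into a single JSON payload.
/// Pre-sizing the buffer with the batch's estimated JSON size avoids the
/// repeated reallocations that an initially empty buffer would incur.
fn serialize_batch<T: Serialize>(
    events: &[T],
    estimated_json_size: usize,
) -> serde_json::Result<Vec<u8>> {
    // Previously: `let mut buf = Vec::new();`, which starts empty and grows as needed.
    let mut buf = Vec::with_capacity(estimated_json_size);
    serde_json::to_writer(&mut buf, events)?;
    Ok(buf)
}
```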
LGTM
Datadog Report: Branch report ✅
Datadog Report: Branch report ✅ (0 Failed, 2092 Passed, 0 Skipped, 1m 23.67s Wall Time)
Left a suggestion for simplifying the approach. Let me know what you think.
/ci-run-regression
Regression Detector Results

Run ID: 12eeed53-6b28-4783-aaeb-74dc7a332129

Performance changes are noted in the perf column of each table:
No significant changes in experiment optimization goals

Confidence level: 90.00%. There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
perf | experiment | goal | Δ mean % | Δ mean % CI |
---|---|---|---|---|
➖ | datadog_agent_remap_datadog_logs_acks | ingress throughput | +2.99 | [+2.90, +3.08] |
➖ | syslog_log2metric_humio_metrics | ingress throughput | +2.58 | [+2.38, +2.78] |
➖ | datadog_agent_remap_datadog_logs | ingress throughput | +1.99 | [+1.88, +2.09] |
➖ | otlp_http_to_blackhole | ingress throughput | +1.63 | [+1.47, +1.79] |
➖ | otlp_grpc_to_blackhole | ingress throughput | +1.53 | [+1.43, +1.63] |
➖ | splunk_hec_route_s3 | ingress throughput | +0.79 | [+0.26, +1.32] |
➖ | datadog_agent_remap_blackhole | ingress throughput | +0.74 | [+0.62, +0.85] |
➖ | syslog_regex_logs2metric_ddmetrics | ingress throughput | +0.68 | [+0.55, +0.81] |
➖ | http_text_to_http_json | ingress throughput | +0.56 | [+0.44, +0.68] |
➖ | syslog_splunk_hec_logs | ingress throughput | +0.51 | [+0.43, +0.59] |
➖ | fluent_elasticsearch | ingress throughput | +0.29 | [-0.19, +0.77] |
➖ | syslog_log2metric_tag_cardinality_limit_blackhole | ingress throughput | +0.19 | [+0.06, +0.32] |
➖ | http_to_http_noack | ingress throughput | +0.12 | [+0.02, +0.21] |
➖ | http_to_http_json | ingress throughput | +0.05 | [-0.03, +0.12] |
➖ | splunk_hec_to_splunk_hec_logs_acks | ingress throughput | +0.00 | [-0.14, +0.14] |
➖ | splunk_hec_indexer_ack_blackhole | ingress throughput | -0.00 | [-0.14, +0.14] |
➖ | socket_to_socket_blackhole | ingress throughput | -0.01 | [-0.09, +0.08] |
➖ | http_to_s3 | ingress throughput | -0.03 | [-0.31, +0.24] |
➖ | splunk_hec_to_splunk_hec_logs_noack | ingress throughput | -0.05 | [-0.17, +0.06] |
➖ | enterprise_http_to_http | ingress throughput | -0.08 | [-0.15, -0.01] |
➖ | datadog_agent_remap_blackhole_acks | ingress throughput | -0.23 | [-0.33, -0.13] |
➖ | http_to_http_acks | ingress throughput | -0.51 | [-1.82, +0.80] |
➖ | syslog_log2metric_splunk_hec_metrics | ingress throughput | -0.62 | [-0.77, -0.48] |
➖ | http_elasticsearch | ingress throughput | -1.08 | [-1.15, -1.01] |
➖ | syslog_loki | ingress throughput | -1.68 | [-1.74, -1.62] |
➖ | file_to_blackhole | egress throughput | -2.21 | [-4.72, +0.29] |
➖ | syslog_humio_logs | ingress throughput | -2.55 | [-2.64, -2.46] |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
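Putting those criteria together, the decision rule looks roughly like this (an illustrative sketch only; the struct and field names are hypothetical, and the erratic check uses the coefficient-of-variation threshold of 0.1 described in the report above):

```rust
/// Summary statistics for one experiment (hypothetical shape, for illustration).
struct ExperimentResult {
    delta_mean_pct: f64,          // "Δ mean %"
    ci_low_pct: f64,              // lower bound of the 90.00% "Δ mean % CI"
    ci_high_pct: f64,             // upper bound of the 90.00% "Δ mean % CI"
    coefficient_of_variation: f64,
}

impl ExperimentResult {
    /// An experiment is "erratic" if its coefficient of variation is greater than 0.1.
    fn is_erratic(&self) -> bool {
        self.coefficient_of_variation > 0.1
    }

    /// A change is flagged as worth investigating only if all three criteria hold.
    fn is_significant_change(&self) -> bool {
        let big_enough = self.delta_mean_pct.abs() >= 5.0;
        let ci_excludes_zero = !(self.ci_low_pct <= 0.0 && self.ci_high_pct >= 0.0);
        big_enough && ci_excludes_zero && !self.is_erratic()
    }
}

fn main() {
    // e.g. datadog_agent_remap_datadog_logs_acks from the table above:
    // Δ mean % = +2.99, CI = [+2.90, +3.08] -- the CI excludes zero, but |Δ| < 5%.
    let result = ExperimentResult {
        delta_mean_pct: 2.99,
        ci_low_pct: 2.90,
        ci_high_pct: 3.08,
        coefficient_of_variation: 0.05, // assumed; not reported in the table
    };
    assert!(!result.is_significant_change());
}
```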
/ci-run-regression
Regression Detector Results

Run ID: 35b4afb5-7c45-48f3-b303-b294dbcf04f0

Performance changes are noted in the perf column of each table:
Significant changes in experiment optimization goals

Confidence level: 90.00%
perf | experiment | goal | Δ mean % | Δ mean % CI |
---|---|---|---|---|
✅ | datadog_agent_remap_datadog_logs_acks | ingress throughput | +8.57 | [+8.48, +8.66] |
✅ | datadog_agent_remap_datadog_logs | ingress throughput | +8.32 | [+8.21, +8.43] |
➖ | otlp_http_to_blackhole | ingress throughput | +3.16 | [+2.99, +3.33] |
➖ | http_to_http_acks | ingress throughput | +2.01 | [+0.69, +3.33] |
➖ | datadog_agent_remap_blackhole | ingress throughput | +0.82 | [+0.72, +0.93] |
➖ | syslog_humio_logs | ingress throughput | +0.81 | [+0.71, +0.90] |
➖ | syslog_loki | ingress throughput | +0.65 | [+0.61, +0.69] |
➖ | syslog_log2metric_humio_metrics | ingress throughput | +0.16 | [+0.01, +0.30] |
➖ | http_to_http_noack | ingress throughput | +0.14 | [+0.06, +0.23] |
➖ | datadog_agent_remap_blackhole_acks | ingress throughput | +0.09 | [-0.02, +0.19] |
➖ | file_to_blackhole | egress throughput | +0.08 | [-2.53, +2.69] |
➖ | fluent_elasticsearch | ingress throughput | +0.06 | [-0.42, +0.54] |
➖ | http_to_http_json | ingress throughput | +0.04 | [-0.04, +0.11] |
➖ | splunk_hec_indexer_ack_blackhole | ingress throughput | +0.00 | [-0.14, +0.15] |
➖ | splunk_hec_to_splunk_hec_logs_acks | ingress throughput | +0.00 | [-0.14, +0.14] |
➖ | http_text_to_http_json | ingress throughput | -0.02 | [-0.16, +0.12] |
➖ | http_to_s3 | ingress throughput | -0.05 | [-0.33, +0.23] |
➖ | splunk_hec_to_splunk_hec_logs_noack | ingress throughput | -0.06 | [-0.17, +0.06] |
➖ | enterprise_http_to_http | ingress throughput | -0.13 | [-0.21, -0.06] |
➖ | splunk_hec_route_s3 | ingress throughput | -0.32 | [-0.84, +0.19] |
➖ | syslog_log2metric_tag_cardinality_limit_blackhole | ingress throughput | -0.55 | [-0.66, -0.43] |
➖ | syslog_splunk_hec_logs | ingress throughput | -0.84 | [-0.91, -0.77] |
➖ | syslog_regex_logs2metric_ddmetrics | ingress throughput | -0.90 | [-1.00, -0.80] |
➖ | otlp_grpc_to_blackhole | ingress throughput | -0.96 | [-1.06, -0.86] |
➖ | syslog_log2metric_splunk_hec_metrics | ingress throughput | -1.07 | [-1.21, -0.93] |
➖ | http_elasticsearch | ingress throughput | -1.81 | [-1.88, -1.75] |
➖ | socket_to_socket_blackhole | ingress throughput | -3.64 | [-3.72, -3.56] |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
Regression Detector Results

Run ID: 7ac270d0-02c7-4254-9e56-617405b89a0b

Performance changes are noted in the perf column of each table:
Significant changes in experiment optimization goals

Confidence level: 90.00%
perf | experiment | goal | Δ mean % | Δ mean % CI |
---|---|---|---|---|
✅ | datadog_agent_remap_datadog_logs_acks | ingress throughput | +7.12 | [+7.03, +7.21] |
✅ | datadog_agent_remap_datadog_logs | ingress throughput | +7.04 | [+6.95, +7.13] |
➖ | syslog_log2metric_splunk_hec_metrics | ingress throughput | +0.76 | [+0.60, +0.91] |
➖ | syslog_log2metric_humio_metrics | ingress throughput | +0.56 | [+0.45, +0.66] |
➖ | http_to_s3 | ingress throughput | +0.26 | [-0.02, +0.54] |
➖ | otlp_grpc_to_blackhole | ingress throughput | +0.20 | [+0.11, +0.29] |
➖ | http_to_http_noack | ingress throughput | +0.14 | [+0.06, +0.23] |
➖ | syslog_humio_logs | ingress throughput | +0.14 | [+0.03, +0.25] |
➖ | http_to_http_json | ingress throughput | +0.02 | [-0.06, +0.09] |
➖ | syslog_splunk_hec_logs | ingress throughput | +0.02 | [-0.05, +0.09] |
➖ | splunk_hec_indexer_ack_blackhole | ingress throughput | +0.00 | [-0.14, +0.14] |
➖ | splunk_hec_to_splunk_hec_logs_acks | ingress throughput | -0.00 | [-0.16, +0.15] |
➖ | http_text_to_http_json | ingress throughput | -0.03 | [-0.17, +0.11] |
➖ | splunk_hec_to_splunk_hec_logs_noack | ingress throughput | -0.07 | [-0.19, +0.04] |
➖ | enterprise_http_to_http | ingress throughput | -0.08 | [-0.16, -0.01] |
➖ | syslog_log2metric_tag_cardinality_limit_blackhole | ingress throughput | -0.10 | [-0.21, +0.00] |
➖ | http_elasticsearch | ingress throughput | -0.16 | [-0.23, -0.09] |
➖ | fluent_elasticsearch | ingress throughput | -0.19 | [-0.65, +0.28] |
➖ | datadog_agent_remap_blackhole | ingress throughput | -0.32 | [-0.44, -0.21] |
➖ | otlp_http_to_blackhole | ingress throughput | -0.35 | [-0.51, -0.19] |
➖ | datadog_agent_remap_blackhole_acks | ingress throughput | -0.57 | [-0.67, -0.48] |
➖ | syslog_loki | ingress throughput | -0.82 | [-0.88, -0.76] |
➖ | http_to_http_acks | ingress throughput | -1.09 | [-2.39, +0.21] |
➖ | file_to_blackhole | egress throughput | -1.36 | [-3.79, +1.08] |
➖ | splunk_hec_route_s3 | ingress throughput | -1.63 | [-2.11, -1.14] |
➖ | syslog_regex_logs2metric_ddmetrics | ingress throughput | -2.30 | [-2.42, -2.19] |
➖ | socket_to_socket_blackhole | ingress throughput | -2.75 | [-2.82, -2.67] |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
Closes #9202.

Since #19189 was merged, this heuristic to try to avoid oversized requests is no longer necessary from a correctness point of view. The only potential reason to keep it would be if we expected oversized batches to be common, which could mean a performance impact if the new batch-splitting code is triggered more often to avoid oversized requests. Another option would be to simply reduce the buffer we leave ourselves between the goal and the max, but any analysis of the best value would be entirely dependent on the format of the event data.
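For context, the behavior from #19189 that makes the heuristic unnecessary works roughly as follows. This is a simplified sketch, not the actual `src/sinks/datadog/logs/sink.rs` code: it checks the encoded size after serializing the whole batch, whereas the real change aborts serialization as soon as the limit is exceeded, and `MAX_PAYLOAD_BYTES` is a placeholder value.

```rust
// Placeholder limit for illustration; not the sink's actual configured maximum.
const MAX_PAYLOAD_BYTES: usize = 5_000_000;

/// Encode a batch into one or more payloads that each fit under the size limit.
fn encode_with_split<T: serde::Serialize>(events: Vec<T>) -> Vec<Vec<u8>> {
    let mut payloads = Vec::new();
    encode_recursive(events, &mut payloads);
    payloads
}

fn encode_recursive<T: serde::Serialize>(events: Vec<T>, payloads: &mut Vec<Vec<u8>>) {
    if events.is_empty() {
        return;
    }
    match serde_json::to_vec(&events) {
        // The whole batch fits: keep it as a single payload.
        Ok(buf) if buf.len() <= MAX_PAYLOAD_BYTES => payloads.push(buf),
        // A single event that is itself too large cannot be split further;
        // the sink drops it and emits a "dropped event" (here we simply skip it).
        Ok(_) if events.len() == 1 => {}
        // Too large: abandon this payload, split the batch in half, retry each half.
        Ok(_) => {
            let mut left = events;
            let right = left.split_off(left.len() / 2);
            encode_recursive(left, payloads);
            encode_recursive(right, payloads);
        }
        // Serialization errors are out of scope for this sketch.
        Err(_) => {}
    }
}
```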
…oad is too large (vectordotdev#19189)

* add failing test
* abort and split batch serialization when too large
* clippy
* Update src/sinks/datadog/logs/sink.rs (Co-authored-by: Doug Smith <[email protected]>)
* do not double count byte size when splitting
* emit dropped event
* add changelog entry
* Update changelog.d/OPW-86.fix.md (Co-authored-by: neuronull <[email protected]>)
* rename changelog fragment
* use dougs idea
* remove unnecessary clone

---------

Signed-off-by: Luke Steensen <[email protected]>
Co-authored-by: Doug Smith <[email protected]>
Co-authored-by: neuronull <[email protected]>
Ref #10020