[Security Solution] Extend Detection Engine Health API with top N rules by metrics #181169

maximpn · 2024-04-18T14:59:18Z

Relates to: #125642

Summary

This PR extends the Detection Engine Health API by adding the top N rules (10 by default) for each of several metrics, such as execution duration or schedule delay.

Details

This PR is part of my OnWeek! project, which investigates whether LLMs (for example, ChatGPT provided by OpenAI) can perform automatic rule monitoring by summarising problems in Detection Engine Health API responses and giving users instructions and advice on how to solve them.

Extending the Detection Engine Health API with top N rules is beneficial on its own, since it makes it easy to spot problematic rules and investigate them further manually. It could be very helpful when working on SDHs.

The following API endpoints were extended:

Cluster health API endpoint /internal/detection_engine/health/_cluster
Space health API endpoint /internal/detection_engine/health/_space

The number of top rules returned is controlled by the num_of_top_rules body param. The default value is 10.
This param can only be set in an HTTP POST request (the same applies to interval). When an HTTP GET request is used, at most 10 top rules are returned for each metric.
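As a rough illustration of the POST variant, the sketch below builds (but does not send) a request that sets num_of_top_rules. Only the endpoint path and the num_of_top_rules body param come from this PR. The helper name `buildClusterHealthRequest`, the positive-integer check, and the `kbn-xsrf`, `x-elastic-internal-origin` and `elastic-api-version` headers are assumptions about a typical Kibana internal API call. Authentication is omitted.

```typescript
// Hypothetical helper, not part of this PR: describes a POST request to the
// cluster health endpoint without sending it.
interface HealthRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

function buildClusterHealthRequest(kibanaBaseUrl: string, numOfTopRules: number): HealthRequest {
  // Assumption of this sketch: only positive integers make sense here.
  if (!Number.isInteger(numOfTopRules) || numOfTopRules < 1) {
    throw new RangeError('numOfTopRules must be a positive integer');
  }
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${kibanaBaseUrl.replace(/\/+$/, '')}/internal/detection_engine/health/_cluster`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'kbn-xsrf': 'true', // assumption: usual Kibana CSRF header
        'x-elastic-internal-origin': 'Kibana', // assumption: may be needed for internal routes
        'elastic-api-version': '1', // assumption: route version
      },
      // num_of_top_rules is the body param introduced by this PR.
      body: JSON.stringify({ num_of_top_rules: numOfTopRules }),
    },
  };
}

// Example: ask for the top 5 rules per metric on a local Kibana instance.
const request = buildClusterHealthRequest('http://localhost:5601', 5);
// The request could then be sent with: fetch(request.url, request.init)
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing it at a real instance.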

The following metrics were added, each with its own list of top N rules (all measured in milliseconds):

Execution duration
Schedule delay
Search duration
Indexing duration
Enrichment duration

The following response parts were extended by adding a section under the top_rules key:

Health stats over the specified "health interval" (stats_over_interval)
Health history over the specified "health interval" (history_over_interval)

Response example

Cluster health response (truncated)

{
    "timings": {
        "requested_at": "2024-04-18T14:55:43.075Z",
        "processed_at": "2024-04-18T14:55:44.027Z",
        "processing_time_ms": 952
    },
    ...
    "health": {
        ...
        "stats_over_interval": {
            "top_rules": {
                "by_execution_duration_ms": [
                    {
                        "id": "4a47dcad-08a7-4ef7-89ae-0a0c8a8efdc5",
                        "name": "Cobalt Strike Command and Control Beacon",
                        "category": "siem.queryRule",
                        "percentiles": {
                            "50.0": 303,
                            "95.0": 241134.59999999974,
                            "99.0": 323691.71999999986,
                            "99.9": 342267.0720000002
                        }
                    },
                    ...
                ],
                "by_schedule_delay_ms": [
                    {
                        "id": "4cf4a486-cfae-49fc-968b-5f60ea84c228",
                        "name": "Machine Learning Detected DGA activity using a known SUNBURST DNS domain",
                        "category": "siem.queryRule",
                        "percentiles": {
                            "50.0": 25265,
                            "95.0": 741284.1999999995,
                            "99.0": 881731.2399999998,
                            "99.9": 913331.8240000004
                        }
                    },
                    ...
                ],
                "by_search_duration_ms": [
                    {
                        "id": "e6447ef8-fdb9-41f6-9ddb-2f60213f6c15",
                        "name": "Agent Spoofing - Multiple Hosts Using Same Agent",
                        "category": "siem.thresholdRule",
                        "percentiles": {
                            "50.0": 11,
                            "95.0": 30.599999999999994,
                            "99.0": 32.519999999999996,
                            "99.9": 32.952000000000005
                        }
                    },
                    ...
                ],
                "by_indexing_duration_ms": [
                    {
                        "id": "00064efe-c3a3-449d-b1e4-db2fe1263a55",
                        "name": "Suspicious ScreenConnect Client Child Process",
                        "category": "siem.eqlRule",
                        "percentiles": {
                            "50.0": 0,
                            "95.0": 0,
                            "99.0": 0,
                            "99.9": 0
                        }
                    },
                    ...
                ],
                "by_enrichment_duration_ms": [
                    {
                        "id": "00064efe-c3a3-449d-b1e4-db2fe1263a55",
                        "name": "Suspicious ScreenConnect Client Child Process",
                        "category": "siem.eqlRule",
                        "percentiles": {
                            "50.0": 0,
                            "95.0": 0,
                            "99.0": 0,
                            "99.9": 0
                        }
                    },
                    ...
                ]
            },
            ...
        },
        "history_over_interval": {
            "buckets": [
                {
                    "timestamp": "2024-04-18T13:00:00.000Z",
                    "stats": {
                        "top_rules": {
                            "by_execution_duration_ms": [
                                {
                                    "id": "1cccaa08-2d35-40bf-a2c3-99b41de7e6b7",
                                    "name": "Container Workload Protection",
                                    "category": "siem.queryRule",
                                    "percentiles": {
                                        "50.0": 2595,
                                        "95.0": 2595,
                                        "99.0": 2595,
                                        "99.9": 2595
                                    }
                                },
                                ...
                            ],
                            "by_schedule_delay_ms": [
                                {
                                    "id": "db627248-3395-4bc1-85d6-dae20401fc49",
                                    "name": "Endpoint Security",
                                    "category": "siem.queryRule",
                                    "percentiles": {
                                        "50.0": 1664,
                                        "95.0": 1664,
                                        "99.0": 1664,
                                        "99.9": 1664
                                    }
                                },
                                ...
                            ],
                            ...
                        }
                    }
                },
                ...
            ]
        }
    }
}
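For consumers of this response, the top_rules section can be modelled roughly as follows. This is a sketch inferred only from the example above, not the types used in the PR. The names `TopRule`, `TopRules`, `TopRulesKey` and `slowestRule` are hypothetical. The sample list combines two entries from the example purely for illustration.

```typescript
// Percentile keys as they appear in the example response.
type PercentileKey = '50.0' | '95.0' | '99.0' | '99.9';

// One entry of a top rules list (shape inferred from the example).
interface TopRule {
  id: string;
  name: string;
  category: string;
  percentiles: Record<PercentileKey, number>;
}

// Keys under top_rules, one per metric added by this PR.
type TopRulesKey =
  | 'by_execution_duration_ms'
  | 'by_schedule_delay_ms'
  | 'by_search_duration_ms'
  | 'by_indexing_duration_ms'
  | 'by_enrichment_duration_ms';

type TopRules = Partial<Record<TopRulesKey, TopRule[]>>;

// Hypothetical helper: returns the rule with the highest value of the given
// percentile for one metric, or undefined if the list is missing or empty.
function slowestRule(
  topRules: TopRules,
  metric: TopRulesKey,
  percentile: PercentileKey = '95.0'
): TopRule | undefined {
  const rules = topRules[metric] ?? [];
  return rules.reduce<TopRule | undefined>(
    (worst, rule) =>
      (worst === undefined || rule.percentiles[percentile] > worst.percentiles[percentile])
        ? rule
        : worst,
    undefined
  );
}

// Sample values copied from the example response above.
const sample: TopRules = {
  by_execution_duration_ms: [
    {
      id: '4a47dcad-08a7-4ef7-89ae-0a0c8a8efdc5',
      name: 'Cobalt Strike Command and Control Beacon',
      category: 'siem.queryRule',
      percentiles: { '50.0': 303, '95.0': 241134.59999999974, '99.0': 323691.71999999986, '99.9': 342267.0720000002 },
    },
    {
      id: '1cccaa08-2d35-40bf-a2c3-99b41de7e6b7',
      name: 'Container Workload Protection',
      category: 'siem.queryRule',
      percentiles: { '50.0': 2595, '95.0': 2595, '99.0': 2595, '99.9': 2595 },
    },
  ],
};

const worstByP95 = slowestRule(sample, 'by_execution_duration_ms');
console.log(worstByP95?.name);
```

The same helper can be applied to the stats.top_rules object of each bucket in history_over_interval, since the example shows the same shape there.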

elasticmachine · 2024-04-18T18:46:51Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine · 2024-04-18T18:46:52Z

Pinging @elastic/security-solution (Team: SecuritySolution)

elasticmachine · 2024-04-18T18:46:53Z

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

kibana-ci · 2024-04-19T09:29:01Z

💚 Build Succeeded

Buildkite Build
Commit: 3bc6a79

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

id	before	after	diff
`securitySolution`	5451	5452	+1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`securitySolution`	14.6MB	14.6MB	+208.0B

History

💚 Build #204689 succeeded be946aa468d77f074cdad40501a70112b37c70e4
💔 Build #204668 failed 7637471439c60544124b7cc009cb05cc99e4240d

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @maximpn

maximpn self-assigned this Apr 18, 2024

maximpn force-pushed the add-top-n-rules-to-health-api branch from 7637471 to be946aa Compare April 18, 2024 16:02

maximpn marked this pull request as ready for review April 18, 2024 18:46

maximpn requested a review from a team as a code owner April 18, 2024 18:46

maximpn requested a review from jpdjere April 18, 2024 18:46

maximpn requested a review from banderror April 18, 2024 18:46

maximpn added 2 commits April 19, 2024 10:16

extends rules monitoring health api with top N rules by metrics

d60df0a

expose num_of_top_rules param

3bc6a79

maximpn force-pushed the add-top-n-rules-to-health-api branch from 3d1d03a to 3bc6a79 Compare April 19, 2024 08:16

banderror removed the request for review from jpdjere April 19, 2024 13:28

banderror marked this pull request as draft June 25, 2024 09:33

banderror removed the v8.15.0 label Jul 1, 2024
