Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.8] [Security Solution] PoC of the Detection Engine health API (#157155) #159717

Merged
merged 2 commits into from
Jun 15, 2023

Conversation

banderror
Copy link
Contributor

Backport

This will backport the following PR from main to 8.8:

Questions ?

Please refer to the Backport tool documentation

…57155)

**Partially addresses:** https://github.com/elastic/kibana/issues/125642

## Summary

This PR introduces a PoC health API that allows users to get health
overview of the Detection Engine across the whole cluster, or within a
given Kibana space, or for a given rule. It can be useful for
troubleshooting issues with cluster provisioning/scaling, issues with
certain rules failing or generating too much load on the cluster,
identifying common rule execution errors, etc.

In the future, this API might become helpful for building more Rule
Monitoring UIs giving our users more clarity and transparency about the
work of the Detection Engine.

## Rule health endpoint

🚧 NOTE: this endpoint is **partially implemented**. 🚧

```txt
POST /internal/detection_engine/health/_rule
```

Get health overview of a rule. Scope: a given detection rule in the
current Kibana space.
Returns:

- health stats at the moment of the API call (rule and its execution
summary)
- health stats over a specified period of time ("health interval")
- health stats history within the same interval in the form of a
histogram
(the same stats are calculated over each of the discreet sub-intervals
of the whole interval)

Minimal required parameters:

```json
{
  "rule_id": "d4beff10-f045-11ed-89d8-3b6931af10bc"
}
```

<details><summary>Response:</summary>
<p>

```json
{
  "timings": {
    "requested_at": "2023-05-26T16:09:54.128Z",
    "processed_at": "2023-05-26T16:09:54.778Z",
    "processing_time_ms": 650
  },
  "parameters": {
    "interval": {
      "type": "last_day",
      "granularity": "hour",
      "from": "2023-05-25T16:09:54.128Z",
      "to": "2023-05-26T16:09:54.128Z",
      "duration": "PT24H"
    },
    "rule_id": "d4beff10-f045-11ed-89d8-3b6931af10bc"
  },
  "health": {
    "stats_at_the_moment": {
      "rule": {
        "id": "d4beff10-f045-11ed-89d8-3b6931af10bc",
        "updated_at": "2023-05-26T15:44:21.689Z",
        "updated_by": "elastic",
        "created_at": "2023-05-11T21:50:23.830Z",
        "created_by": "elastic",
        "name": "Test rule",
        "tags": ["foo"],
        "interval": "1m",
        "enabled": true,
        "revision": 2,
        "description": "-",
        "risk_score": 21,
        "severity": "low",
        "license": "",
        "output_index": "",
        "meta": {
          "from": "6h",
          "kibana_siem_app_url": "http://localhost:5601/kbn/app/security"
        },
        "author": [],
        "false_positives": [],
        "from": "now-21660s",
        "rule_id": "e46eaaf3-6d81-4cdb-8cbb-b2201a11358b",
        "max_signals": 100,
        "risk_score_mapping": [],
        "severity_mapping": [],
        "threat": [],
        "to": "now",
        "references": [],
        "version": 3,
        "exceptions_list": [],
        "immutable": false,
        "related_integrations": [],
        "required_fields": [],
        "setup": "",
        "type": "query",
        "language": "kuery",
        "index": [
          "apm-*-transaction*",
          "auditbeat-*",
          "endgame-*",
          "filebeat-*",
          "logs-*",
          "packetbeat-*",
          "traces-apm*",
          "winlogbeat-*",
          "-*elastic-cloud-logs-*",
          "foo-*"
        ],
        "query": "*",
        "filters": [],
        "actions": [
          {
            "group": "default",
            "id": "bd59c4e0-f045-11ed-89d8-3b6931af10bc",
            "params": {
              "body": "Hello world"
            },
            "action_type_id": ".webhook",
            "uuid": "f8b87eb0-58bb-4d4b-a584-084d44ab847e",
            "frequency": {
              "summary": true,
              "throttle": null,
              "notifyWhen": "onActiveAlert"
            }
          }
        ],
        "execution_summary": {
          "last_execution": {
            "date": "2023-05-26T16:09:36.848Z",
            "status": "succeeded",
            "status_order": 0,
            "message": "Rule execution completed successfully",
            "metrics": {
              "total_search_duration_ms": 2,
              "execution_gap_duration_s": 80395
            }
          }
        }
      }
    },
    "stats_over_interval": {
      "number_of_executions": {
        "total": 21,
        "by_outcome": {
          "succeeded": 20,
          "warning": 0,
          "failed": 1
        }
      },
      "number_of_logged_messages": {
        "total": 42,
        "by_level": {
          "error": 1,
          "warn": 0,
          "info": 41,
          "debug": 0,
          "trace": 0
        }
      },
      "number_of_detected_gaps": {
        "total": 1,
        "total_duration_s": 80395
      },
      "schedule_delay_ms": {
        "percentiles": {
          "1.0": 3061,
          "5.0": 3083,
          "25.0": 3112,
          "50.0": 6049,
          "75.0": 6069.5,
          "95.0": 100093.79999999986,
          "99.0": 207687
        }
      },
      "execution_duration_ms": {
        "percentiles": {
          "1.0": 226,
          "5.0": 228.2,
          "25.0": 355.5,
          "50.0": 422,
          "75.0": 447,
          "95.0": 677.75,
          "99.0": 719
        }
      },
      "search_duration_ms": {
        "percentiles": {
          "1.0": 0,
          "5.0": 1.1,
          "25.0": 2.75,
          "50.0": 7,
          "75.0": 13.5,
          "95.0": 29.59999999999998,
          "99.0": 45
        }
      },
      "indexing_duration_ms": {
        "percentiles": {
          "1.0": 0,
          "5.0": 0,
          "25.0": 0,
          "50.0": 0,
          "75.0": 0,
          "95.0": 0,
          "99.0": 0
        }
      },
      "top_errors": [
        {
          "count": 1,
          "message": "day were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances"
        }
      ],
      "top_warnings": []
    },
    "history_over_interval": {
      "buckets": [
        {
          "timestamp": "2023-05-26T15:00:00.000Z",
          "stats": {
            "number_of_executions": {
              "total": 12,
              "by_outcome": {
                "succeeded": 11,
                "warning": 0,
                "failed": 1
              }
            },
            "number_of_logged_messages": {
              "total": 24,
              "by_level": {
                "error": 1,
                "warn": 0,
                "info": 23,
                "debug": 0,
                "trace": 0
              }
            },
            "number_of_detected_gaps": {
              "total": 1,
              "total_duration_s": 80395
            },
            "schedule_delay_ms": {
              "percentiles": {
                "1.0": 3106,
                "5.0": 3106.8,
                "25.0": 3124.5,
                "50.0": 6067.5,
                "75.0": 9060.5,
                "95.0": 188124.59999999971,
                "99.0": 207687
              }
            },
            "execution_duration_ms": {
              "percentiles": {
                "1.0": 230,
                "5.0": 236.2,
                "25.0": 354,
                "50.0": 405,
                "75.0": 447.5,
                "95.0": 563.3999999999999,
                "99.0": 576
              }
            },
            "search_duration_ms": {
              "percentiles": {
                "1.0": 0,
                "5.0": 0.20000000000000018,
                "25.0": 2.5,
                "50.0": 5,
                "75.0": 14,
                "95.0": 42.19999999999996,
                "99.0": 45
              }
            },
            "indexing_duration_ms": {
              "percentiles": {
                "1.0": 0,
                "5.0": 0,
                "25.0": 0,
                "50.0": 0,
                "75.0": 0,
                "95.0": 0,
                "99.0": 0
              }
            }
          }
        },
        {
          "timestamp": "2023-05-26T16:00:00.000Z",
          "stats": {
            "number_of_executions": {
              "total": 9,
              "by_outcome": {
                "succeeded": 9,
                "warning": 0,
                "failed": 0
              }
            },
            "number_of_logged_messages": {
              "total": 18,
              "by_level": {
                "error": 0,
                "warn": 0,
                "info": 18,
                "debug": 0,
                "trace": 0
              }
            },
            "number_of_detected_gaps": {
              "total": 0,
              "total_duration_s": 0
            },
            "schedule_delay_ms": {
              "percentiles": {
                "1.0": 3061,
                "5.0": 3061,
                "25.0": 3104.75,
                "50.0": 3115,
                "75.0": 6053,
                "95.0": 6068,
                "99.0": 6068
              }
            },
            "execution_duration_ms": {
              "percentiles": {
                "1.0": 226.00000000000003,
                "5.0": 226,
                "25.0": 356,
                "50.0": 436,
                "75.0": 495.5,
                "95.0": 719,
                "99.0": 719
              }
            },
            "search_duration_ms": {
              "percentiles": {
                "1.0": 2,
                "5.0": 2,
                "25.0": 5.75,
                "50.0": 8,
                "75.0": 13.75,
                "95.0": 17,
                "99.0": 17
              }
            },
            "indexing_duration_ms": {
              "percentiles": {
                "1.0": 0,
                "5.0": 0,
                "25.0": 0,
                "50.0": 0,
                "75.0": 0,
                "95.0": 0,
                "99.0": 0
              }
            }
          }
        }
      ]
    }
  }
}
```
</p>
</details>

## Space health endpoint

🚧 NOTE: this endpoint is **partially implemented**. 🚧

```txt
POST /internal/detection_engine/health/_space
```

Get health overview of the current Kibana space. Scope: all detection
rules in the space.
Returns:

- health stats at the moment of the API call
- health stats over a specified period of time ("health interval")
- health stats history within the same interval in the form of a
histogram
(the same stats are calculated over each of the discreet sub-intervals
of the whole interval)

Minimal required parameters: empty object.

```json
{}
```

<details><summary>Response:</summary>
<p>

```json
{
  "timings": {
    "requested_at": "2023-05-26T16:24:21.628Z",
    "processed_at": "2023-05-26T16:24:22.880Z",
    "processing_time_ms": 1252
  },
  "parameters": {
    "interval": {
      "type": "last_day",
      "granularity": "hour",
      "from": "2023-05-25T16:24:21.628Z",
      "to": "2023-05-26T16:24:21.628Z",
      "duration": "PT24H"
    }
  },
  "health": {
    "stats_at_the_moment": {
      "number_of_rules": {
        "all": {
          "total": 777,
          "enabled": 777,
          "disabled": 0
        },
        "by_origin": {
          "prebuilt": {
            "total": 776,
            "enabled": 776,
            "disabled": 0
          },
          "custom": {
            "total": 1,
            "enabled": 1,
            "disabled": 0
          }
        },
        "by_type": {
          "siem.eqlRule": {
            "total": 381,
            "enabled": 381,
            "disabled": 0
          },
          "siem.queryRule": {
            "total": 325,
            "enabled": 325,
            "disabled": 0
          },
          "siem.mlRule": {
            "total": 47,
            "enabled": 47,
            "disabled": 0
          },
          "siem.thresholdRule": {
            "total": 18,
            "enabled": 18,
            "disabled": 0
          },
          "siem.newTermsRule": {
            "total": 4,
            "enabled": 4,
            "disabled": 0
          },
          "siem.indicatorRule": {
            "total": 2,
            "enabled": 2,
            "disabled": 0
          }
        },
        "by_outcome": {
          "warning": {
            "total": 307,
            "enabled": 307,
            "disabled": 0
          },
          "succeeded": {
            "total": 266,
            "enabled": 266,
            "disabled": 0
          },
          "failed": {
            "total": 204,
            "enabled": 204,
            "disabled": 0
          }
        }
      }
    },
    "stats_over_interval": {
      "number_of_executions": {
        "total": 5622,
        "by_outcome": {
          "succeeded": 1882,
          "warning": 2129,
          "failed": 2120
        }
      },
      "number_of_logged_messages": {
        "total": 11756,
        "by_level": {
          "error": 2120,
          "warn": 2129,
          "info": 7507,
          "debug": 0,
          "trace": 0
        }
      },
      "number_of_detected_gaps": {
        "total": 777,
        "total_duration_s": 514415894
      },
      "schedule_delay_ms": {
        "percentiles": {
          "1.0": 216,
          "5.0": 3048.5,
          "25.0": 3105,
          "50.0": 3129,
          "75.0": 6112.355119825708,
          "95.0": 134006,
          "99.0": 195578
        }
      },
      "execution_duration_ms": {
        "percentiles": {
          "1.0": 275,
          "5.0": 323.375,
          "25.0": 370.80555555555554,
          "50.0": 413.1122337092731,
          "75.0": 502.25233127864715,
          "95.0": 685.8055555555555,
          "99.0": 1194.75
        }
      },
      "search_duration_ms": {
        "percentiles": {
          "1.0": 0,
          "5.0": 0,
          "25.0": 0,
          "50.0": 0,
          "75.0": 15,
          "95.0": 30,
          "99.0": 99.44000000000005
        }
      },
      "indexing_duration_ms": {
        "percentiles": {
          "1.0": 0,
          "5.0": 0,
          "25.0": 0,
          "50.0": 0,
          "75.0": 0,
          "95.0": 0,
          "99.0": 0
        }
      },
      "top_errors": [
        {
          "count": 1202,
          "message": "An error occurred during rule execution message verification_exception"
        },
        {
          "count": 777,
          "message": "were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances"
        },
        {
          "count": 3,
          "message": "An error occurred during rule execution message rare_error_code missing"
        },
        {
          "count": 3,
          "message": "An error occurred during rule execution message v3_windows_anomalous_path_activity missing"
        },
        {
          "count": 3,
          "message": "An error occurred during rule execution message v3_windows_rare_user_type10_remote_login missing"
        }
      ],
      "top_warnings": [
        {
          "count": 2129,
          "message": "This rule is attempting to query data from Elasticsearch indices listed in the Index pattern section of the rule definition however no index matching was found This warning will continue to appear until matching index is created or this rule is disabled"
        }
      ]
    },
    "history_over_interval": {
      "buckets": [
        {
          "timestamp": "2023-05-26T15:00:00.000Z",
          "stats": {
            "number_of_executions": {
              "total": 2245,
              "by_outcome": {
                "succeeded": 566,
                "warning": 849,
                "failed": 1336
              }
            },
            "number_of_logged_messages": {
              "total": 4996,
              "by_level": {
                "error": 1336,
                "warn": 849,
                "info": 2811,
                "debug": 0,
                "trace": 0
              }
            },
            "number_of_detected_gaps": {
              "total": 777,
              "total_duration_s": 514415894
            },
            "schedule_delay_ms": {
              "percentiles": {
                "1.0": 256,
                "5.0": 3086.9722222222217,
                "25.0": 3133,
                "50.0": 6126,
                "75.0": 59484.25,
                "95.0": 179817.25,
                "99.0": 202613
              }
            },
            "execution_duration_ms": {
              "percentiles": {
                "1.0": 280.6,
                "5.0": 327.7,
                "25.0": 371.5208333333333,
                "50.0": 415.6190476190476,
                "75.0": 505.7642857142857,
                "95.0": 740.4375,
                "99.0": 1446.1500000000005
              }
            },
            "search_duration_ms": {
              "percentiles": {
                "1.0": 0,
                "5.0": 0,
                "25.0": 0,
                "50.0": 0,
                "75.0": 8,
                "95.0": 25,
                "99.0": 46
              }
            },
            "indexing_duration_ms": {
              "percentiles": {
                "1.0": 0,
                "5.0": 0,
                "25.0": 0,
                "50.0": 0,
                "75.0": 0,
                "95.0": 0,
                "99.0": 0
              }
            }
          }
        },
        {
          "timestamp": "2023-05-26T16:00:00.000Z",
          "stats": {
            "number_of_executions": {
              "total": 3363,
              "by_outcome": {
                "succeeded": 1316,
                "warning": 1280,
                "failed": 784
              }
            },
            "number_of_logged_messages": {
              "total": 6760,
              "by_level": {
                "error": 784,
                "warn": 1280,
                "info": 4696,
                "debug": 0,
                "trace": 0
              }
            },
            "number_of_detected_gaps": {
              "total": 0,
              "total_duration_s": 0
            },
            "schedule_delay_ms": {
              "percentiles": {
                "1.0": 207,
                "5.0": 3042,
                "25.0": 3098.46511627907,
                "50.0": 3112,
                "75.0": 3145.2820512820517,
                "95.0": 6100.571428571428,
                "99.0": 6123
              }
            },
            "execution_duration_ms": {
              "percentiles": {
                "1.0": 275,
                "5.0": 319.85714285714283,
                "25.0": 370.0357142857143,
                "50.0": 410.79999229108853,
                "75.0": 500.7692307692308,
                "95.0": 675,
                "99.0": 781.3999999999996
              }
            },
            "search_duration_ms": {
              "percentiles": {
                "1.0": 0,
                "5.0": 0,
                "25.0": 0,
                "50.0": 9,
                "75.0": 17.555555555555557,
                "95.0": 34,
                "99.0": 110.5
              }
            },
            "indexing_duration_ms": {
              "percentiles": {
                "1.0": 0,
                "5.0": 0,
                "25.0": 0,
                "50.0": 0,
                "75.0": 0,
                "95.0": 0,
                "99.0": 0
              }
            }
          }
        }
      ]
    }
  }
}
```
</p>
</details>

## Cluster health endpoint

🚧 NOTE: this endpoint is **not implemented**. 🚧

```txt
POST /internal/detection_engine/health/_cluster
```

Minimal required parameters: empty object.

```json
{}
```

<details><summary>Response:</summary>
<p>

```json
{
  "message": "Not implemented",
  "timings": {
    "requested_at": "2023-05-26T16:32:01.878Z",
    "processed_at": "2023-05-26T16:32:01.881Z",
    "processing_time_ms": 3
  },
  "parameters": {
    "interval": {
      "type": "last_week",
      "granularity": "hour",
      "from": "2023-05-19T16:32:01.878Z",
      "to": "2023-05-26T16:32:01.878Z",
      "duration": "PT168H"
    }
  },
  "health": {
    "stats_at_the_moment": {
      "number_of_rules": {
        "all": {
          "total": 0,
          "enabled": 0,
          "disabled": 0
        },
        "by_origin": {
          "prebuilt": {
            "total": 0,
            "enabled": 0,
            "disabled": 0
          },
          "custom": {
            "total": 0,
            "enabled": 0,
            "disabled": 0
          }
        },
        "by_type": {},
        "by_outcome": {}
      }
    },
    "stats_over_interval": {
      "message": "Not implemented"
    },
    "history_over_interval": {
      "buckets": []
    }
  }
}
```
</p>
</details>

## Optional parameters

All the three endpoints accept optional `interval` and `debug` request
parameters.

### Health interval

You can change the interval over which the health stats will be
calculated. If you don't specify it, by default health stats will be
calculated over the last day with the granularity of 1 hour.

```json
{
  "interval": {
    "type": "last_week",
    "granularity": "day"
  }
}
```

You can also specify a custom date range with exact interval bounds.

```json
{
  "interval": {
    "type": "custom_range",
    "granularity": "minute",
    "from": "2023-05-20T16:24:21.628Z",
    "to": "2023-05-26T16:24:21.628Z"
  }
}
```

Please keep in mind that requesting large intervals with small
granularity can generate substantial load on the system and enormous API
responses.

### Debug mode

You can also include various debug information in the response, such as
queries and aggregations sent to Elasticsearch and response received
from it.

```json
{
  "debug": true
}
```

In the response you will find something like that:

<details><summary>Response:</summary>
<p>

```json
{
  "health": {
    "debug": {
      "rulesClient": {
        "request": {
          "aggs": {
            "rulesByEnabled": {
              "terms": {
                "field": "alert.attributes.enabled"
              }
            },
            "rulesByOrigin": {
              "terms": {
                "field": "alert.attributes.params.immutable"
              },
              "aggs": {
                "rulesByEnabled": {
                  "terms": {
                    "field": "alert.attributes.enabled"
                  }
                }
              }
            },
            "rulesByType": {
              "terms": {
                "field": "alert.attributes.alertTypeId"
              },
              "aggs": {
                "rulesByEnabled": {
                  "terms": {
                    "field": "alert.attributes.enabled"
                  }
                }
              }
            },
            "rulesByOutcome": {
              "terms": {
                "field": "alert.attributes.lastRun.outcome"
              },
              "aggs": {
                "rulesByEnabled": {
                  "terms": {
                    "field": "alert.attributes.enabled"
                  }
                }
              }
            }
          }
        },
        "response": {
          "aggregations": {
            "rulesByOutcome": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "warning",
                  "doc_count": 307,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 307
                      }
                    ]
                  }
                },
                {
                  "key": "succeeded",
                  "doc_count": 266,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 266
                      }
                    ]
                  }
                },
                {
                  "key": "failed",
                  "doc_count": 204,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 204
                      }
                    ]
                  }
                }
              ]
            },
            "rulesByType": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "siem.eqlRule",
                  "doc_count": 381,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 381
                      }
                    ]
                  }
                },
                {
                  "key": "siem.queryRule",
                  "doc_count": 325,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 325
                      }
                    ]
                  }
                },
                {
                  "key": "siem.mlRule",
                  "doc_count": 47,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 47
                      }
                    ]
                  }
                },
                {
                  "key": "siem.thresholdRule",
                  "doc_count": 18,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 18
                      }
                    ]
                  }
                },
                {
                  "key": "siem.newTermsRule",
                  "doc_count": 4,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 4
                      }
                    ]
                  }
                },
                {
                  "key": "siem.indicatorRule",
                  "doc_count": 2,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 2
                      }
                    ]
                  }
                }
              ]
            },
            "rulesByOrigin": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "true",
                  "doc_count": 776,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 776
                      }
                    ]
                  }
                },
                {
                  "key": "false",
                  "doc_count": 1,
                  "rulesByEnabled": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                      {
                        "key": 1,
                        "key_as_string": "true",
                        "doc_count": 1
                      }
                    ]
                  }
                }
              ]
            },
            "rulesByEnabled": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": 1,
                  "key_as_string": "true",
                  "doc_count": 777
                }
              ]
            }
          }
        }
      },
      "eventLog": {
        "request": {
          "aggs": {
            "totalExecutions": {
              "cardinality": {
                "field": "kibana.alert.rule.execution.uuid"
              }
            },
            "executeEvents": {
              "filter": {
                "term": {
                  "event.action": "execute"
                }
              },
              "aggs": {
                "executionDurationMs": {
                  "percentiles": {
                    "field": "kibana.alert.rule.execution.metrics.total_run_duration_ms",
                    "missing": 0,
                    "percents": [1, 5, 25, 50, 75, 95, 99]
                  }
                },
                "scheduleDelayNs": {
                  "percentiles": {
                    "field": "kibana.task.schedule_delay",
                    "missing": 0,
                    "percents": [1, 5, 25, 50, 75, 95, 99]
                  }
                }
              }
            },
            "statusChangeEvents": {
              "filter": {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "event.action": "status-change"
                      }
                    }
                  ],
                  "must_not": [
                    {
                      "terms": {
                        "kibana.alert.rule.execution.status": ["running", "going to run"]
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "executionsByStatus": {
                  "terms": {
                    "field": "kibana.alert.rule.execution.status"
                  }
                }
              }
            },
            "executionMetricsEvents": {
              "filter": {
                "term": {
                  "event.action": "execution-metrics"
                }
              },
              "aggs": {
                "gaps": {
                  "filter": {
                    "exists": {
                      "field": "kibana.alert.rule.execution.metrics.execution_gap_duration_s"
                    }
                  },
                  "aggs": {
                    "totalGapDurationS": {
                      "sum": {
                        "field": "kibana.alert.rule.execution.metrics.execution_gap_duration_s"
                      }
                    }
                  }
                },
                "searchDurationMs": {
                  "percentiles": {
                    "field": "kibana.alert.rule.execution.metrics.total_search_duration_ms",
                    "missing": 0,
                    "percents": [1, 5, 25, 50, 75, 95, 99]
                  }
                },
                "indexingDurationMs": {
                  "percentiles": {
                    "field": "kibana.alert.rule.execution.metrics.total_indexing_duration_ms",
                    "missing": 0,
                    "percents": [1, 5, 25, 50, 75, 95, 99]
                  }
                }
              }
            },
            "messageContainingEvents": {
              "filter": {
                "terms": {
                  "event.action": ["status-change", "message"]
                }
              },
              "aggs": {
                "messagesByLogLevel": {
                  "terms": {
                    "field": "log.level"
                  }
                },
                "errors": {
                  "filter": {
                    "term": {
                      "log.level": "error"
                    }
                  },
                  "aggs": {
                    "topErrors": {
                      "categorize_text": {
                        "field": "message",
                        "size": 5,
                        "similarity_threshold": 99
                      }
                    }
                  }
                },
                "warnings": {
                  "filter": {
                    "term": {
                      "log.level": "warn"
                    }
                  },
                  "aggs": {
                    "topWarnings": {
                      "categorize_text": {
                        "field": "message",
                        "size": 5,
                        "similarity_threshold": 99
                      }
                    }
                  }
                }
              }
            },
            "statsHistory": {
              "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "hour"
              },
              "aggs": {
                "totalExecutions": {
                  "cardinality": {
                    "field": "kibana.alert.rule.execution.uuid"
                  }
                },
                "executeEvents": {
                  "filter": {
                    "term": {
                      "event.action": "execute"
                    }
                  },
                  "aggs": {
                    "executionDurationMs": {
                      "percentiles": {
                        "field": "kibana.alert.rule.execution.metrics.total_run_duration_ms",
                        "missing": 0,
                        "percents": [1, 5, 25, 50, 75, 95, 99]
                      }
                    },
                    "scheduleDelayNs": {
                      "percentiles": {
                        "field": "kibana.task.schedule_delay",
                        "missing": 0,
                        "percents": [1, 5, 25, 50, 75, 95, 99]
                      }
                    }
                  }
                },
                "statusChangeEvents": {
                  "filter": {
                    "bool": {
                      "filter": [
                        {
                          "term": {
                            "event.action": "status-change"
                          }
                        }
                      ],
                      "must_not": [
                        {
                          "terms": {
                            "kibana.alert.rule.execution.status": ["running", "going to run"]
                          }
                        }
                      ]
                    }
                  },
                  "aggs": {
                    "executionsByStatus": {
                      "terms": {
                        "field": "kibana.alert.rule.execution.status"
                      }
                    }
                  }
                },
                "executionMetricsEvents": {
                  "filter": {
                    "term": {
                      "event.action": "execution-metrics"
                    }
                  },
                  "aggs": {
                    "gaps": {
                      "filter": {
                        "exists": {
                          "field": "kibana.alert.rule.execution.metrics.execution_gap_duration_s"
                        }
                      },
                      "aggs": {
                        "totalGapDurationS": {
                          "sum": {
                            "field": "kibana.alert.rule.execution.metrics.execution_gap_duration_s"
                          }
                        }
                      }
                    },
                    "searchDurationMs": {
                      "percentiles": {
                        "field": "kibana.alert.rule.execution.metrics.total_search_duration_ms",
                        "missing": 0,
                        "percents": [1, 5, 25, 50, 75, 95, 99]
                      }
                    },
                    "indexingDurationMs": {
                      "percentiles": {
                        "field": "kibana.alert.rule.execution.metrics.total_indexing_duration_ms",
                        "missing": 0,
                        "percents": [1, 5, 25, 50, 75, 95, 99]
                      }
                    }
                  }
                },
                "messageContainingEvents": {
                  "filter": {
                    "terms": {
                      "event.action": ["status-change", "message"]
                    }
                  },
                  "aggs": {
                    "messagesByLogLevel": {
                      "terms": {
                        "field": "log.level"
                      }
                    }
                  }
                }
              }
            }
          }
        },
        "response": {
          "aggregations": {
            "statsHistory": {
              "buckets": [
                {
                  "key_as_string": "2023-05-26T15:00:00.000Z",
                  "key": 1685113200000,
                  "doc_count": 11388,
                  "statusChangeEvents": {
                    "doc_count": 2751,
                    "executionsByStatus": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 0,
                      "buckets": [
                        {
                          "key": "failed",
                          "doc_count": 1336
                        },
                        {
                          "key": "partial failure",
                          "doc_count": 849
                        },
                        {
                          "key": "succeeded",
                          "doc_count": 566
                        }
                      ]
                    }
                  },
                  "totalExecutions": {
                    "value": 2245
                  },
                  "messageContainingEvents": {
                    "doc_count": 4996,
                    "messagesByLogLevel": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 0,
                      "buckets": [
                        {
                          "key": "info",
                          "doc_count": 2811
                        },
                        {
                          "key": "error",
                          "doc_count": 1336
                        },
                        {
                          "key": "warn",
                          "doc_count": 849
                        }
                      ]
                    }
                  },
                  "executeEvents": {
                    "doc_count": 2245,
                    "scheduleDelayNs": {
                      "values": {
                        "1.0": 256000000,
                        "5.0": 3086972222.222222,
                        "25.0": 3133000000,
                        "50.0": 6126000000,
                        "75.0": 59484250000,
                        "95.0": 179817250000,
                        "99.0": 202613000000
                      }
                    },
                    "executionDurationMs": {
                      "values": {
                        "1.0": 280.6,
                        "5.0": 327.7,
                        "25.0": 371.5208333333333,
                        "50.0": 415.6190476190476,
                        "75.0": 505.575,
                        "95.0": 740.4375,
                        "99.0": 1446.1500000000005
                      }
                    }
                  },
                  "executionMetricsEvents": {
                    "doc_count": 1902,
                    "searchDurationMs": {
                      "values": {
                        "1.0": 0,
                        "5.0": 0,
                        "25.0": 0,
                        "50.0": 0,
                        "75.0": 8,
                        "95.0": 25,
                        "99.0": 46
                      }
                    },
                    "gaps": {
                      "doc_count": 777,
                      "totalGapDurationS": {
                        "value": 514415894
                      }
                    },
                    "indexingDurationMs": {
                      "values": {
                        "1.0": 0,
                        "5.0": 0,
                        "25.0": 0,
                        "50.0": 0,
                        "75.0": 0,
                        "95.0": 0,
                        "99.0": 0
                      }
                    }
                  }
                },
                {
                  "key_as_string": "2023-05-26T16:00:00.000Z",
                  "key": 1685116800000,
                  "doc_count": 28325,
                  "statusChangeEvents": {
                    "doc_count": 6126,
                    "executionsByStatus": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 0,
                      "buckets": [
                        {
                          "key": "succeeded",
                          "doc_count": 2390
                        },
                        {
                          "key": "partial failure",
                          "doc_count": 2305
                        },
                        {
                          "key": "failed",
                          "doc_count": 1431
                        }
                      ]
                    }
                  },
                  "totalExecutions": {
                    "value": 6170
                  },
                  "messageContainingEvents": {
                    "doc_count": 12252,
                    "messagesByLogLevel": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 0,
                      "buckets": [
                        {
                          "key": "info",
                          "doc_count": 8516
                        },
                        {
                          "key": "warn",
                          "doc_count": 2305
                        },
                        {
                          "key": "error",
                          "doc_count": 1431
                        }
                      ]
                    }
                  },
                  "executeEvents": {
                    "doc_count": 6126,
                    "scheduleDelayNs": {
                      "values": {
                        "1.0": 193000000,
                        "5.0": 3017785185.1851854,
                        "25.0": 3086000000,
                        "50.0": 3105877192.982456,
                        "75.0": 3134645161.290323,
                        "95.0": 6081772222.222222,
                        "99.0": 6122000000
                      }
                    },
                    "executionDurationMs": {
                      "values": {
                        "1.0": 275.17333333333335,
                        "5.0": 324.8014285714285,
                        "25.0": 377.0752688172043,
                        "50.0": 431,
                        "75.0": 532.3870967741935,
                        "95.0": 720.6761904761904,
                        "99.0": 922.6799999999985
                      }
                    }
                  },
                  "executionMetricsEvents": {
                    "doc_count": 3821,
                    "searchDurationMs": {
                      "values": {
                        "1.0": 0,
                        "5.0": 0,
                        "25.0": 0,
                        "50.0": 9.8,
                        "75.0": 18,
                        "95.0": 40.17499999999999,
                        "99.0": 124
                      }
                    },
                    "gaps": {
                      "doc_count": 0,
                      "totalGapDurationS": {
                        "value": 0
                      }
                    },
                    "indexingDurationMs": {
                      "values": {
                        "1.0": 0,
                        "5.0": 0,
                        "25.0": 0,
                        "50.0": 0,
                        "75.0": 0,
                        "95.0": 0,
                        "99.0": 0
                      }
                    }
                  }
                }
              ]
            },
            "statusChangeEvents": {
              "doc_count": 8877,
              "executionsByStatus": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                  {
                    "key": "partial failure",
                    "doc_count": 3154
                  },
                  {
                    "key": "succeeded",
                    "doc_count": 2956
                  },
                  {
                    "key": "failed",
                    "doc_count": 2767
                  }
                ]
              }
            },
            "totalExecutions": {
              "value": 8455
            },
            "messageContainingEvents": {
              "doc_count": 17248,
              "messagesByLogLevel": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                  {
                    "key": "info",
                    "doc_count": 11327
                  },
                  {
                    "key": "warn",
                    "doc_count": 3154
                  },
                  {
                    "key": "error",
                    "doc_count": 2767
                  }
                ]
              },
              "warnings": {
                "doc_count": 3154,
                "topWarnings": {
                  "buckets": [
                    {
                      "doc_count": 3154,
                      "key": "This rule is attempting to query data from Elasticsearch indices listed in the Index pattern section of the rule definition however no index matching was found This warning will continue to appear until matching index is created or this rule is disabled",
                      "regex": ".*?This.+?rule.+?is.+?attempting.+?to.+?query.+?data.+?from.+?Elasticsearch.+?indices.+?listed.+?in.+?the.+?Index.+?pattern.+?section.+?of.+?the.+?rule.+?definition.+?however.+?no.+?index.+?matching.+?was.+?found.+?This.+?warning.+?will.+?continue.+?to.+?appear.+?until.+?matching.+?index.+?is.+?created.+?or.+?this.+?rule.+?is.+?disabled.*?",
                      "max_matching_length": 342
                    }
                  ]
                }
              },
              "errors": {
                "doc_count": 2767,
                "topErrors": {
                  "buckets": [
                    {
                      "doc_count": 1802,
                      "key": "An error occurred during rule execution message verification_exception",
                      "regex": ".*?An.+?error.+?occurred.+?during.+?rule.+?execution.+?message.+?verification_exception.*?",
                      "max_matching_length": 2064
                    },
                    {
                      "doc_count": 777,
                      "key": "were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances",
                      "regex": ".*?were.+?not.+?queried.+?between.+?this.+?rule.+?execution.+?and.+?the.+?last.+?execution.+?so.+?signals.+?may.+?have.+?been.+?missed.+?Consider.+?increasing.+?your.+?look.+?behind.+?time.+?or.+?adding.+?more.+?Kibana.+?instances.*?",
                      "max_matching_length": 216
                    },
                    {
                      "doc_count": 4,
                      "key": "An error occurred during rule execution message rare_error_code missing",
                      "regex": ".*?An.+?error.+?occurred.+?during.+?rule.+?execution.+?message.+?rare_error_code.+?missing.*?",
                      "max_matching_length": 82
                    },
                    {
                      "doc_count": 4,
                      "key": "An error occurred during rule execution message v3_windows_anomalous_path_activity missing",
                      "regex": ".*?An.+?error.+?occurred.+?during.+?rule.+?execution.+?message.+?v3_windows_anomalous_path_activity.+?missing.*?",
                      "max_matching_length": 103
                    },
                    {
                      "doc_count": 4,
                      "key": "An error occurred during rule execution message v3_windows_rare_user_type10_remote_login missing",
                      "regex": ".*?An.+?error.+?occurred.+?during.+?rule.+?execution.+?message.+?v3_windows_rare_user_type10_remote_login.+?missing.*?",
                      "max_matching_length": 110
                    }
                  ]
                }
              }
            },
            "executeEvents": {
              "doc_count": 8371,
              "scheduleDelayNs": {
                "values": {
                  "1.0": 206000000,
                  "5.0": 3027000000,
                  "25.0": 3092000000,
                  "50.0": 3116000000,
                  "75.0": 3278666666.6666665,
                  "95.0": 99656950000,
                  "99.0": 186632790000
                }
              },
              "executionDurationMs": {
                "values": {
                  "1.0": 275.5325,
                  "5.0": 326.07857142857137,
                  "25.0": 375.68969144460027,
                  "50.0": 427,
                  "75.0": 526.2948717948718,
                  "95.0": 727.2480952380952,
                  "99.0": 1009.5299999999934
                }
              }
            },
            "executionMetricsEvents": {
              "doc_count": 5723,
              "searchDurationMs": {
                "values": {
                  "1.0": 0,
                  "5.0": 0,
                  "25.0": 0,
                  "50.0": 4,
                  "75.0": 16,
                  "95.0": 34.43846153846145,
                  "99.0": 116.51333333333302
                }
              },
              "gaps": {
                "doc_count": 777,
                "totalGapDurationS": {
                  "value": 514415894
                }
              },
              "indexingDurationMs": {
                "values": {
                  "1.0": 0,
                  "5.0": 0,
                  "25.0": 0,
                  "50.0": 0,
                  "75.0": 0,
                  "95.0": 0,
                  "99.0": 0
                }
              }
            }
          }
        }
      }
    }
  }
}
```
</p>
</details>

## Other notes

I'm thinking about backporting it to `8.8` so it could become available
in `8.8.1+` clusters.

In the next episodes:

- Add support for it to the
[support-diagnostics](https://github.com/elastic/support-diagnostics)
tool so its output could be available in the diagnostic dumps.
- Implement the cluster health endpoint.
- Calculate more metrics for the rule health endpoint.
- Calculate more metrics for the space health endpoint.
- Etc - see the to-do list in the epic.

### Checklist

Delete any items that are not applicable to this PR.

- [x]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- See dev docs added in
`x-pack/plugins/security_solution/common/detection_engine/rule_monitoring/api/detection_engine_health/README.md`
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

(cherry picked from commit 8ac83df6202759d38b04a781d8739c0ce3e0986e)

# Conflicts:
#	x-pack/plugins/security_solution/server/plugin.ts
@banderror
Copy link
Contributor Author

buildkite test this

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
securitySolution 4054 4061 +7

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/securitysolution-io-ts-types 36 37 +1
securitySolution 76 77 +1
total +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
securitySolution 10.6MB 10.6MB +4.7KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
securitySolution 27 28 +1
Unknown metric groups

API count

id before after diff
@kbn/securitysolution-io-ts-types 65 66 +1
securitySolution 117 118 +1
total +2

ESLint disabled line counts

id before after diff
enterpriseSearch 17 19 +2
securitySolution 400 405 +5
total +7

Total ESLint disabled count

id before after diff
enterpriseSearch 18 20 +2
securitySolution 480 485 +5
total +7

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @banderror

@banderror banderror merged commit c65f896 into elastic:8.8 Jun 15, 2023
@banderror banderror deleted the backport/8.8/pr-157155 branch June 15, 2023 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants