[RCA] AI-assisted root cause analysis #197200

dgieselaar · 2024-10-22T10:13:41Z

Implements an LLM-based root cause analysis process. At a high level, it works by investigating entities - which means pulling in alerts, SLOs, and log patterns. From there, it can inspect related entities to get to the root cause.

The backend implementation lives in x-pack/packages/observability_utils-* (service_rca). It can be imported into any server-side plugin and executed from there.

The UI changes are mostly contained to x-pack/plugins/observability_solution/observabillity_ai_assistant_app. This plugin now exports a RootCauseAnalysisContainer which takes a stream of data that is returned by the root cause analysis process.

The current implementation lives in the Investigate app. There, it calls its own endpoint that kicks off the RCA process, and feeds it into the RootCauseAnalysisContainer exposed by the Observability AI Assistant app plugin. I've left it in a route there so the investigation itself can be updated as the process runs - this would allow the user to close the browser and come back later, and see a full investigation.

Note

Notes for reviewing teams

@kbn/es-types:

support both types and typesWithBodyKey
simplify KeysOfSources type

@kbn/server-route-repository:

abortable streamed responses

@kbn/sse-utils*:

abortable streamed responses
serialize errors in specific format for more reliable re-hydration of errors
keep connection open with SSE comments

@kbn/inference-*:

export *Of variants of types, for easier manual inference
add automated retries for output API
add name to tool responses for type inference (get type of tool response via tool name)
add data to tool responses for transporting internal data (not sent to the LLM)
simplify chunksIntoMessage
allow consumers of nlToEsql task to add to system prompt
add toolCallId to validation error message

@kbn/aiops*:

export categorizationAnalyzer for use in observability-ai*

@kbn/observability-ai-assistant*

configurable limit (tokens or doc count) for knowledge base recall

@kbn/slo*:

export client that returns summary indices

…ocs'

dominiqueclarke · 2024-10-22T19:01:11Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+        field
+      ): [
+        string,
+        {


I would appreciate this type pulled out as it's difficult to process nested.

agreed, was being lazy there 😄

dominiqueclarke · 2024-10-22T19:06:09Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+            size,
+            categorization_analyzer: useMlStandardTokenizer
+              ? {
+                  tokenizer: 'ml_standard',


My understanding was that we wanted to avoid the ml_standard tokenizer. I see there's been some discussions internally about that both with myself, you, and @weltenwort

I suppose it's fine if we're being very deliberate about when we use it via the useMLStandardTokenizer variable.

I suggested this a few times in the thread, the approach here is as follows:

use random sampling + standard tokenizer to get categories that are appearing most often

exclude these patterns in an unsampled second request (which is possible by using the standard tokenizer) to get patterns that appear less often and use the ML standard tokenizer which gives better results on low frequency patterns

Random sampling is disabled for now because of a bug, but I'll re-enable that and use a terms agg on _id instead of a top_hits agg, and then get the documents in a follow-up mget request.

FWIW, I think we're better off talking about these patterns as "less frequent" or "low volume" - when I think about "rare" I think about the ML function, where it means that a pattern appears that is "less rare" than before, ie, it is new, or only appearing slightly more than before. ie, with your strategy, we are not detecting a change, we are simply detecting low volume. With the ML function, we are detecting a change. I think "low frequency" is less useful than "rare'.

dominiqueclarke · 2024-10-22T19:08:51Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+  size,
+  start,
+  end,
+}: CategorizeTextOptions & { changes?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {


Suggested change

}: CategorizeTextOptions & { changes?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {

}: CategorizeTextOptions & { includeChanges?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {

better indeed 👍

dominiqueclarke · 2024-10-23T00:55:23Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+        size: 1000,
+        changes,
+        samplingProbability: 1,
+        useMlStandardTokenizer: true,


Why use the standard tokenizer on the second query?

dominiqueclarke · 2024-10-23T00:59:18Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+        },
+        samplingProbability,
+        useMlStandardTokenizer: false,
+        size: 50,


50 seems somewhat small to consider the patterns found in the second query truly rare.

dominiqueclarke · 2024-10-23T01:02:31Z

...ity/observability_utils/observability_utils_server/entities/signals/get_alerts_for_entity.ts

+}) {
+  const alertsKuery = Object.entries(entity)
+    .map(([field, value]) => {
+      return `(${ALERT_GROUP_FIELD}:"${field}" AND ${ALERT_GROUP_VALUE}:"${value}")`;


I believe some of the rules are missing adding entity fields, like service.name, under kibana.alert.group.field and kibana.alert.group.value, but do have the ecs fields like service.name in the AAD.

dominiqueclarke · 2024-10-23T01:17:10Z

...ckages/observability/observability_utils/observability_utils_server/llm/service_rca/index.ts

+import { writeRcaReport } from './write_rca_report';
+import { generateSignificantEventsTimeline, SignificantEventsTimeline } from './generate_timeline';
+
+const tools = {


Could you please explain the concept of tools?

How do tools differ from function calling?

What does the schema for tools represent?

Could you provide an explanation of how tools are called, what data is available to a tool, and how to register new tools? Or point us in the direction of any documentation.

tools are function calling! we have docs here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/observability_ai_assistant/README.md

dominiqueclarke · 2024-10-23T01:35:06Z

...ckages/observability/observability_utils/observability_utils_server/llm/service_rca/index.ts

+  logSources,
+  spaceId,
+  connectorId,
+  inferenceClient,


Could you please help our team understand the difference between integrating with the inference client and integrating with the obs ai assistant client?

the inference client is (right now) a wrapper around the GenAI connectors and eventually around the ES _inference API. We'll gradually move to the inference plugin everywhere as its capabilities improve. Docs here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/inference/README.md

…lm-rca-kubecon

…lm-rca-kubecon-v2

elasticmachine · 2024-12-10T17:09:07Z

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

elasticmachine · 2024-12-10T17:09:07Z

Pinging @elastic/obs-ai-assistant (Team:Obs AI Assistant)

github-actions · 2024-12-10T17:09:15Z

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

jgowdyelastic

ML changes LGTM

pgayvallet

LGTM, just a couple comments

pgayvallet · 2024-12-11T08:40:21Z

x-pack/platform/plugins/shared/inference/common/output/create_output_api.ts

+    if (stream && retry?.onValidationError) {
+      throw new Error(`Retry options are not supported in streaming mode`);
+    }


NIT: stream && retry?.onValidationError !== false?

retry?.onValidationError !== undefined ?

pgayvallet · 2024-12-11T08:42:47Z

x-pack/platform/plugins/shared/inference/common/output/create_output_api.ts

+              functionCalling,
+              stream: false,
+              retry: {
+                onValidationError: Number(retry.onValidationError) || false,


Don't you want to decrement that?

yes, I forgot to add a test too, so thanks

crespocarlos · 2024-12-11T08:45:01Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+                    change_point: change.change_point,
+                    p_value: change.p_value,


Maybe?

Suggested change

change_point: change.change_point,

p_value: change.p_value,

changePoint: change.change_point,

pValue: change.p_value,

crespocarlos · 2024-12-11T08:46:46Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+      }
+
+      const patternsToExclude = topMessagePatterns.filter((pattern) => {
+        const complexity = pattern.regex.match(/(\.\+\?)|(\.\*\?)/g)?.length ?? 0;


Could you please explain what this regex does?

I will add a comment - it counts the amount of capturing groups basically. ES will barf if there's too much of them in a regex query.

crespocarlos · 2024-12-11T08:47:01Z

...es/observability/observability_utils/observability_utils_server/entities/get_log_patterns.ts

+          complexity <= 25 &&
+          // anything less than 50 messages should be re-processed with the ml_standard tokenizer
+          pattern.count > 50


Maybe create constants for these numbers?

yeah hmm I'm not using them anywhere else, and in this case people can immediately see what the value is without moving up & down the file

crespocarlos · 2024-12-11T08:47:25Z

...lity/observability_utils/observability_utils_server/entities/get_entities_by_fuzzy_search.ts

+  const [field, value] = Object.entries(entity)[0];
+
+  const { terms } = await esClient.client.termsEnum({
+    index: castArray(index).join(','),


nit

Suggested change

index: castArray(index).join(','),

index: castArray(index).join(),

hm, no right? that will join the indices together without a comma separator

the default is comma

woah I did not know this, yeah I'll keep the comma I think 😅

Agree, explicit is better.

elasticmachine · 2024-12-11T10:43:07Z

💚 Build Succeeded

Buildkite Build
Commit: 75068c9
Storybooks Preview
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-197200-75068c947b54

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`aiops`	620	621	+1
`investigateApp`	571	587	+16
`observabilityAIAssistantApp`	379	414	+35
total			+52

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`@kbn/es-types`	30	31	+1
`@kbn/sse-utils-server`	3	6	+3
`observabilityAIAssistant`	296	383	+87
`observabilityAIAssistantApp`	4	7	+3
`slo`	48	54	+6
total			+100

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`investigateApp`	481.6KB	480.5KB	-1.0KB
`observabilityAIAssistant`	19.3KB	19.3KB	+1.0B
`observabilityAIAssistantApp`	237.9KB	292.9KB	+54.9KB
total			+53.9KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id	before	after	diff
`@kbn/inference-common`	1	3	+2
`@kbn/observability-ai-server`	-	1	+1
`observabilityAIAssistant`	28	30	+2
total			+5

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`inference`	6.5KB	7.4KB	+857.0B
`investigateApp`	6.5KB	11.1KB	+4.5KB
`observabilityAIAssistant`	48.3KB	48.4KB	+105.0B
`observabilityAIAssistantApp`	8.8KB	12.3KB	+3.5KB
total			+8.9KB

Unknown metric groups

API count

id	before	after	diff
`@kbn/es-types`	30	31	+1
`@kbn/inference-common`	132	136	+4
`@kbn/sse-utils-server`	4	7	+3
`observabilityAIAssistant`	298	385	+87
`observabilityAIAssistantApp`	4	8	+4
`slo`	48	54	+6
total			+105

async chunk count

id	before	after	diff
`observabilityAIAssistantApp`	6	7	+1

ESLint disabled line counts

id	before	after	diff
`@kbn/observability-utils-server`	0	1	+1
`observabilityAIAssistantApp`	10	9	-1
`slo`	17	16	-1
total			-1

Total ESLint disabled count

id	before	after	diff
`@kbn/observability-utils-server`	0	1	+1
`observabilityAIAssistantApp`	10	9	-1
`slo`	21	20	-1
total			-1

History

💛 Build #258928 was flaky b06d3f5
💔 Build #258769 failed ba5d761
💔 Build #258657 failed b953c20
💔 Build #258596 failed f3c9af2
💔 Build #258160 failed 68343a9

cc @sorenlouv

afharo

Core changes LGTM

afharo · 2024-12-11T11:34:48Z

packages/kbn-es-types/src/index.ts

 export type ESSearchRequest = estypes.SearchRequest;
+export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;


nit: I'd love it if we could start highlighting that body is deprecated:

Suggested change

export type ESSearchRequest = estypes.SearchRequest;

export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;

/** @deprecated Use ESSearchRequestWithoutBody instead */

export type ESSearchRequest = estypes.SearchRequest;

export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;

In any case, I think this is a good change... as we'll be able to progressively migrate consumers of each type.

kibanamachine · 2024-12-11T11:35:21Z

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/12275530500

kibanamachine · 2024-12-11T11:40:42Z

💔 All backports failed

Status	Branch	Result
❌	8.x	Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 197200

Questions ?

Please refer to the Backport tool documentation

dgieselaar · 2024-12-11T11:49:21Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

Implements an LLM-based root cause analysis process. At a high level, it works by investigating entities - which means pulling in alerts, SLOs, and log patterns. From there, it can inspect related entities to get to the root cause. The backend implementation lives in `x-pack/packages/observability_utils-*` (`service_rca`). It can be imported into any server-side plugin and executed from there. The UI changes are mostly contained to `x-pack/plugins/observability_solution/observabillity_ai_assistant_app`. This plugin now exports a `RootCauseAnalysisContainer` which takes a stream of data that is returned by the root cause analysis process. The current implementation lives in the Investigate app. There, it calls its own endpoint that kicks off the RCA process, and feeds it into the `RootCauseAnalysisContainer` exposed by the Observability AI Assistant app plugin. I've left it in a route there so the investigation itself can be updated as the process runs - this would allow the user to close the browser and come back later, and see a full investigation. > [!NOTE] > Notes for reviewing teams > > @kbn/es-types: > - support both types and typesWithBodyKey > - simplify KeysOfSources type > > @kbn/server-route-repository: > - abortable streamed responses > > @kbn/sse-utils*: > - abortable streamed responses > - serialize errors in specific format for more reliable re-hydration of errors > - keep connection open with SSE comments > > @kbn/inference-*: > - export *Of variants of types, for easier manual inference > - add automated retries for `output` API > - add `name` to tool responses for type inference (get type of tool response via tool name) > - add `data` to tool responses for transporting internal data (not sent to the LLM) > - simplify `chunksIntoMessage` > - allow consumers of nlToEsql task to add to `system` prompt > - add toolCallId to validation error message > > @kbn/aiops*: > - export `categorizationAnalyzer` for use in observability-ai* > > @kbn/observability-ai-assistant* > - configurable limit (tokens or doc count) for knowledge base recall > > @kbn/slo*: > - export client that returns summary indices --------- Co-authored-by: kibanamachine <[email protected]> Co-authored-by: Maryam Saeidi <[email protected]> Co-authored-by: Bena Kansara <[email protected]> (cherry picked from commit fa1998c) # Conflicts: # .github/CODEOWNERS

Implements an LLM-based root cause analysis process. At a high level, it works by investigating entities - which means pulling in alerts, SLOs, and log patterns. From there, it can inspect related entities to get to the root cause. The backend implementation lives in `x-pack/packages/observability_utils-*` (`service_rca`). It can be imported into any server-side plugin and executed from there. The UI changes are mostly contained to `x-pack/plugins/observability_solution/observabillity_ai_assistant_app`. This plugin now exports a `RootCauseAnalysisContainer` which takes a stream of data that is returned by the root cause analysis process. The current implementation lives in the Investigate app. There, it calls its own endpoint that kicks off the RCA process, and feeds it into the `RootCauseAnalysisContainer` exposed by the Observability AI Assistant app plugin. I've left it in a route there so the investigation itself can be updated as the process runs - this would allow the user to close the browser and come back later, and see a full investigation. > [!NOTE] > Notes for reviewing teams > > @kbn/es-types: > - support both types and typesWithBodyKey > - simplify KeysOfSources type > > @kbn/server-route-repository: > - abortable streamed responses > > @kbn/sse-utils*: > - abortable streamed responses > - serialize errors in specific format for more reliable re-hydration of errors > - keep connection open with SSE comments > > @kbn/inference-*: > - export *Of variants of types, for easier manual inference > - add automated retries for `output` API > - add `name` to tool responses for type inference (get type of tool response via tool name) > - add `data` to tool responses for transporting internal data (not sent to the LLM) > - simplify `chunksIntoMessage` > - allow consumers of nlToEsql task to add to `system` prompt > - add toolCallId to validation error message > > @kbn/aiops*: > - export `categorizationAnalyzer` for use in observability-ai* > > @kbn/observability-ai-assistant* > - configurable limit (tokens or doc count) for knowledge base recall > > @kbn/slo*: > - export client that returns summary indices --------- Co-authored-by: kibanamachine <[email protected]> Co-authored-by: Maryam Saeidi <[email protected]> Co-authored-by: Bena Kansara <[email protected]>

# Backport This will backport the following commits from `main` to `8.x`: - [[RCA] AI-assisted root cause analysis (#197200)](#197200)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

dgieselaar added 2 commits October 22, 2024 12:13

[RCA] AI-assisted root cause analysis

480a53b

Add log pattern table

7c4d7e0

mgiota self-requested a review October 22, 2024 14:03

dgieselaar and others added 5 commits October 22, 2024 20:08

Change points

d5a55e8

Simplify SLO client

64d6557

[CI] Auto-commit changed files from 'node scripts/build_plugin_list_d…

1158a39

…ocs'

[CI] Auto-commit changed files from 'node scripts/notice'

113776f

[CI] Auto-commit changed files from 'node scripts/yarn_deduplicate'

9db7634

dominiqueclarke self-requested a review October 22, 2024 18:52

dominiqueclarke reviewed Oct 23, 2024

View reviewed changes

dgieselaar added 4 commits October 23, 2024 07:31

Merge branch 'llm-rca-kubecon' of github.com:dgieselaar/kibana into l…

b34e36e

…lm-rca-kubecon

Fix quick checks

35438eb

Merge branch 'main' of github.com:elastic/kibana into llm-rca-kubecon

8b26417

Include alerts without grouping fields

9ba32f4

dominiqueclarke self-requested a review October 23, 2024 14:49

mgiota mentioned this pull request Oct 24, 2024

AI-assisted root cause analysis R&D #197591

Open

maryam-saeidi and others added 13 commits October 24, 2024 16:08

Fix test config

04e2e96

Merge branch 'main' into llm-rca-kubecon

f06779c

Fix test config

ed0a10f

lock timerange to be investigation time range

888a8e4

Improve process

fd6d246

Merge branch 'llm-rca-kubecon' of github.com:dgieselaar/kibana into l…

ff93d69

…lm-rca-kubecon-v2

v2.1

44bc6d5

Remove console.logging

924d2c3

Make sure to log all events after finishing

f147223

Partition related entity extraction

e07334b

Order by score asc (lower is better)

96d938a

Include documents from knowledge base

f680c97

Merge branch 'main' of github.com:elastic/kibana into llm-rca-kubecon

65ec3bd

Re-skip tests in Obs AI Assistant

b06d3f5

botelastic bot added ci:project-deploy-observability Create an Observability project Team:Obs AI Assistant Observability AI Assistant Team:obs-ux-management Observability Management User Experience Team labels Dec 10, 2024

Ikuni17 approved these changes Dec 11, 2024

View reviewed changes

jgowdyelastic approved these changes Dec 11, 2024

View reviewed changes

pgayvallet approved these changes Dec 11, 2024

View reviewed changes

crespocarlos reviewed Dec 11, 2024

View reviewed changes

dgieselaar added 2 commits December 11, 2024 10:37

Add tests for retry.onValidationError

62252d0

Clarify complexity

75068c9

sorenlouv approved these changes Dec 11, 2024

View reviewed changes

crespocarlos approved these changes Dec 11, 2024

View reviewed changes

dgieselaar enabled auto-merge (squash) December 11, 2024 11:34

afharo approved these changes Dec 11, 2024

View reviewed changes

dgieselaar merged commit fa1998c into elastic:main Dec 11, 2024
10 checks passed

dgieselaar mentioned this pull request Dec 11, 2024

[8.x] [RCA] AI-assisted root cause analysis (#197200) #203767

Merged

This was referenced Dec 11, 2024

Upgrade mocha to 10.3.0 #203500

Merged

Sustainable Kibana Architecture: Add scripts/relocate CLI (beta) #203803

Merged

[Obs UX Management] Migrate leftover code owner !! #203814

Merged

kibanamachine mentioned this pull request Dec 12, 2024

[Observability Onboarding] Remove legacy team ownership #203808

Merged

	}: CategorizeTextOptions & { changes?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {
	}: CategorizeTextOptions & { includeChanges?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {

	index: castArray(index).join(','),
	index: castArray(index).join(),

		export type ESSearchRequest = estypes.SearchRequest;
		export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;

[RCA] AI-assisted root cause analysis #197200

[RCA] AI-assisted root cause analysis #197200

Conversation

dgieselaar commented Oct 22, 2024 • edited by kibanamachine Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dgieselaar Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

elasticmachine commented Dec 10, 2024

elasticmachine commented Dec 10, 2024

github-actions bot commented Dec 10, 2024

🤖 GitHub comments

jgowdyelastic left a comment

Choose a reason for hiding this comment

pgayvallet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

elasticmachine commented Dec 11, 2024 • edited Loading

💚 Build Succeeded

Metrics [docs]

Module Count

Public APIs missing comments

Async chunks

Public APIs missing exports

Page load bundle

API count

async chunk count

ESLint disabled line counts

Total ESLint disabled count

History

afharo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kibanamachine commented Dec 11, 2024

kibanamachine commented Dec 11, 2024

💔 All backports failed

Manual backport

Questions ?

dgieselaar commented Dec 11, 2024

💚 All backports created successfully

Questions ?

dgieselaar commented Oct 22, 2024 •

edited by kibanamachine

Loading

dgieselaar Oct 23, 2024 •

edited

Loading

elasticmachine commented Dec 11, 2024 •

edited

Loading