
[RCA] AI-assisted root cause analysis #197200

Merged: 65 commits, Dec 11, 2024
Conversation

@dgieselaar (Member) commented Oct 22, 2024

Implements an LLM-based root cause analysis process. At a high level, it works by investigating entities - which means pulling in alerts, SLOs, and log patterns. From there, it can inspect related entities to get to the root cause.

The backend implementation lives in x-pack/packages/observability_utils-* (service_rca). It can be imported into any server-side plugin and executed from there.

The UI changes are mostly contained to x-pack/plugins/observability_solution/observability_ai_assistant_app. This plugin now exports a RootCauseAnalysisContainer which takes a stream of data that is returned by the root cause analysis process.

The current implementation lives in the Investigate app. There, it calls its own endpoint that kicks off the RCA process, and feeds it into the RootCauseAnalysisContainer exposed by the Observability AI Assistant app plugin. I've left it in a route there so the investigation itself can be updated as the process runs - this would allow the user to close the browser and come back later, and see a full investigation.

Note

Notes for reviewing teams

@kbn/es-types:

  • support both types and typesWithBodyKey
  • simplify KeysOfSources type

@kbn/server-route-repository:

  • abortable streamed responses

@kbn/sse-utils*:

  • abortable streamed responses
  • serialize errors in specific format for more reliable re-hydration of errors
  • keep connection open with SSE comments

@kbn/inference-*:

  • export *Of variants of types, for easier manual inference
  • add automated retries for output API
  • add name to tool responses for type inference (get type of tool response via tool name)
  • add data to tool responses for transporting internal data (not sent to the LLM)
  • simplify chunksIntoMessage
  • allow consumers of nlToEsql task to add to system prompt
  • add toolCallId to validation error message

@kbn/aiops*:

  • export categorizationAnalyzer for use in observability-ai*

@kbn/observability-ai-assistant*

  • configurable limit (tokens or doc count) for knowledge base recall

@kbn/slo*:

  • export client that returns summary indices
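The @kbn/sse-utils* items above lean on a small wire-format detail. A minimal sketch of the keep-alive idea, with hypothetical helper names (not the actual @kbn/sse-utils API):

```typescript
// In the SSE protocol, a line starting with ":" is a comment: EventSource
// clients ignore it, but it still counts as traffic, so proxies and load
// balancers don't close the connection as idle between real events.
function sseKeepAlive(): string {
  return ': keep-alive\n\n';
}

// A minimal data frame: an event name plus a JSON payload, ended by a blank line.
function sseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

Interleaving keep-alive comment frames into a long-running stream is what "keep connection open with SSE comments" refers to.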

@mgiota mgiota self-requested a review October 22, 2024 14:03
@dominiqueclarke dominiqueclarke self-requested a review October 22, 2024 18:52
field
): [
string,
{
Contributor:

I would appreciate this type being pulled out; it's difficult to process nested.

Member Author:

agreed, was being lazy there 😄

size,
categorization_analyzer: useMlStandardTokenizer
? {
tokenizer: 'ml_standard',
Contributor:

My understanding was that we wanted to avoid the ml_standard tokenizer. I see there have been some internal discussions about that between you, me, and @weltenwort.

I suppose it's fine if we're being very deliberate about when we use it via the useMlStandardTokenizer variable.

@dgieselaar (Member Author) commented Oct 23, 2024:

I suggested this a few times in the thread, the approach here is as follows:

  • use random sampling + the standard tokenizer to get the categories that appear most often
  • exclude those patterns in an unsampled second request (which the standard tokenizer makes possible) to get patterns that appear less often, and use the ml_standard tokenizer there, which gives better results on low-frequency patterns

Random sampling is disabled for now because of a bug, but I'll re-enable that and use a terms agg on _id instead of a top_hits agg, and then get the documents in a follow-up mget request.

FWIW, I think we're better off talking about these patterns as "less frequent" or "low volume". When I think about "rare" I think about the ML function, where it means that a pattern is less rare than before, i.e. it is new, or appearing only slightly more often than before. That is, with your strategy we are not detecting a change, we are simply detecting low volume; with the ML function, we are detecting a change. I think "rare" is less useful than "low frequency".
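The two-pass strategy described above can be sketched roughly like this; `categorizeText` stands in for the real categorize_text helper, so treat the names, sizes, and sampling probability as illustrative:

```typescript
interface Pattern {
  regex: string;
  count: number;
}

interface CategorizeOptions {
  samplingProbability: number;
  useMlStandardTokenizer: boolean;
  size: number;
  excludePatterns?: string[];
}

// Two passes: a sampled pass with the standard tokenizer finds the dominant
// patterns cheaply; an unsampled pass excludes them and uses the ml_standard
// tokenizer, which does better on low-frequency patterns.
async function twoPassCategorization(
  categorizeText: (opts: CategorizeOptions) => Promise<Pattern[]>
): Promise<{ frequent: Pattern[]; lowFrequency: Pattern[] }> {
  const frequent = await categorizeText({
    samplingProbability: 0.1,
    useMlStandardTokenizer: false,
    size: 1000,
  });

  const lowFrequency = await categorizeText({
    samplingProbability: 1,
    useMlStandardTokenizer: true,
    size: 50,
    excludePatterns: frequent.map((pattern) => pattern.regex),
  });

  return { frequent, lowFrequency };
}
```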

size,
start,
end,
}: CategorizeTextOptions & { changes?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {
Contributor:

Suggested change
-  }: CategorizeTextOptions & { changes?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {
+  }: CategorizeTextOptions & { includeChanges?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {

Member Author:

better indeed 👍

size: 1000,
changes,
samplingProbability: 1,
useMlStandardTokenizer: true,
Contributor:

Why use the standard tokenizer on the second query?

},
samplingProbability,
useMlStandardTokenizer: false,
size: 50,
Contributor:

50 seems somewhat small to consider the patterns found in the second query truly rare.

}) {
const alertsKuery = Object.entries(entity)
.map(([field, value]) => {
return `(${ALERT_GROUP_FIELD}:"${field}" AND ${ALERT_GROUP_VALUE}:"${value}")`;
Contributor:

I believe some of the rules don't add entity fields like service.name under kibana.alert.group.field and kibana.alert.group.value, but do have the ECS fields like service.name in the AAD.

Member Author:

thanks!
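The kuery construction in the snippet above pairs each entity field with the alert grouping fields. A sketch of the full helper (the OR between clauses is an assumption; the clause shape mirrors the snippet):

```typescript
const ALERT_GROUP_FIELD = 'kibana.alert.group.field';
const ALERT_GROUP_VALUE = 'kibana.alert.group.value';

// Builds a KQL filter matching alerts whose group field/value pairs
// correspond to the entity's identifying fields.
function alertsKueryForEntity(entity: Record<string, string>): string {
  return Object.entries(entity)
    .map(([field, value]) => {
      return `(${ALERT_GROUP_FIELD}:"${field}" AND ${ALERT_GROUP_VALUE}:"${value}")`;
    })
    .join(' OR ');
}
```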

import { writeRcaReport } from './write_rca_report';
import { generateSignificantEventsTimeline, SignificantEventsTimeline } from './generate_timeline';

const tools = {
Contributor:

Could you please explain the concept of tools?

How do tools differ from function calling?

What does the schema for tools represent?

Could you provide an explanation of how tools are called, what data is available to a tool, and how to register new tools? Or point us in the direction of any documentation.

Member Author:

logSources,
spaceId,
connectorId,
inferenceClient,
Contributor:

Could you please help our team understand the difference between integrating with the inference client and integrating with the obs ai assistant client?

Member Author:

the inference client is (right now) a wrapper around the GenAI connectors and eventually around the ES _inference API. We'll gradually move to the inference plugin everywhere as its capabilities improve. Docs here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/inference/README.md
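To make the distinction concrete, here is a rough sketch of the two clients' shapes as described; the method names and option fields are approximations for illustration, not verified signatures (see the linked README for the real API):

```typescript
// Approximate shapes only - not the actual Kibana interfaces.

// Low-level and conversation-less: one LLM round-trip through a GenAI
// connector (and eventually the ES _inference API).
interface InferenceClientSketch {
  chatComplete(options: {
    connectorId: string;
    system?: string;
    messages: Array<{ role: 'user' | 'assistant'; content: string }>;
  }): Promise<{ content: string }>;
}

// Higher-level: layers conversation state, function calling, and knowledge
// base recall on top of an LLM connector.
interface ObsAIAssistantClientSketch {
  complete(options: {
    conversationId?: string;
    messages: Array<{ role: string; content: string }>;
  }): Promise<unknown>;
}

// A trivial mock showing how a consumer would depend on the low-level client.
const mockInferenceClient: InferenceClientSketch = {
  async chatComplete() {
    return { content: 'ok' };
  },
};
```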

@botelastic botelastic bot added ci:project-deploy-observability Create an Observability project Team:Obs AI Assistant Observability AI Assistant Team:obs-ux-management Observability Management User Experience Team labels Dec 10, 2024
@elasticmachine (Contributor):

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@elasticmachine (Contributor):

Pinging @elastic/obs-ai-assistant (Team:Obs AI Assistant)

@jgowdyelastic (Member) left a comment:

ML changes LGTM

@pgayvallet (Contributor) left a comment:

LGTM, just a couple comments

Comment on lines 39 to 41
if (stream && retry?.onValidationError) {
throw new Error(`Retry options are not supported in streaming mode`);
}
Contributor:

NIT: stream && retry?.onValidationError !== false?

Member Author:

retry?.onValidationError !== undefined ?

functionCalling,
stream: false,
retry: {
onValidationError: Number(retry.onValidationError) || false,
Contributor:

Don't you want to decrement that?

Member Author:

yes, I forgot to add a test too, so thanks
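A sketch of the bookkeeping under discussion, assuming onValidationError is either `false` (retries disabled) or a remaining-attempts count; illustrative, not the actual @kbn/inference-* code:

```typescript
// Computes the retry option to pass down on the next (non-streaming) attempt.
// Number(true) === 1 and Number(false) === 0, so booleans coerce naturally;
// once no attempts remain, retries are disabled with `false`.
function nextOnValidationError(onValidationError: number | boolean): number | false {
  const remaining = Number(onValidationError) - 1;
  return remaining > 0 ? remaining : false;
}
```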

Comment on lines +233 to +234
change_point: change.change_point,
p_value: change.p_value,
Contributor:

Maybe?

Suggested change
-  change_point: change.change_point,
-  p_value: change.p_value,
+  changePoint: change.change_point,
+  pValue: change.p_value,

}

const patternsToExclude = topMessagePatterns.filter((pattern) => {
const complexity = pattern.regex.match(/(\.\+\?)|(\.\*\?)/g)?.length ?? 0;
Contributor:

Could you please explain what this regex does?

Member Author:

I will add a comment - it basically counts the number of non-greedy wildcard tokens (".+?" / ".*?") in the pattern's regex. ES will barf if there are too many of them in a regex query.
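Extracted from the snippet above, the check counts the wildcard tokens that categorization emits into a category's regex:

```typescript
// Each category regex produced by categorization contains ".+?" / ".*?"
// wildcards between literal tokens; too many of them make the later regex
// query expensive enough that Elasticsearch rejects or struggles with it.
function wildcardComplexity(categoryRegex: string): number {
  return categoryRegex.match(/(\.\+\?)|(\.\*\?)/g)?.length ?? 0;
}
```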

Comment on lines +345 to +347
complexity <= 25 &&
// anything less than 50 messages should be re-processed with the ml_standard tokenizer
pattern.count > 50
Contributor:

Maybe create constants for these numbers?

Member Author:

yeah hmm I'm not using them anywhere else, and in this case people can immediately see what the value is without moving up & down the file

const [field, value] = Object.entries(entity)[0];

const { terms } = await esClient.client.termsEnum({
index: castArray(index).join(','),
Contributor:

nit

Suggested change
-  index: castArray(index).join(','),
+  index: castArray(index).join(),

Member Author:

hm, no right? that will join the indices together without a comma separator

Contributor:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah I did not know this, yeah I'll keep the comma I think 😅

Member:

Agree, explicit is better.
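For reference, the behavior that settled this thread: Array.prototype.join() with no argument already uses a comma, so both forms are equivalent, and the explicit separator just reads more clearly.

```typescript
const indices = ['logs-*', 'traces-*'];
const implicitJoin = indices.join(); // default separator is ","
const explicitJoin = indices.join(','); // same result, but explicit
```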

@elasticmachine (Contributor) commented Dec 11, 2024:

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
aiops 620 621 +1
investigateApp 571 587 +16
observabilityAIAssistantApp 379 414 +35
total +52

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/es-types 30 31 +1
@kbn/sse-utils-server 3 6 +3
observabilityAIAssistant 296 383 +87
observabilityAIAssistantApp 4 7 +3
slo 48 54 +6
total +100

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
investigateApp 481.6KB 480.5KB -1.0KB
observabilityAIAssistant 19.3KB 19.3KB +1.0B
observabilityAIAssistantApp 237.9KB 292.9KB +54.9KB
total +53.9KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
@kbn/inference-common 1 3 +2
@kbn/observability-ai-server - 1 +1
observabilityAIAssistant 28 30 +2
total +5

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
inference 6.5KB 7.4KB +857.0B
investigateApp 6.5KB 11.1KB +4.5KB
observabilityAIAssistant 48.3KB 48.4KB +105.0B
observabilityAIAssistantApp 8.8KB 12.3KB +3.5KB
total +8.9KB
Unknown metric groups

API count

id before after diff
@kbn/es-types 30 31 +1
@kbn/inference-common 132 136 +4
@kbn/sse-utils-server 4 7 +3
observabilityAIAssistant 298 385 +87
observabilityAIAssistantApp 4 8 +4
slo 48 54 +6
total +105

async chunk count

id before after diff
observabilityAIAssistantApp 6 7 +1

ESLint disabled line counts

id before after diff
@kbn/observability-utils-server 0 1 +1
observabilityAIAssistantApp 10 9 -1
slo 17 16 -1
total -1

Total ESLint disabled count

id before after diff
@kbn/observability-utils-server 0 1 +1
observabilityAIAssistantApp 10 9 -1
slo 21 20 -1
total -1

History

cc @sorenlouv

@dgieselaar dgieselaar enabled auto-merge (squash) December 11, 2024 11:34
@afharo (Member) left a comment:

Core changes LGTM

Comment on lines 29 to +30
export type ESSearchRequest = estypes.SearchRequest;
export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;
Member:

nit: I'd love it if we could start highlighting that body is deprecated:

Suggested change
-  export type ESSearchRequest = estypes.SearchRequest;
+  /** @deprecated Use ESSearchRequestWithoutBody instead */
+  export type ESSearchRequest = estypes.SearchRequest;
   export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;

In any case, I think this is a good change... as we'll be able to progressively migrate consumers of each type.

@dgieselaar dgieselaar merged commit fa1998c into elastic:main Dec 11, 2024
10 checks passed
@kibanamachine (Contributor):

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/12275530500

@kibanamachine (Contributor):

💔 All backports failed

Status Branch Result
8.x Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 197200

Questions ?

Please refer to the Backport tool documentation

@dgieselaar (Member Author):

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

dgieselaar added a commit to dgieselaar/kibana that referenced this pull request Dec 11, 2024
Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Maryam Saeidi <[email protected]>
Co-authored-by: Bena Kansara <[email protected]>
(cherry picked from commit fa1998c)

# Conflicts:
#	.github/CODEOWNERS
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Dec 12, 2024
dgieselaar added a commit that referenced this pull request Dec 12, 2024
# Backport

This will backport the following commits from `main` to `8.x`:
- [[RCA] AI-assisted root cause analysis
(#197200)](#197200)

<!--- Backport version: 7.3.2 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT {commits} BACKPORT-->
Labels
backport:version Backport to applied version labels ci:project-deploy-observability Create an Observability project release_note:skip Skip the PR/issue when compiling release notes Team:Obs AI Assistant Observability AI Assistant Team:obs-ux-management Observability Management User Experience Team v8.18.0 v9.0.0