[RCA] AI-assisted root cause analysis #197200
Conversation
  field
): [
  string,
  {
I would appreciate this type pulled out as it's difficult to process nested.
agreed, was being lazy there 😄
size,
categorization_analyzer: useMlStandardTokenizer
  ? {
      tokenizer: 'ml_standard',
My understanding was that we wanted to avoid the ml_standard tokenizer. I see there have been some internal discussions about that between myself, you, and @weltenwort.
I suppose it's fine if we're being very deliberate about when we use it via the `useMlStandardTokenizer` variable.
I suggested this a few times in the thread; the approach here is as follows:
- use random sampling + standard tokenizer to get the categories that appear most often
- exclude those patterns in an unsampled second request (which is possible because the standard tokenizer was used), and use the ml_standard tokenizer there, which gives better results on low-frequency patterns
Random sampling is disabled for now because of a bug, but I'll re-enable that and use a terms agg on _id instead of a top_hits agg, and then get the documents in a follow-up mget request.
FWIW, I think we're better off talking about these patterns as "less frequent" or "low volume". When I think about "rare" I think about the ML function, where it means that a pattern is "less rare" than before, i.e. it is new, or appearing only slightly more than before. In other words, with your strategy we are not detecting a change, we are simply detecting low volume; with the ML function, we are detecting a change. So I think "rare" is less useful here than "low frequency".
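The two-pass flow described above can be sketched as a pair of Elasticsearch request bodies. This is a hedged illustration only: the request shapes follow ES's `categorize_text`, `random_sampler`, and `regexp` APIs, but the helper names and field choices are mine, not the PR's actual code.

```typescript
// Pass 1: random sampling + default tokenizer, to surface high-volume patterns.
function buildFirstPass(field: string, samplingProbability: number) {
  return {
    size: 0,
    aggs: {
      sampled: {
        random_sampler: { probability: samplingProbability },
        aggs: {
          categories: {
            categorize_text: { field },
          },
        },
      },
    },
  };
}

// Pass 2: unsampled, excluding the high-volume patterns via their regexes,
// with the ml_standard tokenizer for better low-frequency results.
function buildSecondPass(field: string, excludeRegexes: string[]) {
  return {
    size: 0,
    query: {
      bool: {
        must_not: excludeRegexes.map((value) => ({ regexp: { [field]: { value } } })),
      },
    },
    aggs: {
      categories: {
        categorize_text: {
          field,
          categorization_analyzer: { tokenizer: 'ml_standard' },
        },
      },
    },
  };
}
```

The key point is that excluding pass-1 patterns is only safe because both passes can express those patterns as regexes over the same tokenization.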
size,
start,
end,
}: CategorizeTextOptions & { changes?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {
- }: CategorizeTextOptions & { changes?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {
+ }: CategorizeTextOptions & { includeChanges?: boolean }): Promise<Array<FieldPatternResult<boolean>>> {
better indeed 👍
size: 1000,
changes,
samplingProbability: 1,
useMlStandardTokenizer: true,
Why use the ml_standard tokenizer on the second query?
},
samplingProbability,
useMlStandardTokenizer: false,
size: 50,
50 seems somewhat small to consider the patterns found in the second query truly rare.
}) {
  const alertsKuery = Object.entries(entity)
    .map(([field, value]) => {
      return `(${ALERT_GROUP_FIELD}:"${field}" AND ${ALERT_GROUP_VALUE}:"${value}")`;
I believe some of the rules are missing adding entity fields, like `service.name`, under `kibana.alert.group.field` and `kibana.alert.group.value`, but do have the ECS fields like `service.name` in the AAD.
thanks!
import { writeRcaReport } from './write_rca_report';
import { generateSignificantEventsTimeline, SignificantEventsTimeline } from './generate_timeline';

const tools = {
Could you please explain the concept of tools?
How do tools differ from function calling?
What does the schema for tools represent?
Could you provide an explanation of how tools are called, what data is available to a tool, and how to register new tools? Or point us in the direction of any documentation.
tools are function calling! we have docs here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/observability_ai_assistant/README.md
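To make that concrete, here is a hedged sketch of what a tool definition and dispatch can look like. The tool name, schema fields, and handler are hypothetical; see the linked README for the real registration API.

```typescript
// A tool pairs a name with a JSON-Schema-like argument schema. The LLM
// responds with a "tool call"; the arguments are validated against the
// schema before the matching handler runs. Names here are illustrative.
const tools = {
  get_alerts: {
    description: 'Fetch active alerts for an entity',
    schema: {
      type: 'object',
      properties: {
        entity: { type: 'string', description: 'e.g. a service.name value' },
      },
      required: ['entity'],
    },
  },
} as const;

type ToolName = keyof typeof tools;

// Sketch of dispatching a tool call returned by the LLM.
function handleToolCall(name: ToolName, args: { entity: string }) {
  // in real code: validate args against tools[name].schema, then execute
  return { name, response: { entity: args.entity, alerts: [] as unknown[] } };
}
```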
logSources,
spaceId,
connectorId,
inferenceClient,
Could you please help our team understand the difference between integrating with the inference client and integrating with the obs ai assistant client?
the `inference` client is (right now) a wrapper around the GenAI connectors and eventually around the ES `_inference` API. We'll gradually move to the inference plugin everywhere as its capabilities improve. Docs here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/inference/README.md
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
Pinging @elastic/obs-ai-assistant (Team:Obs AI Assistant)
ML changes LGTM
LGTM, just a couple comments
if (stream && retry?.onValidationError) {
  throw new Error(`Retry options are not supported in streaming mode`);
}
NIT: `stream && retry?.onValidationError !== false`?
`retry?.onValidationError !== undefined`?
functionCalling,
stream: false,
retry: {
  onValidationError: Number(retry.onValidationError) || false,
Don't you want to decrement that?
yes, I forgot to add a test too, so thanks
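For illustration, the decrement being discussed could look like this minimal sketch; the helper name and the number-or-boolean shape are mine, not the PR's actual code.

```typescript
// `onValidationError` is either a retry budget (number) or a boolean.
// On each recursive attempt the remaining budget must be decremented,
// otherwise a persistent validation error would retry forever.
// Number(true) is 1, so a bare `true` allows exactly one retry.
function nextRetryBudget(onValidationError: number | boolean): number | false {
  const remaining = Number(onValidationError) - 1;
  return remaining > 0 ? remaining : false;
}
```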
change_point: change.change_point,
p_value: change.p_value,
Maybe?
- change_point: change.change_point,
- p_value: change.p_value,
+ changePoint: change.change_point,
+ pValue: change.p_value,
}

const patternsToExclude = topMessagePatterns.filter((pattern) => {
  const complexity = pattern.regex.match(/(\.\+\?)|(\.\*\?)/g)?.length ?? 0;
Could you please explain what this regex does?
I will add a comment - it basically counts the number of non-greedy wildcard groups. ES will barf if there are too many of them in a regexp query.
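Extracted as a standalone sketch, the counting logic amounts to:

```typescript
// Counts the non-greedy wildcard groups ("(.+?)" or "(.*?)") that a
// generated category regex contains. Elasticsearch rejects regexp queries
// that are too complex, so overly complex patterns are filtered out.
function regexComplexity(categoryRegex: string): number {
  return categoryRegex.match(/(\.\+\?)|(\.\*\?)/g)?.length ?? 0;
}
```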
complexity <= 25 &&
// anything less than 50 messages should be re-processed with the ml_standard tokenizer
pattern.count > 50
Maybe create constants for these numbers?
yeah hmm I'm not using them anywhere else, and in this case people can immediately see what the value is without moving up & down the file
const [field, value] = Object.entries(entity)[0];

const { terms } = await esClient.client.termsEnum({
  index: castArray(index).join(','),
nit
- index: castArray(index).join(','),
+ index: castArray(index).join(),
hm, no right? that will join the indices together without a comma separator
the default is comma
woah I did not know this, yeah I'll keep the comma I think 😅
Agree, explicit is better.
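For reference, the default-separator behavior discussed above:

```typescript
// Array.prototype.join defaults to a comma separator, so both calls below
// build the same multi-index string; the explicit ',' just makes it obvious.
const indices = ['logs-*', 'metrics-*'];
const implicit = indices.join();
const explicit = indices.join(',');
```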
💚 Build Succeeded
cc @sorenlouv
Core changes LGTM
export type ESSearchRequest = estypes.SearchRequest;
export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;
nit: I'd love it if we could start highlighting that `body` is deprecated:

- export type ESSearchRequest = estypes.SearchRequest;
+ /** @deprecated Use ESSearchRequestWithoutBody instead */
+ export type ESSearchRequest = estypes.SearchRequest;
  export type ESSearchRequestWithoutBody = estypesWithoutBody.SearchRequest;
In any case, I think this is a good change... as we'll be able to progressively migrate consumers of each type.
Starting backport for target branches: 8.x https://github.com/elastic/kibana/actions/runs/12275530500
💔 All backports failed
Manual backport — to create the backport manually run:
Questions? Please refer to the Backport tool documentation
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI.
Questions? Please refer to the Backport tool documentation
Implements an LLM-based root cause analysis process. At a high level, it works by investigating entities - which means pulling in alerts, SLOs, and log patterns. From there, it can inspect related entities to get to the root cause.

The backend implementation lives in `x-pack/packages/observability_utils-*` (`service_rca`). It can be imported into any server-side plugin and executed from there.

The UI changes are mostly contained to `x-pack/plugins/observability_solution/observability_ai_assistant_app`. This plugin now exports a `RootCauseAnalysisContainer` which takes a stream of data that is returned by the root cause analysis process.

The current implementation lives in the Investigate app. There, it calls its own endpoint that kicks off the RCA process, and feeds it into the `RootCauseAnalysisContainer` exposed by the Observability AI Assistant app plugin. I've left it in a route there so the investigation itself can be updated as the process runs - this would allow the user to close the browser, come back later, and see a full investigation.
> [!NOTE]
> Notes for reviewing teams
>
> @kbn/es-types:
> - support both types and typesWithBodyKey
> - simplify KeysOfSources type
>
> @kbn/server-route-repository:
> - abortable streamed responses
>
> @kbn/sse-utils*:
> - abortable streamed responses
> - serialize errors in specific format for more reliable re-hydration of errors
> - keep connection open with SSE comments
>
> @kbn/inference-*:
> - export *Of variants of types, for easier manual inference
> - add automated retries for `output` API
> - add `name` to tool responses for type inference (get type of tool response via tool name)
> - add `data` to tool responses for transporting internal data (not sent to the LLM)
> - simplify `chunksIntoMessage`
> - allow consumers of nlToEsql task to add to `system` prompt
> - add toolCallId to validation error message
>
> @kbn/aiops*:
> - export `categorizationAnalyzer` for use in observability-ai*
>
> @kbn/observability-ai-assistant*:
> - configurable limit (tokens or doc count) for knowledge base recall
>
> @kbn/slo*:
> - export client that returns summary indices

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Maryam Saeidi <[email protected]>
Co-authored-by: Bena Kansara <[email protected]>

(cherry picked from commit fa1998c)

# Conflicts:
# .github/CODEOWNERS
# Backport

This will backport the following commits from `main` to `8.x`:
- [[RCA] AI-assisted root cause analysis (#197200)](#197200)

<!--- Backport version: 7.3.2 -->

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

<!--BACKPORT {commits} BACKPORT-->