Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps] ES|QL rule type improvements - write query results to the alert doc #184541

Merged
merged 19 commits into from
Jun 18, 2024

Conversation

doakalexi
Copy link
Contributor

@doakalexi doakalexi commented May 30, 2024

Resolves https://github.com/elastic/response-ops-team/issues/200

Summary

This PR copies the fields from the ES|QL query results to the alert doc. Now that we are writing everything to the alert doc I removed the source fields selector from the ES|QL ui.

Checklist

To verify

  • Create an ES Query ES|QL rule.
  • Go to dev tools and run the query below to verify that the query results are written to the alert doc.
GET .internal.alerts-stack.alerts-default*/_search

@doakalexi
Copy link
Contributor Author

/ci

@doakalexi
Copy link
Contributor Author

/ci

@doakalexi doakalexi changed the title Updating to write esql results to the alert doc [ResponseOps] ES|QL rule type improvements - write query results to the alert doc May 31, 2024
@doakalexi doakalexi added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) release_note:skip Skip the PR/issue when compiling release notes v8.15.0 labels May 31, 2024
@doakalexi
Copy link
Contributor Author

/ci

@doakalexi doakalexi marked this pull request as ready for review June 3, 2024 14:23
@doakalexi doakalexi requested a review from a team as a code owner June 3, 2024 14:23
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@doakalexi doakalexi requested review from pmuellr and umbopepato June 3, 2024 14:24
@@ -260,6 +250,7 @@ export default function ruleTests({ getService }: FtrProviderContext) {
expect(alertDoc['host.name']).to.eql(['host-1']);
expect(alertDoc['host.hostname']).to.eql(['host-1']);
expect(alertDoc['host.id']).to.eql(['1']);
expect(alertDoc.c[0]).greaterThan(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean a query can overwrite ANY field in the alertDoc? And I guess add new fields not defined in mappings? Neither of those seem right ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to acknowledge for the PR that we are talking offline about this 🙂

@umbopepato
Copy link
Member

@doakalexi I'm getting this error every time the test rule tries to save alerts:

image

These are the rule params I'm using:

image

@pmuellr
Copy link
Member

pmuellr commented Jun 4, 2024

Umbo points out ^^^ another thing we'll have to check. Is it an ECS field, that we allow you to update, and is it the right type? I'd think we'd want to exclude fields that are the wrong type. For example, if a number is expected, but the value being added is a string (that doesn't parse into a valid number - oh man, this is getting complicated!), then don't add that field to the alert.

@mikecote
Copy link
Contributor

mikecote commented Jun 4, 2024

Umbo points out ^^^ another thing we'll have to check. Is it an ECS field, that we allow you to update, and is it the right type? I'd think we'd want to exclude fields that are the wrong type. For example, if a number is expected, but the value being added is a string (that doesn't parse into a valid number - oh man, this is getting complicated!), then don't add that field to the alert.

Does the ignore_malformed help us in this scenario? #161465

@doakalexi
Copy link
Contributor Author

doakalexi commented Jun 4, 2024

Thanks for the feedback, I am sorry I went down a different path than we were expecting. I think things should be working as expected now.

I updated the changes in 8af0458 and c7ccaf8 to only write fields that we have mapped. We intersect result fields with mapped fields and copy array of values into alert field.

Currently, we do not map all ECS fields for ES Query alerts so I added the useEcs: true to the stack alerts AAD config.

I confirmed with @shanisagiv1 that we want to remove the ui code to select source fields for ES|QL.

@pmuellr
Copy link
Member

pmuellr commented Jun 5, 2024

Does the ignore_malformed help us in this scenario? #161465

Well, sure, but that will be likely very difficult to diagnose if we have weird problems where sometimes fields are accepted, and sometimes aren't.

I'd say if we need to do that for the field TYPE checking - providing non-numeric-parseable strings as numeric types - then ... fine. :-) I can imagine that it might be hard-to-impossible for us to do real type checking here.

But we should certainly be filtering out invalid / unwanted field names without depending on ignore_malformed ...

@doakalexi doakalexi requested a review from a team as a code owner June 5, 2024 17:33
@@ -36,8 +36,8 @@ export default function ({ getService }: FtrProviderContext) {
return fieldStat.name === 'geo_point';
}
);
expect(geoPointFieldStats.count).to.be(39);
expect(geoPointFieldStats.index_count).to.be(10);
expect(geoPointFieldStats.count).to.be(47);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why these values changed?

Copy link
Contributor Author

@doakalexi doakalexi Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They changed bc we are now mapping the ecs fields for the stack alert aad index

Copy link
Member

@umbopepato umbopepato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected, LGTM!

@pmuellr
Copy link
Member

pmuellr commented Jun 7, 2024

I haven't finished reviewing the new code (since last review), will do that next week.

I did try it out though. I used es-apm-sys-sim 1 4 -k to start generating some documents in the index es-apm-sys-sim. This will generate docs with 4 (from the 2nd param) host.name fields like host-1, host-2, and some other simulated metric data.

Then I built an ES|QL rule with query:

from es-apm-sys-sim |
rename system.cpu.total.norm.pct as host.cpu.usage |
limit 3

(the rename changes the tool-generated index's value to an ECS-named value)

When I look at the alert document, I see the following fields that got added per the PR:

host.name: ["host-4","host-3","host-1"],
host.cpu.usage: [ 0.5, 1 ]

The host.name is easy to understand. It picked 3 documents (of 1000's) that happened to have different host.name values. Lucky. But it did the "array-ization" correctly! (of course!) \o/

The host.cpu.usage is curious though. Turns out, host 1 had cpu of 1, and hosts 3 and 4 had 0.5. So it "flattened" the host.cpu.usage values. Seems like this must have been from our code, but not sure yet.

Even if we changed this to produce what I would expect of [0.5, 0.5, 1] (the missing 0.5 value), the results are still of questionable value. I'm sure you can build some useful alerts with it in this shape, but I'm guessing people will want to see the tie between the host name and cpu usage, which you can't today.

So - maybe a bug in the code where it didn't produce 3 values for cpu usage. And a longer term question of how useful it will be to do this simple aggregation of matching field values.

@doakalexi
Copy link
Contributor Author

So - maybe a bug in the code where it didn't produce 3 values for cpu usage. And a longer term question of how useful it will be to do this simple aggregation of matching field values.

Based on the issue, I think flattening the values is how it's supposed to work. @mikecote or @shanisagiv1 maybe can confirm what behavior we want here

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM, but have some discussion topics in the PR thread ...

},
resultLimit: alertLimit,
sourceFieldsParams: params.sourceFields,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the sourceFields parameter now completely unused, by all the flavors of es query? Not clear too me, but seems like it.

If it is, we'll want to "remove" it, though how we do that in a ZDT/BWC safe way is not clear to me. We may have to leave it in the params forever, or something. But was wondering about the <SourceFields> element as well - presumably we can get rid of that?

I think for now, adding a comment to where sourceFields is defined as a param, indicating the parameter is now ignored, would be good. And then open up a separate issue to "clean up" any remaining references to it, as best we can - unless you know we don't have anything to clean up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still used by the other es query types, just removed for ES|QL. I will add a comment to say the param is ignored now for ES|QL

Copy link
Contributor Author

@doakalexi doakalexi Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in this commit e84f295

@@ -61,12 +58,11 @@ export const EsqlQueryExpression: React.FC<
groupBy: DEFAULT_VALUES.GROUP_BY,
termSize: DEFAULT_VALUES.TERM_SIZE,
searchType: SearchType.esqlQuery,
sourceFields: sourceFields ?? DEFAULT_VALUES.SOURCE_FIELDS,
sourceFields: DEFAULT_VALUES.SOURCE_FIELDS,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_VALUES.SOURCE_FIELDS seems to be [], and this seems to be the only reference to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh true, I can remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in this commit e84f295

@pmuellr
Copy link
Member

pmuellr commented Jun 10, 2024

I tried this out as follows. I used es-apm-sys-sim 1 4 -k to start generating some documents in the index es-apm-sys-sim. This will generate docs with 4 (from the 2nd param) host.name fields like host-1, host-2, and some other simulated metric data.

Then I built an ES|QL rule with query:

from es-apm-sys-sim |
rename system.cpu.total.norm.pct as host.cpu.usage |
limit 3

(the rename changes the tool-generated field name that has values, to an ECS name)

When I look at the alert document, I see the following fields that got added per the PR:

host.name: ["host-4","host-3","host-1"],
host.cpu.usage: [ 0.5, 1 ]

The host.name is easy to understand. It picked 3 documents (of 1000's) that happened to have different host.name values. Lucky. But it did the "array-ization" correctly.

The host.cpu.usage is curious though. Turns out, host 1 had cpu of 1, and hosts 3 and 4 had 0.5. So it "flattened" the host.cpu.usage values. Seems like this must have been from our code, but not sure yet.

Even if we changed this to produce what I would expect of [0.5, 0.5, 1] (the missing 0.5 value), the results are still of questionable usage. I'm sure you can build some useful alerts with it in this shape, but I'm guessing people will want to see the tie between the host name and cpu usage, which you can't today.

So - maybe a bug in the code where it didn't produce 3 values for cpu usage? I'm not sure this is a big deal, since having the somewhat "mess" of ECS values isn't going to be precise enough anyway - does it really matter if they are flattened as well as not really organized? I also couldn't find (quickly) where this might be done in our code, but assume it's our code and not an artifact of ES|QL ...

And a longer term question of figuring how to get these values organized better ... thinking we should solve this when we add "grouping", where we have similar issues to tackle.

@doakalexi
Copy link
Contributor Author

Even if we changed this to produce what I would expect of [0.5, 0.5, 1] (the missing 0.5 value), the results are still of questionable usage. I'm sure you can build some useful alerts with it in this shape, but I'm guessing people will want to see the tie between the host name and cpu usage, which you can't today.

So - maybe a bug in the code where it didn't produce 3 values for cpu usage? I'm not sure this is a big deal, since having the somewhat "mess" of ECS values isn't going to be precise enough anyway - does it really matter if they are flattened as well as not really organized? I also couldn't find (quickly) where this might be done in our code, but assume it's our code and not an artifact of ES|QL ...

And a longer term question of figuring how to get these values organized better ... thinking we should solve this when we add "grouping", where we have similar issues to tackle.

Yeah I definitely see that it might not be as useful if you can't see the link between host.name and cpu usage. This code is in https://github.com/elastic/kibana/blob/main/x-pack/plugins/triggers_actions_ui/common/data/lib/parse_aggregation_results.ts#L88. If we don't want to flatten I can go ahead and change this. I am wondering if @mikecote or @shanisagiv1 have any opinions or ideas, bc the example in the issue is flattened.

@mikecote
Copy link
Contributor

What does the alternative look like if we don't flatten, do we create new fields?

@doakalexi
Copy link
Contributor Author

doakalexi commented Jun 12, 2024

What does the alternative look like if we don't flatten, do we create new fields?

Using patrick's example it would write [0.5, 0.5, 1] instead of [0.5, 1] for cpu usage

@doakalexi
Copy link
Contributor Author

hi @nreese can I get a review on the test that changed in this PR when you get a chance pls

@pmuellr
Copy link
Member

pmuellr commented Jun 12, 2024

Yeah I definitely see that it might not be as useful if you can't see the link between host.name and cpu usage

I think that's probably a decent reason to flatten. Since we can't get the "paired" fields together, it might be nicer to get the flattened version, having it be less "noisy". Once we figure out grouping, hopefully we'll have a little better story :-)

We should definitely doc this behavior - and I forgot to mention, we need to doc how this works in general, I think? Not sure where we talk about the data in alerts. Seems like we'd want to provide an example of converting a field to an alert-doc'able field. Fine as a follow-on issue.

Copy link
Contributor

@nreese nreese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kibana-gis changes LGTM
code review only

@doakalexi
Copy link
Contributor Author

Yeah I definitely see that it might not be as useful if you can't see the link between host.name and cpu usage

I think that's probably a decent reason to flatten. Since we can't get the "paired" fields together, it might be nicer to get the flattened version, having it be less "noisy". Once we figure out grouping, hopefully we'll have a little better story :-)

We should definitely doc this behavior - and I forgot to mention, we need to doc how this works in general, I think? Not sure where we talk about the data in alerts. Seems like we'd want to provide an example of converting a field to an alert-doc'able field. Fine as a follow-on issue.

Yeah I think that makes sense too. Thanks Patrick!

@doakalexi
Copy link
Contributor Author

doakalexi commented Jun 17, 2024

Hi y'all, I got confirmation from @shanisagiv1 that we don't want to flatten the data. I updated the code in this commit 4f46742. Thanks for your help on this!

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #6 / ManualRuleRunModal should render confirmation button disabled if invalid time range has been selected
  • [job] [logs] Jest Tests #6 / ManualRuleRunModal should render confirmation button disabled if selected end date is in future
  • [job] [logs] Jest Tests #6 / ManualRuleRunModal should render confirmation button disabled if selected start date is more than 90 days in the past

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
stackAlerts 79.7KB 75.0KB -4.7KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
stackAlerts 24.8KB 24.8KB -17.0B
triggersActionsUi 114.4KB 114.4KB -4.0B
total -21.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@doakalexi doakalexi merged commit a0fb706 into elastic:main Jun 18, 2024
21 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jun 18, 2024
bhapas pushed a commit to bhapas/kibana that referenced this pull request Jun 18, 2024
…he alert doc (elastic#184541)

Resolves elastic/response-ops-team#200

## Summary

This PR copies the fields from the ES|QL query results to the alert doc.
Now that we are writing everything to the alert doc I removed the source
fields selector from the ES|QL ui.


### Checklist

- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### To verify

- Create an ES Query ES|QL rule.
- Go to [dev tools](http://localhost:5601/app/dev_tools#/console) and
run the query below to verify that the query results are written to the
alert doc.
```
GET .internal.alerts-stack.alerts-default*/_search
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants