Stack Monitoring + Kibana Alerting #45571

Open
1 of 3 tasks
chrisronline opened this issue Sep 12, 2019 · 17 comments

@chrisronline
Contributor

chrisronline commented Sep 12, 2019

This issue is to serve as a list of issues/questions/comments/blockers for the Stack Monitoring team as they investigate migrating existing watches to Kibana alerting, with the goal of feature and functional parity.

Blockers

  • There is currently no way to create alerts without using the HTTP API. This is not great for Stack Monitoring, as we need to create default alerts for all users who enable monitoring. (Related: [DISCUSS] Alerting + Security #36836)
  • We need a default email action to exist so our default alerts can send an email to a configured (through kibana.yml) email without needing to do any SMTP setup. See this comment
  • We need to be able to register/run alerts in a space-agnostic sense

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring

@cachedout
Contributor

I'm going to loop in a few people on the Kibana Alerting team who will probably be interested in following this issue:

@peterschretlen @mikecote @pmuellr

@mikecote mikecote self-assigned this Sep 13, 2019
@mikecote
Contributor

Had a chat with @pmuellr to discuss some options. We will discuss these with @bmcconaghy and @kobelb next. Some notes:

  • In order to access the alertsClient and actionsClient we would need the credentials for a user (ex: stack monitoring user). We can then create the alerts and actions on their behalf.
  • We could make action types able to execute directly without requiring an action saved object. This would make more sense for the action types that don't need config / secrets (ex: server log, direct email, etc).
  • We could add a flag within the action types / alert types to indicate executeAsSystem or something. This would tell the framework not to use API keys and to not provide a savedObjectsClient and callCluster to the executor (those pieces are the only ones user specific). System alerts could only execute system actions. User alerts could execute system actions and user actions.
  • Would have to come up with a system when multiple Kibana instances are running so only one can create the alert and actions.
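
To make the executeAsSystem idea from the third bullet a bit more concrete, here is a rough Python sketch of the rule it describes. The `is_system` field and the `can_execute` function are hypothetical names chosen for illustration; nothing like them exists in the alerting framework today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertType:
    id: str
    is_system: bool  # hypothetical stand-in for an "executeAsSystem" flag


@dataclass(frozen=True)
class ActionType:
    id: str
    is_system: bool  # system actions need no per-user config or secrets


def can_execute(alert: AlertType, action: ActionType) -> bool:
    """System alerts may only run system actions; user alerts may run both."""
    if alert.is_system:
        return action.is_system
    return True
```

Under this rule a system alert paired with a user-configured email action would be rejected, while a user alert could still use a system action such as server log.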

@peterschretlen
Contributor

After some discussion I think there are 2 main issues:

  1. Security. How can we initiate stack monitoring alerts tied to a user with the right privileges? We want to avoid tying alerts or actions to the kibana_system role user - they should be tied to either the user doing the configuration, or to a service account. The system user can’t generate API keys itself, and it’s preferable not to augment its access.
  2. Spaces. Alerts and actions are designed to be space-aware, and in general Kibana features should be space aware. I think the issue is what does it mean/look like to have Stack Monitoring be space-aware and how do we transition from space-agnostic?

I think the first point (tying the actions and alerts to a user) implies some UI setup step. Maybe it is clicking a ‘turn on alerting’ button (or maybe attaching it to the existing ‘turn on monitoring’ button). That would actually address both points:

  • allows the default email action and the alerts to be tied to an API key associated with the user or service account doing the setup. The config details could still be in a yaml (though I think the UI would be more cloud-friendly than yaml settings?), but the alerts and actions would not be created until initiated by a user.
  • allows the user to choose the space in which alerts appear (the space they are currently in). This implies that going to another space and turning on alerts again would create redundant alerts, but I think that is OK, maybe even desirable long term once alerts can be configured.

This does imply breaking changes (config and a move away from space-agnostic), but I think through 7.x we’d need an opt-in strategy for Kibana alerting in Stack Monitoring anyway.

@chrisronline @cachedout would be curious to hear your thoughts on having a UI initiated setup step. With that assumption I think we can address the current list of blockers.

Perhaps there are non-UI options, but I think they'd still have to involve a user account and spaces.

@cachedout
Contributor

Hi @peterschretlen. Thanks for the response to this.

I agree with the way you've categorized the issues here. I'll begin with a discussion of Spaces.

The transition to Alerting has, in my mind, always been the right point for Stack Monitoring to become Spaces-aware. I'm not sure how much we want to get into that in this discussion, but for an initial round of changes, I could see something like the following:

A role that has the 'Read' privilege:

  • Cannot enable monitoring if it is not already enabled
  • Cannot CRUD alerts

A role that has the 'All' privilege, obviously would not have the above restrictions. For the CRUD of the alerts themselves, I'm thinking that we'd want to mirror what I understand the alerting privilege model to be -- which is to tie the ability to modify a given alert to the privileges granted by the role/space the current user is in. (Please correct me if this understanding is incorrect!)

Regarding the initial setup step, if I understand your proposal correctly, we'd end up with a release where existing watcher-based alerts were not "migrated" to the new Alerting platform, although a setup process could be initiated which could recreate them in the new Alerting platform. In truth, the more I think about this initial "alerting setup" step, the more I like it. Not only does it provide a clean way to recreate old watcher-based alerts, it also provides a means for us to give the user an experience to potentially select existing metrics for alerts as well. I think I can start to see the shape of how that would look but I'm really interested in what @chrisronline thinks here.

Regarding the point about these initial alerts essentially being tied to the user who initiates the action, I think this is fine, so long as we allow for the possibility that the "setup step" could be re-initiated at any point so that if they didn't intend this behavior, they could switch to the account they intended for long-term alert management. Alternatively, perhaps the Kibana alert management application could allow a super-user to migrate alert ownership between users? I don't know if that's been discussed at all but could provide another means for a user to address this concern if it came up.

@peterschretlen
Contributor

Regarding the point about these initial alerts essentially being tied to the user who initiates the action, I think this is fine, so long as we allow for the possibility that the "setup step" could be re-initiated at any point so that if they didn't intend this behavior, they could switch to the account they intended for long-term alert management. Alternatively, perhaps the Kibana alert management application could allow a super-user to migrate alert ownership between users? I don't know if that's been discussed at all but could provide another means for a user to address this concern if it came up.

Good point, makes sense there should be a way to reset the alerts/re-run the setup. Regarding ownership and transfer, we do have createdBy and updatedBy fields as of #41389. Right now any editing of the alert ends up generating a new API key as the editor, effectively transferring ownership. So a no-op update could be used to transfer ownership. This might be another operation found in the centralized alert listing ( cc @mdefazio ). Though I think initially just being able to reset or re-run setup would probably be sufficient.
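
As a rough illustration of the "no-op update transfers ownership" behavior, here is a small Python model. It is only a sketch: `created_by` and `updated_by` mirror the createdBy and updatedBy fields mentioned above, while `api_key_owner` and `update_alert` are hypothetical names, not the real alert schema or client API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Alert:
    name: str
    created_by: str
    updated_by: str
    api_key_owner: str  # hypothetical: the user whose API key the alert runs with


def update_alert(alert: Alert, editor: str, **changes) -> Alert:
    """Apply changes; any update, even an empty one, re-keys the alert as the editor."""
    return replace(alert, updated_by=editor, api_key_owner=editor, **changes)
```

Calling `update_alert(alert, "bob")` with no changes leaves the alert definition untouched but moves the API key to bob, which is the transfer described above.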

@chrisronline
Contributor Author

I want to clarify something:

We need a default email action to exist so our default alerts can send an email to a configured (through kibana.yml) email without needing to do any SMTP setup.

In my original post, this is actually not accurate. For cluster alerts to work, the user needs to configure an outgoing SMTP server through xpack.notification.email settings in elasticsearch.yml. This is actually more setup required than I originally thought, so I think we can disregard that as a blocker item (I'm going to edit the description, mark it done, and link to this comment) since we can just move this setup to Kibana.

@chrisronline
Contributor Author

I think we need to talk about a few things here:

  1. Net new user experience, where they have no cluster alerts
  2. Existing user experience, where they are currently using cluster alerts
  3. Alerting management within spaces

Net new user experience, where they have no cluster alerts

I think the overall experience will be close. The main difference in the setup process will be the need to hit an API endpoint or click a UI button that will create the cluster alerts, using a given user's credentials and a provided space id. I think an API endpoint is important here so users can automate this setup (just like they can currently with cluster alerts). It's important to note that an API endpoint should also be available to support the creation of the email action (and appropriate SMTP configuration) which the cluster alerts will use.

Existing user experience, where they are currently using cluster alerts

We need to ensure that we can properly disable existing cluster alerts before starting to run new Kibana cluster alerts. This might be tricky, as we don't have the necessary permission set for the monitoring_user to interact with watch data (we only have the ability to read from .monitoring-alerts-* indices which cluster alerts write to). Curious to @cachedout's thoughts here.

Alerting management within spaces

First off, I don't think Stack Monitoring is a good fit for spaces. AFAIK, spaces exist as a way to segment data to various users to ensure users only see the data that matters to them (for example, they can only see the list of dashboards that affects their part of the organization instead of needing to filter/search for the ones they want in a giant list of dashboards). We don't really have anything like that currently in Stack Monitoring - I think the assumption that a single type of user is the only user accessing Stack Monitoring is a safe bet. Could we imagine some scenarios where some users might want to only see certain monitoring data? Sure, but I don't think those are requests we hear from users (please correct me here if there is supporting data). I don't know why we need to fit a square into a circle, which it feels like we are doing here.

Secondly, let's say we do integrate into spaces. It's a strange experience for users to create cluster alerts in one space and not be able to see them in another space. This could easily create duplication of alerts (which @peterschretlen mentioned earlier), which feels like it will be confusing to users.

Let's jump forward a bit in our roadmap and imagine we have customizable alerts in monitoring for various metrics we collect. For example, we have a CPU threshold alert on all nodes in our ES cluster at 90%. If this is configured in space A and a user has access to space A and space B, it seems likely they could go to space B (for maybe another purpose), go to Stack Monitoring, and not see that alert. Maybe they'd wonder if it was deleted and try and recreate it? That feels like a confusing experience.

Thoughts on this @cachedout?

@cachedout
Contributor

cachedout commented Sep 24, 2019

This might be tricky, as we don't have the necessary permission set for the monitoring_user to interact with watch data (we only have the ability to read from .monitoring-alerts-* indices which cluster alerts write to). Curious to @cachedout's thoughts here.

My thinking here has been that we try to do this by using the blacklist setting:

cluster_alerts.management.blacklist

    Prevents the creation of specific cluster alerts. It also removes any applicable watches that already exist in the current cluster.

    You can add any of the following watch identifiers to the blacklist:

        elasticsearch_cluster_status
        elasticsearch_version_mismatch
        elasticsearch_nodes
        kibana_version_mismatch
        logstash_version_mismatch
        xpack_license_expiration

    For example: ["elasticsearch_version_mismatch","xpack_license_expiration"].

To test this, I used this query: GET /.watches/_search?filter_path=hits.total.value

Then to remove a few cluster alerts, I used this:

PUT /_cluster/settings
{
  "transient": {
    "xpack": {
      "monitoring": {
        "exporters": {
          "local": {
            "type": "local",
            "cluster_alerts.management.blacklist": [
              "elasticsearch_cluster_status",
              "xpack_license_expiration",
              "kibana_version_mismatch",
              "logstash_version_mismatch"
            ]
          }
        }
      }
    }
  }
}

I then re-ran GET /.watches/_search?filter_path=hits.total.value and observed three fewer watches in the index.
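
If we end up generating this request from code, a minimal Python sketch of building the body might look like the following. The function name `build_blacklist_settings` and the `KNOWN_WATCH_IDS` constant are hypothetical helpers for illustration; the watch identifiers come from the documentation quoted above, and actually sending the request is left out.

```python
import json

# Watch identifiers listed in the setting's documentation quoted above.
KNOWN_WATCH_IDS = {
    "elasticsearch_cluster_status",
    "elasticsearch_version_mismatch",
    "elasticsearch_nodes",
    "kibana_version_mismatch",
    "logstash_version_mismatch",
    "xpack_license_expiration",
}


def build_blacklist_settings(exporter, watch_ids, exporter_type="local"):
    """Return a body for PUT /_cluster/settings that blacklists watch_ids."""
    unknown = sorted(set(watch_ids) - KNOWN_WATCH_IDS)
    if unknown:
        raise ValueError(f"unknown watch identifiers: {unknown}")
    return {
        "transient": {
            "xpack": {
                "monitoring": {
                    "exporters": {
                        exporter: {
                            "type": exporter_type,
                            "cluster_alerts.management.blacklist": list(watch_ids),
                        }
                    }
                }
            }
        }
    }


body = build_blacklist_settings(
    "local", ["elasticsearch_cluster_status", "xpack_license_expiration"]
)
print(json.dumps(body, indent=2))
```

Validating against the known identifiers up front is a design choice in this sketch, so that a typo fails locally instead of being sent to the cluster.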

WDYT @chrisronline ?

@cachedout
Contributor

We don't really have anything like that currently in Stack Monitoring - I think the assumption that a single type of user is the only user accessing Stack Monitoring is a safe bet. Could we imagine some scenarios where some users might want to only see certain monitoring data?

I can see this need in the future. There are a few things that lead me to believe it's desirable to use Spaces to allow users to segment monitoring data by space.

  1. Right now it's a "Gold" feature for people to be able to send multiple clusters to a single dedicated monitoring cluster. Given that we're considering this ability a premium feature, to me, it's a natural follow-on that we should give users the ability to set privileges around which cluster is viewable from which space.

  2. There are some outstanding requests for this. Here are a few:

https://github.com/elastic/enhancements/issues/6894
https://github.com/elastic/enhancements/issues/4146

To your point, though, this isn't a feature that's requested a lot, but we do get requests for it here and there. I suspect, too, that some use cases are going to emerge for this over time, especially as we start to add additional services like Site Search into Stack Monitoring.

Let's jump forward a bit in our roadmap and imagine we have customizable alerts in monitoring for various metrics we collect. For example, we have a CPU threshold alert on all nodes in our ES cluster at 90%. If this is configured in space A and a user has access to space A and space B, it seems likely they could go to space B (for maybe another purpose), go to Stack Monitoring, and not see that alert. Maybe they'd wonder if it was deleted and try and recreate it? That feels like a confusing experience.

I agree that this is a potential problem. I would also like to understand what happens if a space is removed. Do all the alerts get removed along with it?

TBH, it feels like what's required here is to not have an alert tied to one and only one space, but to be able to select which spaces an alert appears in, and perhaps to have this be all spaces by default. That said, I don't know enough about how @peterschretlen and @mikecote are thinking about this model and I would be quite interested to hear their thoughts.

@chrisronline
Contributor Author

My thinking here has been that we try to do this by using the blacklist setting:

Interesting. I haven't used that before. I can play with it and verify, but that might work. I wonder how this will play out when internal collection is disabled. Hopefully we can be fully migrated to Kibana alerting before that and it's not a concern, but something to consider.

@peterschretlen
Contributor

peterschretlen commented Sep 25, 2019

I agree that this is a potential problem. I would also like to understand what happens if a space is removed. Do all the alerts get removed along with it?

Yes, deleting a space will delete all the objects in that space, including the alerts.

TBH, it feels like what's required here is to not have an alert tied to one and only one space, but to be able to select which spaces an alert appears in, and perhaps to have this be all spaces by default. That said, I don't know enough about how @peterschretlen and @mikecote are thinking about this model and I would be quite interested to hear their thoughts.

My understanding is that before spaces, people would create multiple Kibana instances so they could get their own view of ES data. So the question I ask myself is: if someone wanted to set up multiple Kibana instances pointing at the same stack monitoring indices, what would they be separating? Probably cluster access in the multi-cluster scenario, but I think alerts would be another. SREs want early warnings about a long garbage collection or a spike in indexing rate, but logging users might only understand or care whether the cluster is red/yellow/green and whether it affects their ability to use Discover.

So I lean towards alerts being isolated to a space. I don't imagine there are a lot of use cases for multi-space Stack Monitoring, but I do think Stack Monitoring fits the spaces model well. The more apps we have that are space-aware, the more useful spaces become.

That said, we are heading in the direction of moving/sharing between spaces, with saved object enhancements like copy to space and considering sharing between spaces.

@chrisronline
Contributor Author

@cachedout I've been playing with your idea and I really like it, but I'm not sure of the best way to handle it.

I've been researching and playing with two different ways:

  1. When enabling Kibana alerting (which will be a button click in the UI), we update all exporters to blacklist all the cluster alerts.
  2. Changing Elasticsearch code to always blacklist all the cluster alerts.

The first one is nice because it doesn't involve any changes to Elasticsearch, but if a user creates another exporter after enabling Kibana alerting, new cluster alerts are created. We could add a check in Kibana to always ensure all exporters have the same blacklist.
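
The "always ensure all exporters have the same blacklist" check could look roughly like this in Python. This is a sketch under assumptions: `exporters_missing_blacklist` is a hypothetical helper, and the input is assumed to be a dict of exporter name to settings with the blacklist stored under the `cluster_alerts.management.blacklist` key, which may not match the shape the cluster settings API actually returns.

```python
BLACKLIST_KEY = "cluster_alerts.management.blacklist"


def exporters_missing_blacklist(exporters, required):
    """Return {exporter name: sorted missing watch ids} for exporters lacking any."""
    missing = {}
    for name, settings in exporters.items():
        # An exporter created after Kibana alerting was enabled has no blacklist yet.
        current = set(settings.get(BLACKLIST_KEY, []))
        gap = sorted(set(required) - current)
        if gap:
            missing[name] = gap
    return missing
```

Kibana could run a check like this periodically and re-apply the blacklist to any exporter that shows up in the result.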

The second one is nice because we don't have to do anything in Kibana, but it does mean that users won't be able to get cluster alerts working ever again. I don't necessarily think we will need to, but it feels possible that something may not work correctly in the first release of Kibana cluster alerts and users might need a fallback. We could get around this by adding configs though.

Thoughts?

@chrisronline
Contributor Author

To add some more thoughts to the above ^^

I think we have to go with option 2. Another downside of option 1 is that it only works with the cluster connected to Kibana: we are unable to update cluster settings for other monitored clusters.

@cachedout
Contributor

we are unable to update cluster settings for other monitored clusters.

While this is a downside, I'm not sure it's a reason to discard this option entirely. If we just document (perhaps as a pop-up when clicking the button to enable Kibana alerting?) that blacklists need to be modified for other non-connected clusters?

The reason that I don't think this is too much of an issue is that the number of current alerts is relatively small and their appearance for most users is exceedingly rare. Even if they don't follow this step and the worst-case is that the watch continues to exist, it's an easy (and hopefully well-documented) fix for them.

Thoughts?

@chrisronline
Contributor Author

If we just document (perhaps as a pop-up when clicking the button to enable Kibana alerting?) that blacklists need to be modified for other non-connected clusters?

Yes, this is an option, but it feels like we should try to avoid that if we can (and I think we can in this situation). I think we should always favor the path that involves the least amount of work for the user.

The reason that I don't think this is too much of an issue is that the number of current alerts is relatively small and their appearance for most users is exceedingly rare. Even if they don't follow this step and the worst-case is that the watch continues to exist, it's an easy (and hopefully well-documented) fix for them.

Perhaps, I don't know the data on this honestly. I'm not sure how many folks have more than one production cluster, but it does mean that folks with more than one will need to potentially perform two separate actions: one button click in Kibana, and the other being a manual curl to the other cluster(s) to update the cluster settings.

On top of all of this, I think it gets more complicated when we think about a slow rollout of our migration. It probably makes sense (see discussion here) to release these migrations incrementally as they are ready so we can learn and fix issues along the way. Assuming we want this approach, it complicates the docs a bit, since users won't be blacklisting all cluster alerts, but just the few that we have Kibana alerts for.

Option 2 fits in nicely here, as we can simply update the explicit blacklist in each new version in which we add more Kibana alerts, and the user will not have to worry about anything (theoretically at least).
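
To sketch what a per-version blacklist could look like, here is a small Python illustration. The version numbers and the assignment of watches to releases in `MIGRATED_IN` are made-up placeholders, and the real change would live in Elasticsearch rather than in Python.

```python
# Hypothetical rollout: which watches gain a Kibana alert in which release.
# The versions and groupings here are placeholders, not a real plan.
MIGRATED_IN = {
    (7, 7): ["xpack_license_expiration"],
    (7, 8): ["elasticsearch_cluster_status", "elasticsearch_nodes"],
}


def blacklist_for(version):
    """Return the watches to blacklist for a (major, minor) version tuple."""
    watches = []
    for released, ids in sorted(MIGRATED_IN.items()):
        # Every watch migrated in this release or an earlier one stays blacklisted.
        if released <= version:
            watches.extend(ids)
    return watches
```

Each release only appends to the mapping, so the blacklist grows as more alerts are migrated and never needs user input.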

Do you have any issues with option 2?

@cachedout
Contributor

Do you have any issues with option 2?

The case you make here is really good. I mentioned over in the discussion on rollout strategies that I think we should gradually introduce the migrated alerts but not enable them until they're all ready. Given that, I think option 2 where we'd just blacklist the existing alerts en masse makes sense.
