Improvement of the UI of the main alerts page for alertmanager #911
Comments
Thank you for your comment. Could you provide some more information on your use case, such as why you would like to see all alerts? Do you have, e.g., the same alert firing across hundreds of nodes? The UI currently has a filtering and grouping feature that we hope allows most users to "create" an alerts page that is relevant to them and of a manageable size. An idea that immediately comes to mind is collapsing the groupings, showing the metadata you're interested in, and then having a "click to expand" feature. This already exists, but only at the individual alert level. We also need to be mindful of the needs of all users -- it sounds like you might have many more alerts than the average user, and one of our challenges is providing a single product for teams of very different sizes. The current setup might be working well for 90% of use cases, and introducing an extra step before seeing alerts could be annoying for those users.
I think my use case would be the following: I want a page where, at first glance, I can see only the alertnames that are currently firing (maybe one alertname per row), and where I can also see the total number of servers affected by each alerting rule. This comes in handy when we're watching hundreds of servers and, for some reason, the majority of them are affected by one alerting rule. At the moment, if this were to happen, we would get a long page of alerts with one server/node per row. This becomes clunky when we also have some inhibition rules in place, and it's very hard to navigate around. Let me know if that was clear. If not, I'll try to post a graphic representation of what I'm talking about.
Hm, I'm thinking of two different things:
@beorn7 @matthiasr @grobie have you all encountered issues that aren't instance/pod specific, where having an entry for each of these unique label sets in the UI is noise you would like to collapse?
I want to second this issue, as we also have a considerable environment. The Prometheus alert pane (which collapses by default after grouping on alertname) is very readable even with hundreds of alerts spread across multiple alertnames. At a glance, you know whether things are good or bad. In Alertmanager, on the other hand, without collapsing on alertname this becomes very troublesome, and you can only judge based on the size of the scrollbar (small is ok, tiny is bad). Obviously this isn't usable, and I notice our people typically prefer the default Prometheus alert view over the Alertmanager view, even though the Alertmanager functionality isn't available there.
@stuartnelson3 👍 for suggestion # 1. @kpachhai would that make your life a little easier? @pieterdejaeghere would Stuart's suggestion # 1 work for you? You would just have to group by alertname, the list of alerts would be collapsed per group by default, and you would see the number of alerts per alertname.
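For readers skimming the thread, the collapsed view proposed above (one row per alertname plus a count of alerts in that group) can be sketched as follows. This is only an illustration of the idea, not the Alertmanager UI code: the function name `summarize_by_alertname`, the sample data, and the assumption that each alert is a dictionary with a `labels` mapping are all hypothetical.

```python
from collections import Counter


def summarize_by_alertname(alerts):
    """Return (alertname, count) pairs, most frequent first.

    Assumes each alert is a dict with a "labels" dict (hypothetical shape).
    """
    counts = Counter(
        alert.get("labels", {}).get("alertname", "<no alertname>")
        for alert in alerts
    )
    # Sort by descending count, then by name so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data: one alert per affected instance.
alerts = [
    {"labels": {"alertname": "DiskFull", "instance": "node1"}},
    {"labels": {"alertname": "DiskFull", "instance": "node2"}},
    {"labels": {"alertname": "DiskFull", "instance": "node3"}},
    {"labels": {"alertname": "HighLoad", "instance": "node1"}},
]

for name, count in summarize_by_alertname(alerts):
    print(f"{name}: {count} alert(s)")
```

Each row of the summary would then expand to the individual alerts of that group on click.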
suggestion # 1 would be excellent, as that means UI parity with the Prometheus alert screen while keeping Alertmanager goodness (silencing).
My 2¢: I rarely use the main alert page as it is now. I either select a receiver immediately to filter by, or I reach Alertmanager via a link or bookmark that already contains a receiver. I would thus leave the decision of how to represent all currently firing alerts to those who actually have the use case of looking at all alerts. However, I would like to raise a scalability concern: if the main page loads all firing alerts, it can become super slow or even crash the browser if there are many firing alerts. Usually, there aren't many alerts firing, but in the case of a widespread incident, there will be. And that's the case where you need a slick AM UI most dearly. We should thus aim for some limit on the number of alerts loaded from AM and represent that case somehow (the easiest way would be "too many alerts firing, select filter to narrow selection", or some kind of pagination; for something more complex we would need a separate endpoint for something like "group name and number of alerts in the group").
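The simplest option mentioned above, refusing to render everything once a cap is exceeded, might look like the sketch below. The function name `select_for_display` and the default `limit` of 1000 are illustrative assumptions; the thread does not settle on a threshold or on where such a check would live.

```python
def select_for_display(alerts, limit=1000):
    """Return (alerts_to_render, message).

    If more than `limit` alerts are firing, render none and return a hint
    asking the user to filter; otherwise return the alerts unchanged.
    The limit is an arbitrary placeholder, not a measured threshold.
    """
    if len(alerts) > limit:
        message = (
            f"too many alerts firing ({len(alerts)}), "
            "select filter to narrow selection"
        )
        return [], message
    return list(alerts), None
```

A cap like this keeps the page responsive during a widespread incident, at the cost of hiding detail until the user picks a filter; pagination or a per-group count endpoint would avoid that trade-off but needs more backend support.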
I can't remember properly right now, but I believe at one point we loaded 1000 alerts, the app scrolled fine, and we moved on. It would be good to know if this is a problem before we start engineering around it. @kpachhai @pieterdejaeghere can you give a rough estimate of how many alerts are being shown on the page, and if there's performance degradation/crashes?
Since AM is the one instance where all alerts of an organization arrive, we have to assume there might be millions. I'm not claiming we need to be scalable to infinity, but the UI needs to scale to the same number of firing alerts that the AM backend can handle.
And one other aspect: if I'm on-call on the road, I sometimes need to work with a low-bandwidth (e.g. shared crappy WiFi) and/or very expensive (e.g. tethering my phone) connection to the internet. Even if thousands of alerts load fine with a broadband connection, and the browser on my 16GiB laptop is fine handling that, it might be very inconvenient in a situation like the above, where the first thing I do might be to look at the AM landing page (perhaps even in the browser of my smallish mobile phone).
We currently have a bit over 400 open and performance seems great (i5-6300U, 16 GB, Firefox). However, with 13000 monitored devices, I wouldn't be surprised if we could have a theoretical alarm count of multiple tens of thousands. Because of the difficulty in displaying alerts in Alertmanager, I must say my experience is limited; this is not a production implementation yet.
Hah, this is a no-judging zone :) Ok, so this issue is breaking out into a few different things, so let's rein it back in to the initial issue: collapsing groups on the alerts page. I'll make a separate issue summarizing the points brought up by @beorn7. @w0rm do you have any spare cycles to look at this?
suggestion # 1 seems good enough for my use case as well. Thanks, everyone!
Closed by #1876. Additional refinements should be opened as separate issues.
At the moment, if we have lots of alerting rules and, accordingly, lots of alerts firing, the alerts page just shows one long list of all the alerts. It would be nice to improve the UI for this page so it's easier on the eyes. I was thinking we could summarize the different kinds of alerts on the first page, and for each kind show how many servers/nodes are affected by that particular alert, plus any other summary that might be helpful. This would be a list categorized by kind of alert. Each row would then have a link that takes you to another page showing all the details for that specific alert.
Basically, an overhaul of the main Alertmanager page.