
Explore UX enhancements for Logs rate #10

Closed
mukeshelastic opened this issue Jan 28, 2020 · 6 comments
Labels
design For design issues

Comments

@mukeshelastic

mukeshelastic commented Jan 28, 2020

How does the Logs rate view work today?

[Screenshot: current Logs rate view showing the log entries chart and the anomalies chart]

The top chart in the Logs rate view shows the log histogram as a color-coded stacked bar chart, with each color representing a unique dataset. Below it, the same histogram is shown again, but with non-stacked buckets colored according to the anomaly score returned by the ML count job.
The ML count job compares the actual log count in each 15-minute bucket, per dataset, with an expected count it computes by analyzing log volume over a past time window, and it flags a bucket as anomalous when it finds a meaningful difference between the actual and expected counts.
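For reference, here is a minimal sketch of what a per-dataset log count job could look like when created through the Elasticsearch anomaly detection API. The job id, the field names (`event.dataset`, `@timestamp`), and the credentials are illustrative assumptions, not the actual job definition the Logs UI creates.

```python
import json

import requests

# Illustrative job body: count log events per dataset in 15-minute buckets,
# roughly matching the behaviour described above. Field names and job id are
# assumptions for the sketch, not the job the Logs UI actually sets up.
job_id = "logs-rate-per-dataset-sketch"
job_body = {
    "description": "Log rate per dataset (illustrative sketch)",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
            {
                "function": "count",
                "partition_field_name": "event.dataset",
                "detector_description": "log count partitioned by event.dataset",
            }
        ],
        "influencers": ["event.dataset"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Create the job via the Elasticsearch ML API; a datafeed would still be
# needed to feed it documents from the log indices.
response = requests.put(
    f"http://localhost:9200/_ml/anomaly_detectors/{job_id}",
    json=job_body,
    auth=("elastic", "changeme"),  # placeholder credentials
)
print(json.dumps(response.json(), indent=2))
```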

Use case

Background
In the modern technology world, end-user-facing web applications often consist of many components coupled together to make the whole system work. These components can be application code written as microservices or functions; services such as caches, databases, search engines, load balancers, and API gateways; and infrastructure components such as container deployment platforms (K8s, CF), compute, memory, and network infrastructure on premises or in the public cloud, network firewalls, and so on. Most of these components produce log data that observability practitioners use to assess component health and performance. For example, load balancers generate access and error logs, while databases, search engines, and caches generate slow query logs.

What can/will go wrong
Any one of these components could have performance, uptime, or configuration problems that suddenly cause it to fail, resulting in a cascade of failures across the stack. When a component fails, a drop in its log count is likely. When a component suddenly receives a flood of requests (DDoS attacks, a large number of user logins), a spike in its log count is likely.

What users are looking for
Any unexpected movement in these types of logs is an indicator of a potential end-user performance or uptime problem. SREs responsible for such systems need to detect and recover from such failures as quickly as possible.

How ML can help
The ML count job runs on logs originating from many datasets and can identify and flag anomalous increases or decreases in log count in real time by comparing the actual count with the expected value it computes by analyzing log counts over a past time window.
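To make the actual-vs-expected comparison concrete, here is a deliberately simplified sketch. The real count job models log rates probabilistically and produces anomaly scores; the ratio threshold below is only a stand-in to illustrate the comparison, and all field names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    start: str             # bucket start time, e.g. "2020-01-28T09:15:00Z"
    dataset: str           # e.g. "nginx.access"
    actual_count: int      # observed log count in the bucket
    expected_count: float  # typical count learned from history


def flag_anomalies(buckets, threshold=3.0):
    """Flag buckets whose actual count deviates strongly from the expected count.

    This is a toy ratio heuristic, not the probabilistic model the ML count
    job uses; it only illustrates the actual-vs-expected idea.
    """
    anomalies = []
    for b in buckets:
        baseline = max(b.expected_count, 1.0)
        ratio = b.actual_count / baseline
        if ratio >= threshold or ratio <= 1.0 / threshold:
            anomalies.append(b)
    return anomalies


if __name__ == "__main__":
    sample = [
        Bucket("2020-01-28T09:00:00Z", "nginx.access", 1200, 1100.0),
        Bucket("2020-01-28T09:15:00Z", "nginx.access", 90, 1100.0),   # sudden drop
        Bucket("2020-01-28T09:15:00Z", "mysql.slowlog", 400, 35.0),   # sudden spike
    ]
    for b in flag_anomalies(sample):
        print(f"anomalous bucket: {b.dataset} at {b.start} "
              f"(actual={b.actual_count}, expected={b.expected_count})")
```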

Suggested short-term improvements

  1. Allow easy correlation of anomalous buckets between datasets along the time axis (for example: a heat map or swim-lane style three-dimensional view, where 1D is time, 2D is the dataset name, and 3D is the bucket color), similar to how ML displays its swim lanes. In addition to these swim lanes, allow users to compare multiple datasets on a single time axis to identify which component started behaving badly first; see the data-shaping sketch after this list.

This helps users understand which components were part of a cascading failure and identify the first one to show anomalous behavior.

  2. Allow easy movement back and forth between the anomalies graphs and the log stream for the specific datasets and time range selected on the anomalies graphs.
    This helps users iterate through the troubleshooting process: figuring out which component is problematic, what went wrong in that component, and how the failures manifested in other components. Very likely users will need to jump from the anomalies graphs to filtered log streams and back until they figure out the root cause, or jump into other observability data such as metrics, APM, etc.

  3. Allow easy movement to metrics or APM when they are enabled for a given dataset.
    An anomalous log count is a potential symptom of a component problem. Looking at the metrics for that component, or for the components it depends on (e.g. Elasticsearch running on K8s), can potentially lead you to the root cause.

  4. The log entries histogram is just a histogram of logs in each dataset over time. It looks cool but provides zero value in the troubleshooting process. We should re-evaluate its existence.

  5. I doubt our existing sample web logs can showcase the real value of the ML count analysis, as they only contain a single dataset. We should explore different options for sample log data to showcase the ML value.
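As a rough sketch of the data shaping behind the swim lane idea from item 1, the anomaly records could be grouped into a dataset-by-time grid that keeps the maximum anomaly score per cell; the record shape here is a simplified assumption, not the real ML results schema.

```python
from collections import defaultdict

# Simplified anomaly records (dataset, bucket start, anomaly score 0-100).
# Field names and values are assumptions for illustration only.
records = [
    {"dataset": "nginx.access", "bucket": "09:00", "score": 12},
    {"dataset": "nginx.access", "bucket": "09:15", "score": 91},
    {"dataset": "mysql.slowlog", "bucket": "09:15", "score": 76},
    {"dataset": "mysql.slowlog", "bucket": "09:30", "score": 30},
]


def build_swimlanes(records):
    """Group anomalies into a {dataset: {bucket: max_score}} grid.

    Each dataset becomes one swim lane; each bucket becomes one cell whose
    color would be driven by the maximum anomaly score in that cell.
    """
    lanes = defaultdict(dict)
    for r in records:
        cells = lanes[r["dataset"]]
        cells[r["bucket"]] = max(cells.get(r["bucket"], 0), r["score"])
    return lanes


if __name__ == "__main__":
    for dataset, cells in sorted(build_swimlanes(records).items()):
        row = "  ".join(f"{bucket}:{score:>3}" for bucket, score in sorted(cells.items()))
        print(f"{dataset:<15} {row}")
```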

Longer-term improvements
  6. When we enable alerting, users should have a way to create alerts for critical/major anomalies.

  7. It's very likely that not all anomalies are going to be accurate. Users should have some way to provide that feedback and get guidance on how the ML job can be tuned to produce expected counts that are as accurate as possible.
@mukeshelastic
Author

mukeshelastic commented Jan 28, 2020

Katrin and I brainstormed on this topic on Monday morning, Jan 27, at 9 AM PDT, which led to most of the enhancements in the issue description. @katrin-freihofner please feel free to chime in if I missed anything from our conversation.

@katrin-freihofner
Contributor

Figma prototype

@katrin-freihofner
Contributor

Today we agreed on the basic user flow shown in this prototype.

There are still many open questions. We are going to split the changes into smaller buckets to address these questions in detail.

We think the next steps should be:

  1. Anomaly list
    Detailed design which tackles questions like: which types of anomalies do we want to show and which details do we provide

  2. Swimlane
    Detailed design which includes the interaction and UI design.

  3. The chart in the middle of the screen

...additionally, we are investigating whether it makes sense to also show the categories in the context of the log stream.

[Image: anomaly explorer]

@weltenwort
Member

What changes to the job setup process have we considered in this context? Now that a single tab shows the results of multiple jobs, we probably need some other way of managing the job lifecycles.

@katrin-freihofner
Contributor

To address these changes in detail, I split them up into the following parts (ordered by priority):

  1. ML job setup
  2. Anomaly table/list
  3. Swimlane
  4. Categories in log stream
  5. Details chart

@katrin-freihofner
Contributor

katrin-freihofner commented Mar 23, 2020

@mukeshelastic and I just had a meeting to discuss the proposed priorities and changed them as follows:

  1. Anomaly table/list
  2. ML job setup
  3. Swimlane
  4. Categories in log stream
  5. Details chart
