
Explore UX enhancements for Logs rate #10

Closed
mukeshelastic opened this issue Jan 28, 2020 · 6 comments
Labels
design For design issues

Comments

@mukeshelastic

mukeshelastic commented Jan 28, 2020

How does the Logs rate view work today?

[Screenshot: current Logs rate view showing the log entries chart and the anomalies chart]

The top chart in the Logs rate view shows the log histogram as a color-coded stacked bar chart, with each color representing a unique dataset. Below it, the same histogram is shown again, but with non-stacked buckets colored according to the anomaly score returned by the ML count job.
The ML count job compares the actual log count in each 15-minute bucket, per dataset, with an expected count it computes by analyzing log volume over a past time window, and it flags a bucket as anomalous when it finds a meaningful difference between the actual and expected counts.
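For reference, here is a minimal sketch of what a per-dataset log count job could look like when created through the Elasticsearch anomaly detection API. The job id, the field names (`event.dataset`, `@timestamp`), and the credentials are illustrative assumptions, not the actual job definition the Logs UI creates.

```python
import json

import requests

# Illustrative job body: count log events per dataset in 15-minute buckets,
# roughly matching the behaviour described above. Field names and job id are
# assumptions for the sketch, not the job the Logs UI actually sets up.
job_id = "logs-rate-per-dataset-sketch"
job_body = {
    "description": "Log rate per dataset (illustrative sketch)",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
            {
                "function": "count",
                "partition_field_name": "event.dataset",
                "detector_description": "log count partitioned by event.dataset",
            }
        ],
        "influencers": ["event.dataset"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Create the job via the Elasticsearch ML API; a datafeed would still be
# needed to feed it documents from the log indices.
response = requests.put(
    f"http://localhost:9200/_ml/anomaly_detectors/{job_id}",
    json=job_body,
    auth=("elastic", "changeme"),  # placeholder credentials
)
print(json.dumps(response.json(), indent=2))
```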

Use case

Background
In the modern technology world, end-user-facing web applications often consist of many components coupled together to make the whole system work. These components can be application code written as microservices or functions; services such as caches, databases, search engines, load balancers, and API gateways; and infrastructure components such as container deployment platforms (K8s, CF), compute, memory, and network infrastructure on premises or in the public cloud, network firewalls, and so on. Most of these components produce log data that observability practitioners use to assess component health and performance. For example, load balancers generate access and error logs, while databases, search engines, and caches generate slow query logs.

What can/will go wrong
Any one of these components could have performance, uptime, or configuration problems that suddenly cause it to fail, resulting in a cascade of failures across the stack. When a component fails, a drop in its log count is likely. When a component suddenly receives a flood of requests (DDoS attacks, a large number of user logins), a spike in its log count is likely.

What users are looking for
Any unexpected movement in these types of logs is an indicator of a potential end-user performance or uptime problem. SREs responsible for such systems need to detect and recover from such failures as quickly as possible.

How ML can help
The ML count job runs on logs originating from many datasets and can identify and flag anomalous increases or decreases in log count in real time by comparing the actual count with the expected value it computes by analyzing log counts over a past time window.
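To make the actual-vs-expected comparison concrete, here is a deliberately simplified sketch. The real count job models log rates probabilistically and produces anomaly scores; the ratio threshold below is only a stand-in to illustrate the comparison, and all field names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    start: str             # bucket start time, e.g. "2020-01-28T09:15:00Z"
    dataset: str           # e.g. "nginx.access"
    actual_count: int      # observed log count in the bucket
    expected_count: float  # typical count learned from history


def flag_anomalies(buckets, threshold=3.0):
    """Flag buckets whose actual count deviates strongly from the expected count.

    This is a toy ratio heuristic, not the probabilistic model the ML count
    job uses; it only illustrates the actual-vs-expected idea.
    """
    anomalies = []
    for b in buckets:
        baseline = max(b.expected_count, 1.0)
        ratio = b.actual_count / baseline
        if ratio >= threshold or ratio <= 1.0 / threshold:
            anomalies.append(b)
    return anomalies


if __name__ == "__main__":
    sample = [
        Bucket("2020-01-28T09:00:00Z", "nginx.access", 1200, 1100.0),
        Bucket("2020-01-28T09:15:00Z", "nginx.access", 90, 1100.0),   # sudden drop
        Bucket("2020-01-28T09:15:00Z", "mysql.slowlog", 400, 35.0),   # sudden spike
    ]
    for b in flag_anomalies(sample):
        print(f"anomalous bucket: {b.dataset} at {b.start} "
              f"(actual={b.actual_count}, expected={b.expected_count})")
```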

Suggested short-term improvements

  1. Allow easy correlation of anomalous buckets between datasets along the time axis (for example: a heat map or swim-lane style three-dimensional view, where 1D is time, 2D is the dataset name, and 3D is the bucket color), similar to how ML displays its swim lanes. In addition to these swim lanes, allow users to compare multiple datasets on a single time axis to identify which component started behaving badly first; see the data-shaping sketch after this list.

This helps users understand which components were part of a cascading failure and identify the first one to show anomalous behavior.

  2. Allow easy movement back and forth between the anomalies graphs and the log stream for the specific datasets and time range selected on the anomalies graphs.
    This helps users iterate through the troubleshooting process: figuring out which component is problematic, what went wrong in that component, and how the failures manifested in other components. Very likely users will need to jump from the anomalies graphs to filtered log streams and back until they figure out the root cause, or jump into other observability data such as metrics, APM, etc.

  3. Allow easy movement to metrics or APM when they are enabled for a given dataset.
    An anomalous log count is a potential symptom of a component problem. Looking at the metrics for that component, or for the components it depends on (e.g. Elasticsearch running on K8s), can potentially lead you to the root cause.

  4. The log entries histogram is just a histogram of logs in each dataset over time. It looks cool but provides zero value in the troubleshooting process. We should re-evaluate its existence.

  5. I doubt our existing sample web logs can showcase the real value of the ML count analysis, as they only contain a single dataset. We should explore different options for sample log data to showcase the ML value.
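As a rough sketch of the data shaping behind the swim lane idea from item 1, the anomaly records could be grouped into a dataset-by-time grid that keeps the maximum anomaly score per cell; the record shape here is a simplified assumption, not the real ML results schema.

```python
from collections import defaultdict

# Simplified anomaly records (dataset, bucket start, anomaly score 0-100).
# Field names and values are assumptions for illustration only.
records = [
    {"dataset": "nginx.access", "bucket": "09:00", "score": 12},
    {"dataset": "nginx.access", "bucket": "09:15", "score": 91},
    {"dataset": "mysql.slowlog", "bucket": "09:15", "score": 76},
    {"dataset": "mysql.slowlog", "bucket": "09:30", "score": 30},
]


def build_swimlanes(records):
    """Group anomalies into a {dataset: {bucket: max_score}} grid.

    Each dataset becomes one swim lane; each bucket becomes one cell whose
    color would be driven by the maximum anomaly score in that cell.
    """
    lanes = defaultdict(dict)
    for r in records:
        cells = lanes[r["dataset"]]
        cells[r["bucket"]] = max(cells.get(r["bucket"], 0), r["score"])
    return lanes


if __name__ == "__main__":
    for dataset, cells in sorted(build_swimlanes(records).items()):
        row = "  ".join(f"{bucket}:{score:>3}" for bucket, score in sorted(cells.items()))
        print(f"{dataset:<15} {row}")
```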

Longer-term improvements
  6. When we enable alerting, users should have a way to create alerts for critical/major anomalies.

  7. It's very likely that not all anomalies are going to be accurate. Users should have some way to provide that feedback and get guidance on how the ML job can be tuned to produce expected counts that are as accurate as possible.
@mukeshelastic
Author

mukeshelastic commented Jan 28, 2020

Katrin and I brainstormed on this topic on Monday morning, Jan 27, at 9 AM PDT, which led to most of the enhancements in the issue description. @katrin-freihofner please feel free to chime in if I missed anything from our conversation.

@katrin-freihofner
Contributor

Figma prototype

@katrin-freihofner
Contributor

Today we agreed on the basic user flow shown in this prototype.

There are still many open questions. We are going to split the changes into smaller buckets to address these questions in detail.

We think the next steps should be:

  1. Anomaly list
    Detailed design which tackles questions like: which types of anomalies do we want to show and which details do we provide

  2. Swimlane
    Detailed design which includes the interaction and UI design.

  3. The chart in the middle of the screen

...additionally, we are investigating whether it makes sense to also show the categories in the context of the log stream.

[Image: anomaly explorer]

@weltenwort
Member

What changes to the job setup process have we considered in this context? Now that a single tab shows the results of multiple jobs, we probably need some other way of managing the job lifecycles.

@katrin-freihofner
Contributor

To address these changes in detail, I split them up into the following parts (ordered by priority):

  1. ML job setup
  2. Anomaly table/list
  3. Swimlane
  4. Categories in log stream
  5. Details chart

@katrin-freihofner
Contributor

katrin-freihofner commented Mar 23, 2020

@mukeshelastic and I just had a meeting to discuss the proposed priorities and changed them as follows:

  1. Anomaly table/list
  2. ML job setup
  3. Swimlane
  4. Categories in log stream
  5. Details chart
