Explore UX enhancements for Logs rate #10
Comments
Katrin and I brainstormed on this topic on Monday, 27th Jan, at 9 AM PDT, which led to most of the enhancements in the issue description. @katrin-freihofner please feel free to chime in if I missed anything from our conversation.
Today we agreed on the basic user flow shown in this prototype. There are still many open questions. We are going to split the changes into smaller buckets to address these questions in detail. We think the next steps should be:
...additionally, we are investigating whether it makes sense to also show the categories in the context of the log stream.
What changes to the job setup process have we considered in this context? Now that a single tab shows the results of multiple jobs, we probably need some other way of managing the job lifecycles.
To address these changes in detail, I split them up into the following parts (ordered by priority):
@mukeshelastic and I just had a meeting to discuss the proposed priorities and changed them as follows:
How logs rate works today
The top chart in the logs rate view shows the log entries histogram as a color-coded stacked bar chart, with each color representing a unique dataset. Below that chart, the same histogram is shown again, but its buckets are non-stacked and are colored according to the anomaly score returned by the ML count job.
This ML count job compares the actual log count in each 15-minute bucket, per dataset, with an expected count that it computed by analyzing log counts over the past X units of time, and it flags a bucket as anomalous if it finds a meaningful difference between the actual and expected counts.
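For reference, here is a minimal sketch of what such a count job could look like when created directly through Elasticsearch's anomaly detection REST API. The job id, index and field names (`event.dataset`, `@timestamp`), and cluster URL are illustrative assumptions, not the exact configuration the Logs UI creates, and the datafeed that would connect the job to the log indices is omitted:

```python
# Sketch only: a count job partitioned by dataset with 15-minute buckets,
# similar in shape to the job described above. All names are assumptions.
import requests

ES_URL = "http://localhost:9200"   # assumed local cluster
JOB_ID = "log-count-by-dataset"    # hypothetical job id

job_config = {
    "description": "Count of log entries per dataset, 15m buckets",
    "analysis_config": {
        "bucket_span": "15m",  # matches the 15-minute buckets mentioned above
        "detectors": [
            {
                "function": "count",                      # model the raw event count
                "partition_field_name": "event.dataset",  # one model per dataset
            }
        ],
        "influencers": ["event.dataset"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Create the anomaly detection job; a datafeed (not shown) would feed it
# documents from the log indices.
resp = requests.put(f"{ES_URL}/_ml/anomaly_detectors/{JOB_ID}", json=job_config)
resp.raise_for_status()
print(resp.json())
```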
Use case
Background
In the modern technology world, end-user-facing web applications often consist of many components coupled together to make the whole system work. These components could be application code written as microservices or functions, services such as caches, databases, search engines, load balancers, and API gateways, and infrastructure components such as container deployment platforms (K8s, CF), compute, memory, and network infrastructure on prem or in the public cloud, network firewalls, etc. Most of these components produce log data, which observability practitioners use to assess component health and performance. For example, load balancers generate access and error logs, while databases, search engines, and caches generate slow query logs.
What can/will go wrong
Any one of these components could have performance, uptime, or configuration problems that suddenly cause it to fail, resulting in a cascade of failures across the stack. When a component fails, there is a likelihood of a drop in the log count. If a component receives a sudden flood of requests (DDoS attacks, a large number of user logins), then there is a likelihood of a spike in the log count.
What users are looking for
Any unexpected movement in these types of logs is an indicator of a potential end-user performance or uptime problem. SREs responsible for such systems need to detect and recover from such failures as soon as possible.
How ML can help
The ML count job runs on logs originating from many datasets and can help identify and flag anomalous increases or decreases in log count in real time by comparing the count with the expected value it computed by analyzing log counts over the past X units of time.
Suggested short term improvements
This helps users understand which components were part of a cascading failure and identify the first one to show anomalous behavior.
Allow easy movement back and forth between the anomalies graphs and the log stream for the specific datasets and time selected on the anomalies graphs.
This helps users iterate on the troubleshooting process: from figuring out which component is problematic, to what went wrong in that component, to how the failures manifested in other components. Very likely users will need to jump from the anomalies graphs to filtered log streams and back until they figure out the root cause, or jump into other observability data such as metrics, APM, etc. (see the query sketch after this list).
Allow easy movement to metrics or APM when they are enabled for a given dataset
An anomalous log count is a potential symptom of a component problem. Looking at metrics for that component or the components it depends upon (e.g. Elasticsearch running on K8s) can potentially lead you to the root cause.
The log entries histogram is just a histogram of logs in each dataset across time. It looks cool but provides zero value in the troubleshooting process. We should re-evaluate its existence.
I doubt our existing sample web logs can showcase the real value of the ML count analysis, as they only contain a single dataset. So we should explore different options for sample log data to showcase the ML value.
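To make the anomalies-graph-to-log-stream hand-off mentioned above concrete: the filtered log stream essentially corresponds to a search scoped to one dataset and the time range selected on the chart. This is only a sketch; the index pattern, dataset name, and time range below are assumptions for illustration:

```python
# Sketch of the query behind a "filtered log stream": log entries for one
# dataset within the time range picked on the anomalies chart. The index
# pattern, field names, and values are illustrative assumptions.
import requests

ES_URL = "http://localhost:9200"

query = {
    "size": 100,
    "sort": [{"@timestamp": "asc"}],
    "query": {
        "bool": {
            "filter": [
                {"term": {"event.dataset": "nginx.error"}},  # dataset picked on the chart
                {"range": {"@timestamp": {                    # the anomalous 15-minute bucket
                    "gte": "2020-01-27T09:00:00Z",
                    "lt": "2020-01-27T09:15:00Z",
                }}},
            ]
        }
    },
}

resp = requests.post(f"{ES_URL}/filebeat-*/_search", json=query)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("message", ""))
```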
Longer term improvements
6. When we enable alerting, users should have a way to create alerts for critical/major anomalies.
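As a rough sketch of the condition such an alert could evaluate, assuming direct read access to the ML results indices, one could look for recent anomaly records from the count job whose record_score falls in the major/critical range (>= 50 on the 0-100 scale). The job id and lookback window are assumptions carried over from the earlier sketch:

```python
# Sketch only: find recent "major" or "critical" anomaly records for the
# hypothetical count job by searching the ML results indices directly.
import requests

ES_URL = "http://localhost:9200"
JOB_ID = "log-count-by-dataset"  # hypothetical job id from the earlier sketch

query = {
    "size": 10,
    "sort": [{"record_score": "desc"}],
    "query": {
        "bool": {
            "filter": [
                {"term": {"job_id": JOB_ID}},
                {"term": {"result_type": "record"}},         # individual anomaly records
                {"range": {"record_score": {"gte": 50}}},    # major >= 50, critical >= 75
                {"range": {"timestamp": {"gte": "now-1h"}}}, # recent anomalies only
            ]
        }
    },
}

resp = requests.post(f"{ES_URL}/.ml-anomalies-*/_search", json=query)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    record = hit["_source"]
    print(record["timestamp"], record.get("partition_field_value"), record["record_score"])
```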