Clarifications in monitoring chapter #1000

Merged 4 commits on Dec 23, 2019.
@@ -15,52 +15,22 @@ summary:

:context: monitoring-che

{prod-short} can expose certain data as metrics that can be processed by Prometheus
ifeval::["{project-context}" == "che"]
and the Grafana stack
endif::[]
.

Prometheus is a monitoring system that maintains a collection of metrics: time-series key-value data that can represent the consumption of resources such as CPU and memory, the number of processed HTTP requests and their execution time, and {prod-short}-specific resources, such as the number of users and workspaces, workspace starts and shutdowns, and information about the JSON-RPC stack.

Prometheus provides a query language that allows manipulating the collected data and performing various binary, vector, and aggregation operations on it, to help create a more refined view of the data.
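For example, the following PromQL query (illustrative; `http_requests_total` is a conventional metric name, not necessarily one exposed by {prod-short}) combines a rate calculation with an aggregation to show the per-second HTTP request rate over the last five minutes, summed by status label:

```promql
sum by (status) (rate(http_requests_total[5m]))
```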

ifeval::["{project-context}" == "che"]
Grafana offers a front-end facade with tools to create visual representations of the data in the form of dashboards with various panels and graph types.
endif::[]

Note that this monitoring stack is not an official production-ready solution; its purpose is introductory.

.The structure of {prod-short} monitoring stack
image::monitoring/monitoring-che-stack-structure.png[link="{imagesdir}/monitoring/monitoring-che-stack-structure.png"]
This chapter describes how to configure {prod-short} to expose metrics and how to build an example monitoring stack with external tools to process data exposed as metrics by {prod-short}.

[id="enabling-{prod-id-short}-metrics-collections"]
== Enabling {prod-short} metrics collections

[id='prerequisites-{context}',discrete]
.Prerequisites

* Installed Prometheus 2.9.1 or above. See link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].
* Installed Grafana 6.0 or above. See link:https://grafana.com/docs/installation/[Installing Grafana].

.Procedure

. Set the `CHE_METRICS_ENABLED=true` environment variable
. Expose the `8087` port as a service on the che-master host
. Configure Prometheus to scrape metrics from the `8087` port
. Configure a Prometheus data source on Grafana
. Deploy {prod-short}-specific dashboards on Grafana
include::proc_enabling-and-exposing-che-metrics.adoc[leveloffset=+1]

include::proc_collecting-che-metrics-with-prometheus.adoc[leveloffset=+1]

ifeval::["{project-context}" == "che"]

include::proc_viewing-che-metrics-on-grafana-dashboards.adoc[leveloffset=+1]

include::proc_developing-grafana-dashboards.adoc[leveloffset=+1]
include::ref_grafana-dashboards-for-che.adoc[leveloffset=+1]

include::proc_extending-che-monitoring-metrics.adoc[leveloffset=+1]
include::proc_developing-grafana-dashboards.adoc[leveloffset=+1]

endif::[]

include::proc_extending-che-monitoring-metrics.adoc[leveloffset=+1]

:context: {parent-context-of-monitoring-che}
@@ -1,27 +1,19 @@
[id="collecting-{prod-id-short}-metrics-with-prometheus_{context}"]
= Collecting {prod-short} metrics with Prometheus

Prometheus is a monitoring system that collects metrics in real time and stores them in a time series database.
This section describes how to use the Prometheus monitoring system to collect, store and query metrics about {prod-short}.

Prometheus comes with a console accessible on port `9090` of the application pod. By default, a template provides an existing *service* and a *route* to access it. Use the console to query and view metrics.
.Prerequisites

ifeval::["{project-context}" == "che"]
image::monitoring/monitoring-che-prometheus-console.png[link="{imagesdir}/monitoring/monitoring-che-prometheus-console.png"]
endif::[]
* {prod-short} is exposing metrics on port `8087`. See xref:enabling-and-exposing-{prod-id-short}-metrics_{context}[].

== Prometheus terminology
* Prometheus 2.9.1 or above is running. Prometheus console is running on port `9090` with a corresponding *service* and *route*. See link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].

Prometheus offers:
.Procedure

counter:: the simplest numerical type of metric, whose value can only increase. A typical example is counting the number of HTTP requests that go through the system.

gauge:: a numerical value that can increase or decrease. Best suited for representing the current value of an object.

histogram:: a more complex metric suited for performing observations. Observations are counted and grouped in configurable buckets, which allows presenting the results, for example, in the form of a heat map.
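To make the cumulative-bucket behaviour of a histogram concrete, here is a minimal sketch in plain Python (not tied to any Prometheus client library; bucket bounds and sample values are illustrative):

```python
# Sketch of how a Prometheus histogram groups observations into
# cumulative buckets. Bucket bounds and samples are illustrative.
bounds = [0.1, 0.5, 1.0, float("inf")]      # upper bounds, the "le" labels
observations = [0.05, 0.3, 0.7, 2.0, 0.4]   # e.g. request durations in seconds

# Each bucket counts every observation less than or equal to its bound,
# so counts accumulate from bucket to bucket.
buckets = {b: sum(1 for o in observations if o <= b) for b in bounds}

count = len(observations)   # corresponds to the _count series
total = sum(observations)   # corresponds to the _sum series

print(buckets)                 # {0.1: 1, 0.5: 3, 1.0: 4, inf: 5}
print(count, round(total, 2))  # 5 3.45
```

Because the counts are cumulative, a visualization tool can derive the per-bucket distribution by subtracting adjacent buckets, which is what makes heat-map rendering possible.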

== Configuring Prometheus

.Prometheus configuration
* Configure Prometheus to scrape metrics from port `8087`
+
.Prometheus configuration example
[source,yaml,subs="+attributes"]
----
- apiVersion: v1
  data:
    prometheus.yml: |-
      global:
        scrape_interval: 5s <1>
        evaluation_interval: 5s <2>
      scrape_configs:
        - job_name: 'che'
          static_configs:
            - targets: ['{prod-host}:8087'] <3>
  kind: ConfigMap
  metadata:
    name: prometheus-config
----
+
<1> The interval at which Prometheus scrapes each target.
<2> The interval at which recording and alerting rules are re-checked (not used in this system at the moment).
<3> Scrape metrics from port `8087`.

.Verification steps

* Use the Prometheus console to query and view metrics.
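Metrics can also be queried programmatically through the Prometheus HTTP API. The following sketch builds an instant-query URL for the `/api/v1/query` endpoint; the host name is a hypothetical placeholder for your Prometheus *route* or *service*:

```python
from urllib.parse import urlencode

def prometheus_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

# "http://prometheus:9090" is a hypothetical in-cluster address; use the
# route exposed for your Prometheus service instead.
url = prometheus_query_url("http://prometheus:9090", "process_cpu_seconds_total")
print(url)  # http://prometheus:9090/api/v1/query?query=process_cpu_seconds_total
```

Any HTTP client can then fetch this URL and receive the query result as JSON.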

.Additional resources

* link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].

* link:https://prometheus.io/docs/prometheus/latest/configuration/configuration/[Configuring Prometheus].

* link:https://prometheus.io/docs/prometheus/latest/querying/basics/[Querying Prometheus].

* link:https://prometheus.io/docs/concepts/metric_types/[Prometheus metric types].
@@ -0,0 +1,10 @@
[id="enabling-and-exposing-{prod-id-short}-metrics_{context}"]
= Enabling and exposing {prod-short} metrics

This section describes how to enable and expose {prod-short} metrics.

.Procedure

. Set the `CHE_METRICS_ENABLED=true` environment variable.

. Expose port `8087` as a service on the che-master host.
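As a sketch of the second step, a Kubernetes *service* exposing port `8087` could look as follows. The service name and selector labels are assumptions for illustration, not taken from an official template; adjust them to match your {prod-short} deployment:

```yaml
# Sketch only: the service name and selector are assumptions;
# adjust them to the labels used by your che-master pod.
apiVersion: v1
kind: Service
metadata:
  name: che-metrics
spec:
  selector:
    app: che          # must match the labels on the che-master pod
  ports:
    - name: metrics
      port: 8087
      targetPort: 8087
```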
@@ -1,6 +1,8 @@
[id="extending-{prod-id-short}-monitoring-metrics_{context}"]
= Extending {prod-short} monitoring metrics

This section describes how to create a metric or a group of metrics to extend the monitoring metrics that {prod-short} is exposing.

There are two major modules for metrics:

* `che-core-metrics-core` -- contains core metrics module
Expand All @@ -9,10 +11,10 @@ There are two major modules for metrics:

.Procedure

To create a metric or a group of metrics, you need a class that extends the `MeterBinder` class. This allows you to register the created metric in the overridden `bindTo(MeterRegistry registry)` method.

* Create a class that extends the `MeterBinder` class. This allows you to register the created metric in the overridden `bindTo(MeterRegistry registry)` method.
+
The following is an example of a metric that has a function that supplies the value for it:

+
.Example metric
[source,java]
----
public class UserMeterBinder implements MeterBinder {

  // ... fields and constructor collapsed in the diff view ...

  @Override
  public void bindTo(MeterRegistry registry) {
    // Register a gauge whose value is supplied by a function.
    // The metric name is shown as an example.
    Gauge.builder("che.user.total", this::count).register(registry);
  }

  // ... count() and remaining members collapsed in the diff view ...
}
----

+
Alternatively, store the metric in a reference and update it manually elsewhere in the code.


.Additional resources

For more information about the types of metrics and naming conventions, see the Prometheus documentation:

* link:https://prometheus.io/docs/practices/naming/[Naming practices]
* link:https://prometheus.io/docs/concepts/metric_types/[Metric types]
* link:https://prometheus.io/docs/practices/naming/[Metric and label naming for Prometheus]
* link:https://prometheus.io/docs/concepts/metric_types/[Metric types for Prometheus]
@@ -1,174 +1,33 @@
[id="viewing-{prod-id-short}-metrics-on-grafana-dashboards_{context}"]
= Viewing {prod-short} metrics on Grafana dashboards

Grafana is used for the visual representation of Prometheus metrics. The Grafana deployment configuration and ConfigMaps are located in the `che-monitoring.yaml` configuration file.
This section describes how to view {prod-short} metrics on Grafana dashboards.

.Prerequisites

== Configuring and deploying Grafana
* Prometheus is collecting metrics on the {prod-short} cluster. See xref:collecting-{prod-id-short}-metrics-with-prometheus_{context}[].

Grafana is run on port `3000` with a corresponding *service* and *route*.
* Grafana 6.0 or above is running on port `3000` with a corresponding *service* and *route*. See link:https://grafana.com/docs/installation/[Installing Grafana].

Three ConfigMaps are used to configure Grafana:

.Procedure

. Deploy {prod-short}-specific dashboards on Grafana using the `che-monitoring.yaml` configuration file.
+
Three ConfigMaps are used to configure Grafana:
+
* `grafana-datasources` -- configuration for Grafana datasource, a Prometheus endpoint
* `grafana-dashboards` -- configuration of Grafana dashboards and panels
* `grafana-dashboard-provider` -- configuration of the Grafana dashboard provider API object, which tells Grafana where to look in the file system for pre-provisioned dashboards
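As an illustration of the first of these, a Grafana data source provisioning file that points Grafana at a Prometheus endpoint typically looks like the following sketch; the URL is a hypothetical in-cluster address:

```yaml
# Sketch of a Grafana data source provisioning file; the URL is a
# hypothetical in-cluster address for the Prometheus service.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```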


== Grafana dashboards overview

{prod-short} provides several types of dashboards.


=== {prod-short} server dashboard

Use case: {prod-short} server-specific metrics related to {prod-short} components, such as workspaces or users.

.The *General* panel
image::monitoring/monitoring-che-che-server-dashboard-general-panel.png[]

The *General* panel contains basic information, such as the total number of users and workspaces in the {prod-short} database.

.The *Workspaces* panel
image::monitoring/monitoring-che-che-server-dashboard-workspace-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-workspace-panel.png"]

* *Workspace start rate* -- the ratio between successful and failed started workspaces
* *Workspace stop rate* -- the ratio between successful and failed stopped workspaces
* *Workspace Failures* -- the number of workspace failures shown on the graph
* *Starting Workspaces* -- the gauge that shows the number of currently starting workspaces
* *Average Workspace Start Time* -- the 1-hour average of workspace start or failure times
* *Average Workspace Stop Time* -- the 1-hour average of workspace stop times
* *Running Workspaces* -- the gauge that shows the number of currently running workspaces
* *Stopping Workspaces* -- the gauge that shows the number of currently stopping workspaces
* *Workspaces started under 60 seconds* -- the percentage of workspaces started under 60 seconds
* *Number of Workspaces* -- the number of workspaces created over time

.The *Users* panel
image::monitoring/monitoring-che-che-server-dashboard-users-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-users-panel.png"]

* *Number of Users* -- the number of users known to {prod-short} over time


.The *Tomcat* panel
image::monitoring/monitoring-che-che-server-dashboard-tomcat-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-tomcat-panel.png"]

* *Max number of active sessions* -- the maximum number of sessions that have been active at the same time
* *Number of current active sessions* -- the number of currently active sessions
* *Total sessions* -- the total number of sessions
* *Expired sessions* -- the number of sessions that have expired
* *Rejected sessions* -- the number of sessions that were not created because the maximum number of active sessions was reached
* *Longest time of an expired session* -- the longest time (in seconds) that an expired session had been alive

.The *Request* panel
image::monitoring/monitoring-che-che-server-dashboard-requests-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-requests-panel.png"]

The *Requests* panel displays HTTP requests in a graph that shows the average number of requests per minute.

.The *Executors* panel, part 1
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-1.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-1.png"]

* *Threads running* -- the number of live (not terminated) threads, which may include threads in a waiting or blocked state
* *Threads terminated* -- the number of threads that have finished execution
* *Threads created* -- the number of threads created by the thread factory for the given executor service
* *Created threads/minute* -- the rate of thread creation for the given executor service

.The *Executors* panel, part 2
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-2.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-2.png"]

* *Executor threads active* -- the number of threads that actively execute tasks
* *Executor pool size* -- the current number of threads in the executor pool
* *Queued tasks* -- the approximate number of tasks queued for execution
* *Queued occupancy* -- the percentage of the queue occupied by tasks waiting for execution

.The *Executors* panel, part 3
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-3.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-3.png"]

* *Rejected tasks* -- the number of tasks that were rejected from execution
* *Rejected tasks/minute* -- the rate of task rejection
* *Completed tasks* -- the number of completed tasks
* *Completed tasks/minute* -- the rate of task completion

.The *Executors* panel, part 4
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-4.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-4.png"]

* *Task execution seconds max* -- the 5-minute moving maximum of task execution time
* *Task execution seconds avg* -- the 1-hour moving average of task execution time
* *Executor idle seconds max* -- the 5-minute moving maximum of executor idle time
* *Executor idle seconds avg* -- the 1-hour moving average of executor idle time

.The *Traces* panel, part 1
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-1.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-1.png"]

* *Workspace start Max* -- the maximum workspace start time
* *Workspace start Avg* -- the 1-hour moving average of the workspace start time components
* *Workspace stop Max* -- the maximum workspace stop time
* *Workspace stop Avg* -- the 1-hour moving average of the workspace stop time components

.The *Traces* panel, part 2
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-2.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-2.png"]

* *OpenShiftInternalRuntime#start Max* -- the maximum time of the OpenShiftInternalRuntime#start operation
* *OpenShiftInternalRuntime#start Avg* -- the 1-hour moving average time of the OpenShiftInternalRuntime#start operation
* *Plugin Brokering Execution Max* -- the maximum time of the PluginBrokerManager#getTooling operation
* *Plugin Brokering Execution Avg* -- the 1-hour moving average of the PluginBrokerManager#getTooling operation

.The *Traces* panel, part 3
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-3.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-3.png"]

* *OpenShiftEnvironmentProvisioner#provision Max* -- the maximum time of the OpenShiftEnvironmentProvisioner#provision operation
* *OpenShiftEnvironmentProvisioner#provision Avg* -- the 1-hour moving average of the OpenShiftEnvironmentProvisioner#provision operation
* *Plugin Brokering Execution Max* -- the maximum execution time of the PluginBrokerManager#getTooling components
* *Plugin Brokering Execution Avg* -- the 1-hour moving average execution time of the PluginBrokerManager#getTooling components

.The *Traces* panel, part 4
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-4.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-4.png"]

* *WaitMachinesStart Max* -- the maximum time of WaitMachinesStart operations
* *WaitMachinesStart Avg* -- the 1-hour moving average time of WaitMachinesStart operations
* *OpenShiftInternalRuntime#startMachines Max* -- the maximum time of OpenShiftInternalRuntime#startMachines operations
* *OpenShiftInternalRuntime#startMachines Avg* -- the 1-hour moving average time of OpenShiftInternalRuntime#startMachines operations

.The *Workspace detailed* panel
image::monitoring/monitoring-che-che-server-dashboard-workspace-detailed-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-workspace-detailed-panel.png"]

The *Workspace Detailed* panel contains heat maps that illustrate the average time of workspace starts or failures. Each row represents a period of time.


=== {prod-short} server JVM dashboard

Use case: JVM metrics of the {prod-short} server, such as JVM memory or classloading.

.{prod-short} server JVM dashboard
image::monitoring/monitoring-che-che-server-jvm-dashboard.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard.png"]

.Quick Facts
image::monitoring/monitoring-che-che-server-jvm-dashboard-quick-facts.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-quick-facts.png"]

.JVM Memory
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory.png"]

.JVM Misc
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-misc.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-misc.png"]

.JVM Memory Pools (heap)
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-heap.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-heap.png"]

.JVM Memory Pools (Non-Heap)
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-non-heap.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-non-heap.png"]

.Garbage Collection
image::monitoring/monitoring-che-che-server-jvm-dashboard-garbage-collection.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-garbage-collection.png"]

.Classloading
image::monitoring/monitoring-che-che-server-jvm-dashboard-classloading.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-classloading.png"]
.Verification steps

.Buffer Pools
image::monitoring/monitoring-che-che-server-jvm-dashboard-buffer-pools.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-buffer-pools.png"]
* Use the Grafana console to view {prod-short} metrics.

.Additional resources

* link:https://grafana.com/docs/installation/[Installing Grafana].