Clarifications in monitoring chapter (#1000)
* Clarifications in monitoring chapter

* Update src/main/pages/che-7/administration-guide/assembly_monitoring-che.adoc

Co-Authored-By: Florent BENOIT <[email protected]>

* handle feedback from @skabashnyuk - add visibility to enabling and exposing che metrics

* rework Viewing Che metrics on Grafana dashboards

Co-authored-by: Florent BENOIT <[email protected]>
themr0c and benoitf authored Dec 23, 2019
1 parent 40fb6b1 commit a588080
Showing 6 changed files with 208 additions and 216 deletions.
@@ -15,52 +15,22 @@ summary:

:context: monitoring-che

{prod-short} can expose certain data as metrics that can be processed by Prometheus
ifeval::["{project-context}" == "che"]
and Grafana stack
endif::[]
.

Prometheus is a monitoring system that maintains a collection of metrics: time-series key-value data that can represent the consumption of resources such as CPU and memory, the number of processed HTTP requests and their execution time, and {prod-short}-specific resources, such as the number of users and workspaces, workspace starts and shutdowns, and information about the JsonRPC stack.

Prometheus provides a query language for manipulating the collected data and performing binary, vector, and aggregation operations on it to create a more refined view of the data.

ifeval::["{project-context}" == "che"]
Grafana offers a front-end "facade" with tools to create visual representations in the form of dashboards with various panels and graph types.
endif::[]

Note that this monitoring stack is not an official production-ready solution; it serves as an introduction.

.The structure of {prod-short} monitoring stack
image::monitoring/monitoring-che-stack-structure.png[link="{imagesdir}/monitoring/monitoring-che-stack-structure.png"]
This chapter describes how to configure {prod-short} to expose metrics and how to build an example monitoring stack with external tools to process data exposed as metrics by {prod-short}.

[id="enabling-{prod-id-short}-metrics-collections"]
== Enabling {prod-short} metrics collections

[id='prerequisites-{context}',discrete]
.Prerequisites

* Prometheus 2.9.1 or later is installed. See link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].
* Grafana 6.0 or later is installed. See link:https://grafana.com/docs/installation/[Installing Grafana].

.Procedure

. Set the `CHE_METRICS_ENABLED=true` environment variable
. Expose the `8087` port as a service on the che-master host
. Configure Prometheus to scrape metrics from the `8087` port
. Configure a Prometheus data source on Grafana
. Deploy {prod-short}-specific dashboards on Grafana
include::proc_enabling-and-exposing-che-metrics.adoc[leveloffset=+1]

include::proc_collecting-che-metrics-with-prometheus.adoc[leveloffset=+1]

ifeval::["{project-context}" == "che"]

include::proc_viewing-che-metrics-on-grafana-dashboards.adoc[leveloffset=+1]

include::proc_developing-grafana-dashboards.adoc[leveloffset=+1]
include::ref_grafana-dashboards-for-che.adoc[leveloffset=+1]

include::proc_extending-che-monitoring-metrics.adoc[leveloffset=+1]
include::proc_developing-grafana-dashboards.adoc[leveloffset=+1]

endif::[]

include::proc_extending-che-monitoring-metrics.adoc[leveloffset=+1]

:context: {parent-context-of-monitoring-che}
@@ -1,27 +1,19 @@
[id="collecting-{prod-id-short}-metrics-with-prometheus_{context}"]
= Collecting {prod-short} metrics with Prometheus

Prometheus is a monitoring system that collects metrics in real time and stores them in a time series database.
This section describes how to use the Prometheus monitoring system to collect, store and query metrics about {prod-short}.

Prometheus comes with a console accessible at the `9090` port of the application pod. By default, a template provides an existing *service* and a *route* to access it. It can be used to query and view metrics.
.Prerequisites

ifeval::["{project-context}" == "che"]
image::monitoring/monitoring-che-prometheus-console.png[link="{imagesdir}/monitoring/monitoring-che-prometheus-console.png"]
endif::[]
* {prod-short} is exposing metrics on port `8087`. See xref:enabling-and-exposing-{prod-id-short}-metrics_{context}[].

== Prometheus terminology
* Prometheus 2.9.1 or above is running. Prometheus console is running on port `9090` with a corresponding *service* and *route*. See link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].

Prometheus offers:
.Procedure

counter:: the simplest numerical metric type, whose value can only increase. A typical example is counting the number of HTTP requests that pass through the system.

gauge:: a numerical value that can increase or decrease. Best suited for representing current values of objects.

histogram:: a more complex metric suited for observations. Metrics are collected and grouped into configurable buckets, which allows presenting the results, for example, as a heat map.

== Configuring Prometheus

.Prometheus configuration
* Configure Prometheus to scrape metrics from the `8087` port
+
.Prometheus configuration example
[source,yaml,subs="+attributes"]
----
- apiVersion: v1
@@ -33,10 +25,26 @@ histogram:: a more complex metric that is suited for performing observations. Me
scrape_configs:
- job_name: 'che'
static_configs:
- targets: ['{prod-host}:8087']
- targets: ['{prod-host}:8087'] <3>
kind: ConfigMap
metadata:
name: prometheus-config
----
+
<1> The rate at which a target is scraped.
<2> The rate at which recording and alerting rules are re-checked (not used in this system at the moment).
<3> Scrape metrics from the `8087` port.

.Verification steps

* Use the Prometheus console to query and view metrics.
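
+
As a starting point, expressions like the following can be entered in the console. This is an illustrative sketch: `up` is a standard Prometheus metric, while `jvm_memory_used_bytes` assumes the default Micrometer naming, so the exact metric names depend on the {prod-short} version.
+
[source,promql]
----
# Scrape health of the 'che' target: 1 = up, 0 = down
up{job="che"}

# Current JVM heap usage of the server scraped by the 'che' job
sum(jvm_memory_used_bytes{job="che", area="heap"})
----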

.Additional resources

* link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].

* link:https://prometheus.io/docs/prometheus/latest/configuration/configuration/[Configuring Prometheus].

* link:https://prometheus.io/docs/prometheus/latest/querying/basics/[Querying Prometheus].

* link:https://prometheus.io/docs/concepts/metric_types/[Prometheus metric types].
@@ -0,0 +1,10 @@
[id="enabling-and-exposing-{prod-id-short}-metrics_{context}"]
= Enabling and exposing {prod-short} metrics

This section describes how to enable and expose {prod-short} metrics.

.Procedure

. Set the `CHE_METRICS_ENABLED=true` environment variable

. Expose the `8087` port as a service on the che-master host
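
+
As an illustrative sketch (not the official deployment manifest), the port can be exposed with a Kubernetes *service* similar to the following. The `selector` labels are an assumption and must match the labels of the actual {prod-short} server pod:
+
[source,yaml]
----
apiVersion: v1
kind: Service
metadata:
  name: che-metrics
spec:
  selector:
    app: che        # assumption: adjust to the labels of the che-master pod
  ports:
    - name: metrics
      port: 8087
      targetPort: 8087
----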
@@ -1,6 +1,8 @@
[id="extending-{prod-id-short}-monitoring-metrics_{context}"]
= Extending {prod-short} monitoring metrics

This section describes how to create a metric or a group of metrics to extend the monitoring metrics that {prod-short} is exposing.

There are two major modules for metrics:

* `che-core-metrics-core` -- contains core metrics module
@@ -9,10 +11,10 @@ There are two major modules for metrics:

.Procedure

To create a metric or a group of metrics, you need a class that implements the `MeterBinder` interface. This allows registering the created metric in the overridden `bindTo(MeterRegistry registry)` method.

* Create a class that implements the `MeterBinder` interface. This allows registering the created metric in the overridden `bindTo(MeterRegistry registry)` method.
+
The following is an example of a metric that has a function that supplies the value for it:

+
.Example metric
[source,java]
----
@@ -40,13 +42,11 @@ public class UserMeterBinder implements MeterBinder {
}
}
----

+
Alternatively, the metric can be stored with a reference and updated manually in some other place in the code.
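
+
A minimal sketch of that alternative, based on the Micrometer API that the metrics modules use. The class name and metric name below are illustrative, not part of the {prod-short} code base:
+
.Example of a manually updated gauge (illustrative)
[source,java]
----
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;

import java.util.concurrent.atomic.AtomicLong;

public class QueueDepthMeterBinder implements MeterBinder {

  // Reference held by the binder; other code updates it manually.
  private final AtomicLong queueDepth = new AtomicLong();

  @Override
  public void bindTo(MeterRegistry registry) {
    // The gauge reads the current value from the reference on each scrape.
    Gauge.builder("che_example_queue_depth", queueDepth, AtomicLong::get)
        .description("Example gauge updated manually elsewhere in the code")
        .register(registry);
  }

  // Called from some other place in the code to update the metric.
  public void setQueueDepth(long value) {
    queueDepth.set(value);
  }
}
----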


.Additional resources

For more information about metric types and naming conventions, see the Prometheus documentation:

* link:https://prometheus.io/docs/practices/naming/[Naming practices]
* link:https://prometheus.io/docs/concepts/metric_types/[Metric types]
* link:https://prometheus.io/docs/practices/naming/[Metric and label naming for Prometheus]
* link:https://prometheus.io/docs/concepts/metric_types/[Metric types for Prometheus]
@@ -1,174 +1,33 @@
[id="viewing-{prod-id-short}-metrics-on-grafana-dashboards_{context}"]
= Viewing {prod-short} metrics on Grafana dashboards

Grafana is used to present Prometheus metrics visually. To provide visibility on OpenShift, the Grafana deployment configuration and ConfigMaps are located in the `che-monitoring.yaml` configuration file.
This section describes how to view {prod-short} metrics on Grafana dashboards.

.Prerequisites

== Configuring and deploying Grafana
* Prometheus is collecting metrics on the {prod-short} cluster. See xref:collecting-{prod-id-short}-metrics-with-prometheus_{context}[].

Grafana is run on port `3000` with a corresponding *service* and *route*.
* Grafana 6.0 or above is running on port `3000` with a corresponding *service* and *route*. See link:https://grafana.com/docs/installation/[Installing Grafana].

Three ConfigMaps are used to configure Grafana:

.Procedure

. Deploy {prod-short}-specific dashboards on Grafana using the `che-monitoring.yaml` configuration file.
+
Three ConfigMaps are used to configure Grafana:
+
* `grafana-datasources` -- configuration for Grafana datasource, a Prometheus endpoint
* `grafana-dashboards` -- configuration of Grafana dashboards and panels
* `grafana-dashboard-provider` -- configuration of the Grafana dashboard provider API object, which tells Grafana where to look in the file system for pre-provisioned dashboards
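
+
For illustration, a `grafana-datasources` provisioning entry pointing at the Prometheus *service* can look like the following sketch. The service name `prometheus` is an assumption and must match the actual Prometheus service:
+
[source,yaml]
----
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumption: Prometheus service name and port
    isDefault: true
----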


== Grafana dashboards overview

{prod-short} provides several types of dashboards.


=== {prod-short} server dashboard

Use case: {prod-short} server-specific metrics related to {prod-short} components, such as workspaces or users.

.The *General* panel
image::monitoring/monitoring-che-che-server-dashboard-general-panel.png[]

The *General* panel contains basic information, such as the total number of users and workspaces in the {prod-short} database.

.The *Workspaces* panel
image::monitoring/monitoring-che-che-server-dashboard-workspace-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-workspace-panel.png"]

* *Workspace start rate* -- the ratio between successful and failed started workspaces
* *Workspace stop rate* -- the ratio between successful and failed stopped workspaces
* *Workspace Failures* -- the number of workspace failures shown on the graph
* *Starting Workspaces* -- the gauge that shows the number of currently starting workspaces
* *Average Workspace Start Time* -- the 1-hour average time of workspace starts or failures
* *Average Workspace Stop Time* -- the 1-hour average time of workspace stops
* *Running Workspaces* -- the gauge that shows the number of currently running workspaces
* *Stopping Workspaces* -- the gauge that shows the number of currently stopping workspaces
* *Workspaces started under 60 seconds* -- the percentage of workspaces started under 60 seconds
* *Number of Workspaces* -- the number of workspaces created over time

.The *Users* panel
image::monitoring/monitoring-che-che-server-dashboard-users-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-users-panel.png"]

* *Number of Users* -- the number of users known to {prod-short} over time


.The *Tomcat* panel
image::monitoring/monitoring-che-che-server-dashboard-tomcat-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-tomcat-panel.png"]

* *Max number of active sessions* -- the maximum number of sessions that have been active at the same time
* *Number of current active sessions* -- the number of currently active sessions
* *Total sessions* -- the total number of sessions
* *Expired sessions* -- the number of sessions that have expired
* *Rejected sessions* -- the number of sessions that were not created because the maximum number of active sessions was reached
* *Longest time of an expired session* -- the longest time (in seconds) that an expired session had been alive

.The *Request* panel
image::monitoring/monitoring-che-che-server-dashboard-requests-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-requests-panel.png"]

The *Requests* panel displays HTTP requests in a graph that shows the average number of requests per minute.

.The *Executors* panel, part 1
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-1.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-1.png"]

* *Threads running* -- the number of live (not terminated) threads, which may include threads in a waiting or blocked state
* *Threads terminated* -- the number of threads that have finished execution
* *Threads created* -- the number of threads created by the thread factory for the given executor service
* *Created threads/minute* -- the rate of thread creation for the given executor service

.The *Executors* panel, part 2
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-2.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-2.png"]

* *Executor threads active* -- the number of threads that are actively executing tasks
* *Executor pool size* -- the current number of threads in the executor pool
* *Queued tasks* -- the approximate number of tasks that are queued for execution
* *Queue occupancy* -- the percentage of the queue used by tasks that are waiting for execution

.The *Executors* panel, part 3
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-3.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-3.png"]

* *Rejected tasks* -- the number of tasks that were rejected from execution
* *Rejected tasks/minute* -- the rate of task rejection
* *Completed tasks* -- the number of completed tasks
* *Completed tasks/minute* -- the rate of task completion

.The *Executors* panel, part 4
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-4.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-4.png"]

* *Task execution seconds max* -- the 5-minute moving maximum of task execution time
* *Task execution seconds avg* -- the 1-hour moving average of task execution time
* *Executor idle seconds max* -- the 5-minute moving maximum of executor idle time
* *Executor idle seconds avg* -- the 1-hour moving average of executor idle time

.The *Traces* panel, part 1
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-1.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-1.png"]

* *Workspace start Max* -- the maximum workspace start time
* *Workspace start Avg* -- the 1-hour moving average of the workspace start time components
* *Workspace stop Max* -- the maximum workspace stop time
* *Workspace stop Avg* -- the 1-hour moving average of the workspace stop time components

.The *Traces* panel, part 2
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-2.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-2.png"]

* *OpenShiftInternalRuntime#start Max* -- the maximum time of the OpenShiftInternalRuntime#start operation
* *OpenShiftInternalRuntime#start Avg* -- the 1-hour moving average time of the OpenShiftInternalRuntime#start operation
* *Plugin Brokering Execution Max* -- the maximum time of the PluginBrokerManager#getTooling operation
* *Plugin Brokering Execution Avg* -- the 1-hour moving average of the PluginBrokerManager#getTooling operation

.The *Traces* panel, part 3
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-3.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-3.png"]

* *OpenShiftEnvironmentProvisioner#provision Max* -- the maximum time of the OpenShiftEnvironmentProvisioner#provision operation
* *OpenShiftEnvironmentProvisioner#provision Avg* -- the 1-hour moving average of the OpenShiftEnvironmentProvisioner#provision operation
* *Plugin Brokering Execution Max* -- the maximum execution time of PluginBrokerManager#getTooling components
* *Plugin Brokering Execution Avg* -- the 1-hour moving average execution time of PluginBrokerManager#getTooling components

.The *Traces* panel, part 4
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-4.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-4.png"]

* *WaitMachinesStart Max* -- the maximum time of WaitMachinesStart operations
* *WaitMachinesStart Avg* -- the 1-hour moving average time of WaitMachinesStart operations
* *OpenShiftInternalRuntime#startMachines Max* -- the maximum time of OpenShiftInternalRuntime#startMachines operations
* *OpenShiftInternalRuntime#startMachines Avg* -- the 1-hour moving average time of OpenShiftInternalRuntime#startMachines operations

.The *Workspace detailed* panel
image::monitoring/monitoring-che-che-server-dashboard-workspace-detailed-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-workspace-detailed-panel.png"]

The *Workspace Detailed* panel contains heat maps that illustrate the average time of workspace starts or failures. Each row shows a period of time.


=== {prod-short} server JVM dashboard

Use case: JVM metrics of the {prod-short} server, such as JVM memory or classloading.

.{prod-short} server JVM dashboard
image::monitoring/monitoring-che-che-server-jvm-dashboard.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard.png"]

.Quick Facts
image::monitoring/monitoring-che-che-server-jvm-dashboard-quick-facts.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-quick-facts.png"]

.JVM Memory
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory.png"]

.JVM Misc
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-misc.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-misc.png"]

.JVM Memory Pools (heap)
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-heap.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-heap.png"]

.JVM Memory Pools (Non-Heap)
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-non-heap.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-non-heap.png"]

.Garbage Collection
image::monitoring/monitoring-che-che-server-jvm-dashboard-garbage-collection.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-garbage-collection.png"]

.Classloading
image::monitoring/monitoring-che-che-server-jvm-dashboard-classloading.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-classloading.png"]
.Verification steps

.Buffer Pools
image::monitoring/monitoring-che-che-server-jvm-dashboard-buffer-pools.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-buffer-pools.png"]
* Use the Grafana console to view {prod-short} metrics.

.Additional resources

// [discrete]
// == Additional resources
//
// * A bulleted list of links to other material closely related to the contents of the procedure module.
// * For more details on writing procedure modules, see the link:https://github.com/redhat-documentation/modular-docs#modular-documentation-reference-guide[Modular Documentation Reference Guide].
// * Use a consistent system for file names, IDs, and titles. For tips, see _Anchor Names and File Names_ in link:https://github.com/redhat-documentation/modular-docs#modular-documentation-reference-guide[Modular Documentation Reference Guide].
* link:https://grafana.com/docs/installation/[Installing Grafana].
