Clarifications in monitoring chapter (#1000)
* Clarifications in monitoring chapter

* Update src/main/pages/che-7/administration-guide/assembly_monitoring-che.adoc

Co-Authored-By: Florent BENOIT <[email protected]>

* handle feedback from @skabashnyuk - add visibility to enabling and exposing che metrics

* rework Viewing Che metrics on Grafana dashboards

Co-authored-by: Florent BENOIT <[email protected]>
themr0c and benoitf authored Dec 23, 2019
1 parent 40fb6b1 commit a588080
Showing 6 changed files with 208 additions and 216 deletions.
@@ -15,52 +15,22 @@ summary:

:context: monitoring-che

{prod-short} can expose certain data as metrics that can be processed by Prometheus
ifeval::["{project-context}" == "che"]
and Grafana stack
endif::[]
.

Prometheus is a monitoring system that maintains a collection of metrics: time-series key-value data that can represent the consumption of resources such as CPU and memory, the number of processed HTTP requests and their execution time, and {prod-short}-specific resources, such as the number of users and workspaces, workspace starts and shutdowns, and information about the JsonRPC stack.

Prometheus provides a query language for manipulating the collected data and performing binary, vector, and aggregation operations on it to create a more refined view of the data.

ifeval::["{project-context}" == "che"]
Grafana offers a front-end "facade" with tools to create visual representations in the form of dashboards with various panels and graph types.
endif::[]

Note that this monitoring stack is not an official production-ready solution; it serves as an introduction.

.The structure of {prod-short} monitoring stack
image::monitoring/monitoring-che-stack-structure.png[link="{imagesdir}/monitoring/monitoring-che-stack-structure.png"]
This chapter describes how to configure {prod-short} to expose metrics and how to build an example monitoring stack with external tools to process data exposed as metrics by {prod-short}.

[id="enabling-{prod-id-short}-metrics-collections"]
== Enabling {prod-short} metrics collections

[id='prerequisites-{context}',discrete]
.Prerequisites

* Prometheus 2.9.1 or later is installed. See link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].
* Grafana 6.0 or later is installed. See link:https://grafana.com/docs/installation/[Installing Grafana].

.Procedure

. Set the `CHE_METRICS_ENABLED=true` environment variable
. Expose the `8087` port as a service on the che-master host
. Configure Prometheus to scrape metrics from the `8087` port
. Configure a Prometheus data source on Grafana
. Deploy {prod-short}-specific dashboards on Grafana
include::proc_enabling-and-exposing-che-metrics.adoc[leveloffset=+1]

include::proc_collecting-che-metrics-with-prometheus.adoc[leveloffset=+1]

ifeval::["{project-context}" == "che"]

include::proc_viewing-che-metrics-on-grafana-dashboards.adoc[leveloffset=+1]

include::proc_developing-grafana-dashboards.adoc[leveloffset=+1]
include::ref_grafana-dashboards-for-che.adoc[leveloffset=+1]

include::proc_extending-che-monitoring-metrics.adoc[leveloffset=+1]
include::proc_developing-grafana-dashboards.adoc[leveloffset=+1]

endif::[]

include::proc_extending-che-monitoring-metrics.adoc[leveloffset=+1]

:context: {parent-context-of-monitoring-che}
@@ -1,27 +1,19 @@
[id="collecting-{prod-id-short}-metrics-with-prometheus_{context}"]
= Collecting {prod-short} metrics with Prometheus

Prometheus is a monitoring system that collects metrics in real time and stores them in a time series database.
This section describes how to use the Prometheus monitoring system to collect, store and query metrics about {prod-short}.

Prometheus comes with a console accessible at the `9090` port of the application pod. By default, a template provides an existing *service* and a *route* to access it. It can be used to query and view metrics.
.Prerequisites

ifeval::["{project-context}" == "che"]
image::monitoring/monitoring-che-prometheus-console.png[link="{imagesdir}/monitoring/monitoring-che-prometheus-console.png"]
endif::[]
* {prod-short} is exposing metrics on port `8087`. See xref:enabling-and-exposing-{prod-id-short}-metrics_{context}[].

== Prometheus terminology
* Prometheus 2.9.1 or above is running. Prometheus console is running on port `9090` with a corresponding *service* and *route*. See link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].

Prometheus offers:
.Procedure

counter:: the simplest numerical metric type, whose value can only increase. A typical example is counting the number of HTTP requests that pass through the system.

gauge:: a numerical value that can increase or decrease. Best suited for representing current values of objects.

histogram:: a more complex metric suited for observations. Metrics are collected and grouped into configurable buckets, which allows presenting the results, for example, as a heat map.

== Configuring Prometheus

.Prometheus configuration
* Configure Prometheus to scrape metrics from the `8087` port
+
.Prometheus configuration example
[source,yaml,subs="+attributes"]
----
- apiVersion: v1
@@ -33,10 +25,26 @@ histogram:: a more complex metric that is suited for performing observations. Me
scrape_configs:
- job_name: 'che'
static_configs:
- targets: ['{prod-host}:8087']
- targets: ['{prod-host}:8087'] <3>
kind: ConfigMap
metadata:
name: prometheus-config
----
+
<1> The rate at which a target is scraped.
<2> The rate at which recording and alerting rules are re-checked (not used in this system at the moment).
<3> Scrape metrics from the `8087` port.

.Verification steps

* Use the Prometheus console to query and view metrics.
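
+
As a starting point, expressions like the following can be entered in the console. This is an illustrative sketch: `up` is a standard Prometheus metric, while `jvm_memory_used_bytes` assumes the default Micrometer naming, so the exact metric names depend on the {prod-short} version.
+
[source,promql]
----
# Scrape health of the 'che' target: 1 = up, 0 = down
up{job="che"}

# Current JVM heap usage of the server scraped by the 'che' job
sum(jvm_memory_used_bytes{job="che", area="heap"})
----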

.Additional resources

* link:https://prometheus.io/docs/introduction/first_steps/[First steps with Prometheus].

* link:https://prometheus.io/docs/prometheus/latest/configuration/configuration/[Configuring Prometheus].

* link:https://prometheus.io/docs/prometheus/latest/querying/basics/[Querying Prometheus].

* link:https://prometheus.io/docs/concepts/metric_types/[Prometheus metric types].
@@ -0,0 +1,10 @@
[id="enabling-and-exposing-{prod-id-short}-metrics_{context}"]
= Enabling and exposing {prod-short} metrics

This section describes how to enable and expose {prod-short} metrics.

.Procedure

. Set the `CHE_METRICS_ENABLED=true` environment variable

. Expose the `8087` port as a service on the che-master host
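
+
As an illustrative sketch (not the official deployment manifest), the port can be exposed with a Kubernetes *service* similar to the following. The `selector` labels are an assumption and must match the labels of the actual {prod-short} server pod:
+
[source,yaml]
----
apiVersion: v1
kind: Service
metadata:
  name: che-metrics
spec:
  selector:
    app: che        # assumption: adjust to the labels of the che-master pod
  ports:
    - name: metrics
      port: 8087
      targetPort: 8087
----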
@@ -1,6 +1,8 @@
[id="extending-{prod-id-short}-monitoring-metrics_{context}"]
= Extending {prod-short} monitoring metrics

This section describes how to create a metric or a group of metrics to extend the monitoring metrics that {prod-short} is exposing.

There are two major modules for metrics:

* `che-core-metrics-core` -- contains core metrics module
@@ -9,10 +11,10 @@ There are two major modules for metrics:

.Procedure

To create a metric or a group of metrics, you need a class that implements the `MeterBinder` interface. This allows registering the created metric in the overridden `bindTo(MeterRegistry registry)` method.

* Create a class that implements the `MeterBinder` interface. This allows registering the created metric in the overridden `bindTo(MeterRegistry registry)` method.
+
The following is an example of a metric that has a function that supplies the value for it:

+
.Example metric
[source,java]
----
@@ -40,13 +42,11 @@ public class UserMeterBinder implements MeterBinder {
}
}
----

+
Alternatively, the metric can be stored with a reference and updated manually in some other place in the code.
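
+
A minimal sketch of that alternative, based on the Micrometer API that the metrics modules use. The class name and metric name below are illustrative, not part of the {prod-short} code base:
+
.Example of a manually updated gauge (illustrative)
[source,java]
----
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;

import java.util.concurrent.atomic.AtomicLong;

public class QueueDepthMeterBinder implements MeterBinder {

  // Reference held by the binder; other code updates it manually.
  private final AtomicLong queueDepth = new AtomicLong();

  @Override
  public void bindTo(MeterRegistry registry) {
    // The gauge reads the current value from the reference on each scrape.
    Gauge.builder("che_example_queue_depth", queueDepth, AtomicLong::get)
        .description("Example gauge updated manually elsewhere in the code")
        .register(registry);
  }

  // Called from some other place in the code to update the metric.
  public void setQueueDepth(long value) {
    queueDepth.set(value);
  }
}
----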


.Additional resources

For more information about metric types and naming conventions, see the Prometheus documentation:

* link:https://prometheus.io/docs/practices/naming/[Naming practices]
* link:https://prometheus.io/docs/concepts/metric_types/[Metric types]
* link:https://prometheus.io/docs/practices/naming/[Metric and label naming for Prometheus]
* link:https://prometheus.io/docs/concepts/metric_types/[Metric types for Prometheus]
@@ -1,174 +1,33 @@
[id="viewing-{prod-id-short}-metrics-on-grafana-dashboards_{context}"]
= Viewing {prod-short} metrics on Grafana dashboards

Grafana is used to present Prometheus metrics visually. To provide visibility on OpenShift, the Grafana deployment configuration and ConfigMaps are located in the `che-monitoring.yaml` configuration file.
This section describes how to view {prod-short} metrics on Grafana dashboards.

.Prerequisites

== Configuring and deploying Grafana
* Prometheus is collecting metrics on the {prod-short} cluster. See xref:collecting-{prod-id-short}-metrics-with-prometheus_{context}[].

Grafana is run on port `3000` with a corresponding *service* and *route*.
* Grafana 6.0 or above is running on port `3000` with a corresponding *service* and *route*. See link:https://grafana.com/docs/installation/[Installing Grafana].

Three ConfigMaps are used to configure Grafana:

.Procedure

. Deploy {prod-short}-specific dashboards on Grafana using the `che-monitoring.yaml` configuration file.
+
Three ConfigMaps are used to configure Grafana:
+
* `grafana-datasources` -- configuration for Grafana datasource, a Prometheus endpoint
* `grafana-dashboards` -- configuration of Grafana dashboards and panels
* `grafana-dashboard-provider` -- configuration of the Grafana dashboard provider API object, which tells Grafana where to look in the file system for pre-provisioned dashboards
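
+
For illustration, a `grafana-datasources` provisioning entry pointing at the Prometheus *service* can look like the following sketch. The service name `prometheus` is an assumption and must match the actual Prometheus service:
+
[source,yaml]
----
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumption: Prometheus service name and port
    isDefault: true
----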


== Grafana dashboards overview

{prod-short} provides several types of dashboards.


=== {prod-short} server dashboard

Use case: {prod-short} server-specific metrics related to {prod-short} components, such as workspaces or users.

.The *General* panel
image::monitoring/monitoring-che-che-server-dashboard-general-panel.png[]

The *General* panel contains basic information, such as the total number of users and workspaces in the {prod-short} database.

.The *Workspaces* panel
image::monitoring/monitoring-che-che-server-dashboard-workspace-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-workspace-panel.png"]

* *Workspace start rate* -- the ratio between successful and failed started workspaces
* *Workspace stop rate* -- the ratio between successful and failed stopped workspaces
* *Workspace Failures* -- the number of workspace failures shown on the graph
* *Starting Workspaces* -- the gauge that shows the number of currently starting workspaces
* *Average Workspace Start Time* -- the 1-hour average time of workspace starts or failures
* *Average Workspace Stop Time* -- the 1-hour average time of workspace stops
* *Running Workspaces* -- the gauge that shows the number of currently running workspaces
* *Stopping Workspaces* -- the gauge that shows the number of currently stopping workspaces
* *Workspaces started under 60 seconds* -- the percentage of workspaces started under 60 seconds
* *Number of Workspaces* -- the number of workspaces created over time

.The *Users* panel
image::monitoring/monitoring-che-che-server-dashboard-users-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-users-panel.png"]

* *Number of Users* -- the number of users known to {prod-short} over time


.The *Tomcat* panel
image::monitoring/monitoring-che-che-server-dashboard-tomcat-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-tomcat-panel.png"]

* *Max number of active sessions* -- the maximum number of sessions that have been active at the same time
* *Number of current active sessions* -- the number of currently active sessions
* *Total sessions* -- the total number of sessions
* *Expired sessions* -- the number of sessions that have expired
* *Rejected sessions* -- the number of sessions that were not created because the maximum number of active sessions was reached
* *Longest time of an expired session* -- the longest time (in seconds) that an expired session had been alive

.The *Request* panel
image::monitoring/monitoring-che-che-server-dashboard-requests-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-requests-panel.png"]

The *Requests* panel displays HTTP requests in a graph that shows the average number of requests per minute.

.The *Executors* panel, part 1
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-1.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-1.png"]

* *Threads running* -- the number of live (not terminated) threads, which may include threads in a waiting or blocked state
* *Threads terminated* -- the number of threads that have finished execution
* *Threads created* -- the number of threads created by the thread factory for the given executor service
* *Created threads/minute* -- the rate of thread creation for the given executor service

.The *Executors* panel, part 2
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-2.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-2.png"]

* *Executor threads active* -- the number of threads that are actively executing tasks
* *Executor pool size* -- the current number of threads in the executor pool
* *Queued tasks* -- the approximate number of tasks that are queued for execution
* *Queue occupancy* -- the percentage of the queue used by tasks that are waiting for execution

.The *Executors* panel, part 3
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-3.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-3.png"]

* *Rejected tasks* -- the number of tasks that were rejected from execution
* *Rejected tasks/minute* -- the rate of task rejection
* *Completed tasks* -- the number of completed tasks
* *Completed tasks/minute* -- the rate of task completion

.The *Executors* panel, part 4
image::monitoring/monitoring-che-che-server-dashboard-executors-panel-4.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-executors-panel-4.png"]

* *Task execution seconds max* -- the 5-minute moving maximum of task execution time
* *Task execution seconds avg* -- the 1-hour moving average of task execution time
* *Executor idle seconds max* -- the 5-minute moving maximum of executor idle time
* *Executor idle seconds avg* -- the 1-hour moving average of executor idle time

.The *Traces* panel, part 1
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-1.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-1.png"]

* *Workspace start Max* -- the maximum workspace start time
* *Workspace start Avg* -- the 1-hour moving average of the workspace start time components
* *Workspace stop Max* -- the maximum workspace stop time
* *Workspace stop Avg* -- the 1-hour moving average of the workspace stop time components

.The *Traces* panel, part 2
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-2.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-2.png"]

* *OpenShiftInternalRuntime#start Max* -- the maximum time of the OpenShiftInternalRuntime#start operation
* *OpenShiftInternalRuntime#start Avg* -- the 1-hour moving average time of the OpenShiftInternalRuntime#start operation
* *Plugin Brokering Execution Max* -- the maximum time of the PluginBrokerManager#getTooling operation
* *Plugin Brokering Execution Avg* -- the 1-hour moving average of the PluginBrokerManager#getTooling operation

.The *Traces* panel, part 3
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-3.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-3.png"]

* *OpenShiftEnvironmentProvisioner#provision Max* -- the maximum time of the OpenShiftEnvironmentProvisioner#provision operation
* *OpenShiftEnvironmentProvisioner#provision Avg* -- the 1-hour moving average of the OpenShiftEnvironmentProvisioner#provision operation
* *Plugin Brokering Execution Max* -- the maximum execution time of PluginBrokerManager#getTooling components
* *Plugin Brokering Execution Avg* -- the 1-hour moving average execution time of PluginBrokerManager#getTooling components

.The *Traces* panel, part 4
image::monitoring/monitoring-che-che-server-dashboard-trace-panel-4.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-trace-panel-4.png"]

* *WaitMachinesStart Max* -- the maximum time of WaitMachinesStart operations
* *WaitMachinesStart Avg* -- the 1-hour moving average time of WaitMachinesStart operations
* *OpenShiftInternalRuntime#startMachines Max* -- the maximum time of OpenShiftInternalRuntime#startMachines operations
* *OpenShiftInternalRuntime#startMachines Avg* -- the 1-hour moving average time of OpenShiftInternalRuntime#startMachines operations

.The *Workspace detailed* panel
image::monitoring/monitoring-che-che-server-dashboard-workspace-detailed-panel.png[link="{imagesdir}/monitoring/monitoring-che-che-server-dashboard-workspace-detailed-panel.png"]

The *Workspace Detailed* panel contains heat maps that illustrate the average time of workspace starts or failures. Each row shows a period of time.


=== {prod-short} server JVM dashboard

Use case: JVM metrics of the {prod-short} server, such as JVM memory or classloading.

.{prod-short} server JVM dashboard
image::monitoring/monitoring-che-che-server-jvm-dashboard.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard.png"]

.Quick Facts
image::monitoring/monitoring-che-che-server-jvm-dashboard-quick-facts.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-quick-facts.png"]

.JVM Memory
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory.png"]

.JVM Misc
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-misc.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-misc.png"]

.JVM Memory Pools (heap)
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-heap.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-heap.png"]

.JVM Memory Pools (Non-Heap)
image::monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-non-heap.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-jvm-memory-pools-non-heap.png"]

.Garbage Collection
image::monitoring/monitoring-che-che-server-jvm-dashboard-garbage-collection.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-garbage-collection.png"]

.Classloading
image::monitoring/monitoring-che-che-server-jvm-dashboard-classloading.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-classloading.png"]
.Verification steps

.Buffer Pools
image::monitoring/monitoring-che-che-server-jvm-dashboard-buffer-pools.png[link="{imagesdir}/monitoring/monitoring-che-che-server-jvm-dashboard-buffer-pools.png"]
* Use the Grafana console to view {prod-short} metrics.

.Additional resources

// [discrete]
// == Additional resources
//
// * A bulleted list of links to other material closely related to the contents of the procedure module.
// * For more details on writing procedure modules, see the link:https://github.com/redhat-documentation/modular-docs#modular-documentation-reference-guide[Modular Documentation Reference Guide].
// * Use a consistent system for file names, IDs, and titles. For tips, see _Anchor Names and File Names_ in link:https://github.com/redhat-documentation/modular-docs#modular-documentation-reference-guide[Modular Documentation Reference Guide].
* link:https://grafana.com/docs/installation/[Installing Grafana].
