issue-465 Create a documentation section to use Grafana DataSource with SonataFlow Prometheus metrics #693

Open
wants to merge 5 commits into
base: main

Contributor

@jianrongzhang89 jianrongzhang89 commented Dec 10, 2024

Fix apache/incubator-kie-kogito-serverless-operator#465

Update the document to include Prometheus and Grafana installation, Grafana Data Source configuration, and import of the default dashboard.

  • You have read the contributions doc
  • Pull Request title is properly formatted: Issue-XYZ Subject
  • Pull Request title contains the target branch if not targeting main: [0.9.x] Issue-XYZ Subject
  • The nav.adoc file has a link to this guide in the proper category
  • The index.adoc file has a card to this guide in the proper category, with a meaningful description

@ricardozanini
Member

@jianrongzhang89 can you please take a look at the CI?

@jianrongzhang89 jianrongzhang89 force-pushed the monitoring branch 2 times, most recently from 3250a02 to d95b765 Compare December 11, 2024 10:22
Contributor

github-actions bot commented Dec 11, 2024

🎊 PR Preview ea1866d has been successfully built and deployed. See the documentation preview: https://sonataflow-docs-preview-pr-693.surge.sh

@jianrongzhang89
Contributor Author

@jianrongzhang89 can you please take a look at the CI?

@ricardozanini fixed CI errors.

Contributor

@wmedvede wmedvede left a comment

I followed the whole document for the OpenShift installation and it worked fine.
See the image below with my workflows.

image

Guide is working.
LGTM

image::cloud/operator/monitoring/grafana-dashboard-example.png[]

=== Customize or build your own dashboard
You can customize or build your own dashboard. For more information, see xref:https://grafana.com/docs/grafana/latest/dashboards[Grafana Dashboards] and xref:cloud/operator/sonataflow-metrics.adoc[SonataFlow Metrics].
Contributor

This link: xref:https://grafana.com/docs/grafana/latest/dashboards[Grafana Dashboards] is not working.
I think that for external links you must use the link: tag instead.

Contributor Author

Fixed.
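
For reference, a minimal sketch of the fix discussed in this thread, assuming the docs are built with Antora-style AsciiDoc: `xref:` is meant for pages inside the documentation site, while an external URL is written with the `link:` macro (or as a bare URL followed by the link text in brackets). The exact line committed in the PR is not shown in this conversation, so treat this as an illustration only.

```asciidoc
// Internal page: xref resolves against the documentation site.
xref:cloud/operator/sonataflow-metrics.adoc[SonataFlow Metrics]

// External page: use the link macro (or the bare URL) instead of xref.
link:https://grafana.com/docs/grafana/latest/dashboards[Grafana Dashboards]
https://grafana.com/docs/grafana/latest/dashboards[Grafana Dashboards]
```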

== Additional resources

* xref:cloud/operator/sonataflow-metrics.adoc[SonataFlow Metrics]
* xref:https://grafana.com/docs/grafana/latest/dashboards[Grafana Dashboards]
Contributor

Same here, non-working link.

Contributor Author

Fixed

secureJsonData:
  httpHeaderValue1: 'Bearer ${TOKEN}'
name: Prometheus
url: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
Contributor

So, access to the metrics is ultimately "protected", and they can be accessed only if we give the cluster-monitoring-view role to Grafana, right?

Contributor Author

That's correct: the metrics are exposed to Prometheus but are not available in Grafana without that role.
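
As a hedged illustration of the point above, this is roughly how a guide could document granting the `cluster-monitoring-view` cluster role to the service account Grafana runs as on OpenShift. The service account name `grafana-sa` and the namespace `grafana` are hypothetical placeholders, not values taken from this PR; substitute the ones used by your Grafana installation.

```asciidoc
// Assumption: Grafana runs as service account `grafana-sa` in namespace `grafana`.
.Grant Grafana read access to the cluster monitoring stack
[source,shell]
----
oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-sa -n grafana
----
```

A token issued for that service account is then what the data source would send as the `Bearer ${TOKEN}` header value shown in the snippet under review.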

@@ -0,0 +1,99 @@
= SonataFlow Metrics
Contributor

I think this page is not shown in the Cloud -> Operator menu or in any other menu entry.

On the other hand, we already have a sort of metrics page that shows the metrics, and I personally don't like it 100%.
I think what we must do is merge this metrics content with what is shown in the page below, and provide something good.
But that is outside @jianrongzhang89's scope.

See below:
image

Feel free to merge as is, so we don't lose this content, and we can restructure in a follow-up PR.
@ricardozanini @domhanak

Contributor Author

@ricardozanini @wmedvede @domhanak I tried to merge them. Please review.

@wmedvede
Contributor

Would you mind checking the procedure for regular Kubernetes clusters? @domhanak

@jianrongzhang89 jianrongzhang89 force-pushed the monitoring branch 2 times, most recently from eb1bac5 to 4892c25 Compare December 11, 2024 21:10
@@ -0,0 +1,134 @@
== Overview
Contributor

@jianrongzhang89 this common sonataflow_metrics document is great, many thanks.
My only observation is that the order in which the metrics appear in the document is not the same as the order shown in the initial paragraph, which roughly corresponds to the workflow's "natural" life-cycle.

see:
Screenshot from 2024-12-12 13-18-27

Contributor Author

Good suggestion. Done!


== Metrics Description
=== kogito_process_instance_completed_total
Workflow instances that have reached a terminal status, “Aborted” or “Completed”, and thus are considered as completed.
Contributor

Suggested change
- Workflow instances that have reached a terminal status, “Aborted” or “Completed”, and thus are considered as completed.
+ Workflow instances that have reached a terminal status, `Aborted` or `Completed`, and thus are considered as completed.

Contributor Author

Accepted


[NOTE]
====
These are the only two terminal status. The “Error” state is not terminal.
Contributor

Suggested change
- These are the only two terminal status. The “Error” state is not terminal.
+ These are the only two terminal status. The `Error` state is not terminal.

Contributor Author

Accepted


[NOTE]
====
This includes workflow instances that are in the "Error" state, since the error state is not a terminal state.
Contributor

Suggested change
- This includes workflow instances that are in the "Error" state, since the error state is not a terminal state.
+ This includes workflow instances that are in the `Error` state, since the error state is not a terminal state.

Contributor Author

Accepted

[NOTE]
====
This includes workflow instances that are in the "Error" state, since the error state is not a terminal state.
Process instances that have reached a terminal status, i.e. "Completed" or "Aborted", are not present in this metric.
Contributor

Suggested change
- Process instances that have reached a terminal status, i.e. "Completed" or "Aborted", are not present in this metric.
+ Process instances that have reached a terminal status, i.e. `Completed` or `Aborted`, are not present in this metric.

Contributor Author

Accepted

----

=== kogito_process_instance_duration_seconds
Calculates duration of a workflow instance that has reached a terminal state,, i.e. "Aborted" or "Completed". This metric is registered when the process reaches the terminal state.
Contributor

Suggested change
- Calculates duration of a workflow instance that has reached a terminal state,, i.e. "Aborted" or "Completed". This metric is registered when the process reaches the terminal state.
+ Calculates duration of a workflow instance that has reached a terminal state, i.e. `Aborted` or `Completed`. This metric is registered when the process reaches the terminal state.

Contributor Author

Accepted

= Monitoring Workflows
:compat-mode!:
// Metadata:
:description: Workflows monitoring configuration configuration
Contributor

Suggested change
- :description: Workflows monitoring configuration configuration
+ :description: Workflows monitoring configuration

Contributor Author

Accepted

Member

@ricardozanini ricardozanini left a comment

Many thanks, @jianrongzhang89. This documentation seems good. Thanks, @wmedvede, for verifying the steps in the cluster!

@ricardozanini
Member

@kaldesai mind taking a look too?

Contributor

@wmedvede wmedvede left a comment

Hi @jianrongzhang89, I couldn't resist adding a few more nitpicks when re-reading 😄


In {product_name}, you can check the following metrics:

* `kogito_process_instance_started_total`: Number of started workflows (a workflow that has started might be running or completed)
Contributor

Suggested change
- * `kogito_process_instance_started_total`: Number of started workflows (a workflow that has started might be running or completed)
+ * `kogito_process_instance_started_total`: Number of started workflows.

Contributor Author

ok


* `kogito_process_instance_started_total`: Number of started workflows (a workflow that has started might be running or completed)
* `kogito_process_instance_running_total`: Number of running workflows
* `kogito_process_instance_completed_total`: Number of completed workflows
Contributor

Suggested change
- * `kogito_process_instance_completed_total`: Number of completed workflows
+ * `kogito_process_instance_completed_total`: Number of completed workflows.

Contributor Author

ok

* `kogito_process_instance_started_total`: Number of started workflows (a workflow that has started might be running or completed)
* `kogito_process_instance_running_total`: Number of running workflows
* `kogito_process_instance_completed_total`: Number of completed workflows
* `kogito_process_instance_error`: Number of workflows that report an error ( a workflow with an error might be still running or have been completed)
Contributor

Suggested change
- * `kogito_process_instance_error`: Number of workflows that report an error ( a workflow with an error might be still running or have been completed)
+ * `kogito_process_instance_error`: Number of workflows that report an error.

* `kogito_process_instance_running_total`: Number of running workflows
* `kogito_process_instance_completed_total`: Number of completed workflows
* `kogito_process_instance_error`: Number of workflows that report an error ( a workflow with an error might be still running or have been completed)
* `kogito_process_instance_duration_seconds`: Duration of a process instance in seconds
Contributor

Suggested change
- * `kogito_process_instance_duration_seconds`: Duration of a process instance in seconds
+ * `kogito_process_instance_duration_seconds`: Duration of a workflow instance in seconds.

Contributor Author

done

* `kogito_process_instance_completed_total`: Number of completed workflows
* `kogito_process_instance_error`: Number of workflows that report an error ( a workflow with an error might be still running or have been completed)
* `kogito_process_instance_duration_seconds`: Duration of a process instance in seconds
* `kogito_node_instance_duration_milliseconds`: Duration of relevant nodes in milliseconds (a workflow is composed by nodes, user might be interested on the time consumed by an specific node type)
Contributor

Suggested change
- * `kogito_node_instance_duration_milliseconds`: Duration of relevant nodes in milliseconds (a workflow is composed by nodes, user might be interested on the time consumed by an specific node type)
+ * `kogito_node_instance_duration_milliseconds`: Duration of relevant nodes in milliseconds.

Contributor Author

ok


[NOTE]
====
Internally, workflows are referred as processes. Therefore, the `processId` and `processName` is workflow ID and name respectively.
Contributor

Suggested change
- Internally, workflows are referred as processes. Therefore, the `processId` and `processName` is workflow ID and name respectively.
+ Internally, workflows are referred as processes. Therefore, the `processId` and `processName` are workflow id and name respectively.

Contributor Author

ok

Internally, workflows are referred as processes. Therefore, the `processId` and `processName` is workflow ID and name respectively.
====

Each of the metrics mentioned previously contains a label for a specific workflow ID. For example, the `kogito_process_instance_completed_total` metric below contains the labels for `callbackstatetimeouts` workflow:
Contributor

Suggested change
- Each of the metrics mentioned previously contains a label for a specific workflow ID. For example, the `kogito_process_instance_completed_total` metric below contains the labels for `callbackstatetimeouts` workflow:
+ Each of the metrics mentioned previously contains a label for a specific workflow id. For example, the `kogito_process_instance_completed_total` metric below contains the labels for `callbackstatetimeouts` workflow:

Contributor Author

ok

----

=== kogito_process_instance_duration_seconds
Calculates duration of a workflow instance that has reached a terminal state,, i.e. `Aborted` or `Completed`. This metric is registered when the process reaches the terminal state.
Contributor

Suggested change
- Calculates duration of a workflow instance that has reached a terminal state,, i.e. `Aborted` or `Completed`. This metric is registered when the process reaches the terminal state.
+ Calculates duration of a workflow instance that has reached a terminal state, i.e. `Aborted` or `Completed`. This metric is registered when the process reaches the terminal state.

Contributor Author

ok

* `kogito_process_instance_error`: Number of workflows that report an error ( a workflow with an error might be still running or have been completed)
* `kogito_process_instance_duration_seconds`: Duration of a process instance in seconds
* `kogito_node_instance_duration_milliseconds`: Duration of relevant nodes in milliseconds (a workflow is composed by nodes, user might be interested on the time consumed by an specific node type)
* `sonataflow_input_parameters_counter`: Records input parameters, the occurrences of <"param_name","param_value"> per `processId`.
Contributor

Suggested change
- * `sonataflow_input_parameters_counter`: Records input parameters, the occurrences of <"param_name","param_value"> per `processId`.
+ * `sonataflow_input_parameters_counter_total`: Records input parameters, the occurrences of <"param_name","param_value"> per `processId`.

Contributor Author

ok

…ith SonataFlow Prometheus metrics: address review comments
Contributor

@wmedvede wmedvede left a comment

Hi @RichardW98, a few nitpicks, and I have also re-installed the Grafana dashboard after these last modifications.
It is working well, great work!

Just an observation regarding the dashboard, see screenshots please:

In my tests, I have these workflows: callbackstatetimeouts and callbackstatetimeouts-gitops.

The dashboard works fine:

image

However, in the filters a "greeting" value is shown.

Screenshot from 2024-12-20 10-39-10

Screenshot from 2024-12-20 10-39-17

@jianrongzhang89
Contributor Author

@wmedvede I updated the PR based on your comments above. Thanks.

…ith SonataFlow Prometheus metrics: address review comments
Contributor

@wmedvede wmedvede left a comment

LGTM, thanks!

@ricardozanini
Member

@domhanak mind taking a look so we can close this one?

Development

Successfully merging this pull request may close these issues.

Create a documentation section to use Grafana DataSource with SonataFlow Prometheus metrics
3 participants