Monitoring Grafana #3302
Comments
++ |
👍 |
make sure the health url does not generate sessions |
👍 |
+1, this would be very useful for running Grafana behind a load balancer: the load balancer would call the /health endpoint to verify that Grafana returns HTTP 200 OK. |
I've put together something dead simple, but I'm not particularly happy with it at the moment. If anyone would like to take a look at current state vs master: master...theangryangel:feature/health_check It returns something like:
For the database check, I was originally returning some stats, but I've cut that out. I could switch the query to something much simpler, like "select 1", and check that it doesn't error. Not sure if it's worth it. I'm not particularly happy with the session check either. There doesn't seem to be an easy way to test it without standing up a test macaron server and recover()ing from the panic that it would throw when starting a session provider, or modifying macaron/session to add a test feature to each of the providers. As it is right now, it irritatingly returns a Set-Cookie header, which I don't particularly want. I'd appreciate some input on where to take this from someone more experienced with macaron 😞 Checking data sources doesn't seem particularly sane to attempt through this, given how Grafana is written. It's probably more sane to add that to your regular monitoring system. |
I was facing the same issue and, as a workaround, I use an API call from the load balancer with a dedicated authentication API key. I'm using HAProxy, which has a useful "hidden" feature of setting custom HTTP headers in
(I need to use HTTP/1.0 rather than 1.1, since the latter requires setting Host header and I can't get it dynamically in HAProxy config).
|
Any progress or PR on this issue? |
+1 |
I would split this into separate /liveness and /readiness endpoints, as is best practice in Kubernetes. /liveness only indicates whether Grafana itself is up and running; /readiness indicates whether it's ready to receive traffic and checks whether its dependencies are reachable. In Kubernetes, the liveness endpoint is probed, and when it fails to respond with 200 OK a number of times, the container is killed and replaced with a new one. The readiness endpoint is used to make the container part of a service and send traffic its way, like adding it to and removing it from a load balancer. |
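The split described above can be sketched in a few lines. This is only an illustration, not Grafana code: the function names `liveness` and `readiness` and the dependency names passed in are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical dependency probe: returns True when the dependency is reachable.
Check = Callable[[], bool]


def liveness() -> int:
    """The process is alive if it can answer at all, so always report 200."""
    return 200


def readiness(dependencies: Dict[str, Check]) -> Tuple[int, Dict[str, bool]]:
    """Report 200 only when every dependency probe succeeds, otherwise 503."""
    results = {}
    for name, probe in dependencies.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as an unreachable dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

Keeping liveness free of dependency checks means a database outage takes the instance out of rotation without also getting the container killed.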
+1 |
what about adding a /metrics Prometheus endpoint? |
+1 |
For whoever needs health checks on some services like Amazon ECS: |
+1 |
In the mean time if you're only looking for a simple |
I think everyone knows how to technically implement this, but the point is to explicitly support monitoring of service health, including external dependencies.
On Dec 5, 2016, at 4:09 PM, Hunter Satterwhite ***@***.***> wrote:
If you're looking for a simple HTTP code: 200, then just use /login. My colleague and I just deployed Grafana to a Kubernetes cluster and using that endpoint worked just fine for the liveness/readiness probes. Also works for the Google Compute Engine load balancer.
|
I'd like to add our specific use case: we need a simple HTTP endpoint for checking if a user can login and display graphs. I know that we can use the static resources and endpoints such as |
+1 to this.
…On Mon, Dec 5, 2016 at 11:51 PM, Philip Wernersbach < ***@***.***> wrote:
I'd like to add our specific use case: we need a simple HTTP endpoint for
checking if a user can login and display graphs. I know that we can use the
static resources and endpoints such as /login to work around the absence
of this, but we really need something that checks that the Grafana
internals are running as expected. We don't necessarily need status checks
for retrieving data from data sources, as we have separate health checks
for those.
|
So there is currently in 4.0 a /api/metrics endpoint with some internal metrics. But the issue requests something like this
It would be good to have a more detailed description of what is expected here. Should the API health call do a live check of all data sources in all orgs? Should it be done on the fly as the /health API call is made? |
@torkelo going to toss out an idea but definitely think /health should allow for both grafana-server as well as installed plugins to register arbitrary things to report on:
By default, health checks perform live checks of all things when endpoint is called. If people want to isolate health checks to specific things, you can do something like elasticsearch does for cluster health. When thing is an external service (authorization, database, etc), then connectivity test is done at the minimum and any other sanity check that is reasonable for thing (e.g. SELECT 1 for database, LDAP bind test for authorization, etc). Having output like this will allow monitoring checks to check holistically for issues while finding specific problems and output accordingly. |
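As a rough sketch of the registration idea above: the class `HealthRegistry`, its `register` and `run` methods, and the report layout are all hypothetical names chosen for illustration, not an existing Grafana or plugin API. Every registered check runs live on each call, and `only` narrows the run to specific things.

```python
from typing import Callable, Dict, Iterable, Optional


class HealthRegistry:
    """Collects named checks; core code and plugins would both call register()."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], None]] = {}

    def register(self, name: str, check: Callable[[], None]) -> None:
        # A check signals failure by raising; returning normally means healthy.
        self._checks[name] = check

    def run(self, only: Optional[Iterable[str]] = None) -> dict:
        """Run all checks live, or just the ones named in `only`."""
        selected = set(only) if only is not None else set(self._checks)
        report = {}
        for name in sorted(selected & set(self._checks)):
            try:
                self._checks[name]()
                report[name] = {"status": "ok"}
            except Exception as exc:
                report[name] = {"status": "failing", "error": str(exc)}
        healthy = all(item["status"] == "ok" for item in report.values())
        return {"healthy": healthy, "checks": report}
```

A monitoring check could then alert on the top-level `healthy` flag and use the per-check entries to report which dependency is at fault.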
+1 |
@torkelo sorry for the delayed answer, I just saw your questions. TL;DR The end point (or end points) used to monitor Grafana should answer 2 questions with details: "configuration intents" is key here. What I mean by intent is that when, for example, the admin adds as a data source she expects it to be available regardless of whether or not the saved configuration is right. Thus, if a configured data source is not available to Grafana, the monitoring end point should say so and why, in the same fashion the extremely useful "test" button works. It helps me to think in terms of a plane taking off: first I need to know the plane has finished taking off and is in the air, then I need to know the plane is flying towards its destination as expected (let's not get into what "reaching cruise altitude" means ;-) ). This can somewhat be compared to the /live /ready others have pointed out, or /health (1) /state (2) of the Elasticsearch model, or /health and /info of Sensu (3). Now, regarding the details of each answer, here are my (non-exhaustive) initial thoughts:
B details:
There is much more that can go in B, which is why breaking the monitoring into 2 end points might make more sense, meh. As to how to go about what happens when the end point is being queried (on the fly, APIs, etc.), I would defer to whoever ends up implementing it. A couple of (obvious?) pieces of advice, though:
(1) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html |
So 4.2.0 just came out and there still is no way to probe the service? (think k8s cluster) |
@torkelo I think @dynek has a point, this is not optional anymore. Whether it's a new section in the docs dedicated to "how to monitor Grafana", documenting what can be done today with the existing instrumentation (e.g. leveraging the admin or metrics page), or a fully fledged dedicated API like in this proposal, we need something yesterday. |
+1 |
…creates sessions, returns db status as well. #3302
Added a simple http endpoint to check grafana health:
If database (mysql/postgres/sqlite3) is not reachable it will return "failing" in the The most important thing about this endpoint is that it will never cause sessions to be created (Something other api calls might do if you do not call them with an api key or basic auth). |
Wouldn't it be best to return with status code 503 when the database is unreachable? |
Kubernetes uses:
|
Yes, I think 503 status code when db status failed is best, will update |
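To make the agreed behaviour concrete, here is a minimal sketch of a health response that maps a failing database check to 503. It is not the actual Grafana handler: `health_response` and the `database` key are assumptions for illustration, `sqlite3` stands in for whichever database is configured, and the check is the simple "SELECT 1" mentioned earlier in the thread.

```python
import json
import sqlite3
from typing import Tuple


def health_response(conn: sqlite3.Connection) -> Tuple[int, dict, str]:
    """Return (status code, headers, JSON body) for a health request."""
    try:
        # Cheapest possible round trip: succeeds only if the database answers.
        conn.execute("SELECT 1").fetchone()
        database = "ok"
    except sqlite3.Error:
        database = "failing"
    status = 200 if database == "ok" else 503
    # No session is touched here, so no Set-Cookie header is ever produced.
    headers = {"Content-Type": "application/json"}
    return status, headers, json.dumps({"database": database})
```

Returning 503 lets load balancers and Kubernetes probes react to the status code alone, without parsing the body.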
The 503 means the |
@JorritSalverda you could probably use |
|
We typically have aggressive readiness checks and relaxed liveness checks: 1 second and 1 failure for readiness, versus 60 seconds, 10 failures, and 1 success for liveness. This allows for responsive rerouting when there is an issue, but at the same time, if self-recovery is possible, it prevents unnecessary pod restarts. A persistent DB issue would still cause a restart, which might actually help if it was due to some bad container state. |
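The effect of those thresholds can be illustrated with a small simulation. This is a simplified model written for this example, not Kubernetes code: `probe_states` and its parameters are hypothetical, and probe timing (the 1-second versus 60-second period) is not modelled, only the consecutive failure and success counts.

```python
from typing import Iterable, List


def probe_states(results: Iterable[bool], failure_threshold: int,
                 success_threshold: int = 1, initial: bool = True) -> List[bool]:
    """Replay probe results and return the healthy state after each probe."""
    healthy = initial
    failures = successes = 0
    states = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            # Recover only after enough consecutive successes.
            if not healthy and successes >= success_threshold:
                healthy = True
        else:
            failures += 1
            successes = 0
            # Flip to unhealthy only after enough consecutive failures.
            if healthy and failures >= failure_threshold:
                healthy = False
        states.append(healthy)
    return states
```

With a short outage, a failure threshold of 1 takes the pod out of rotation immediately, while a threshold of 10 never trips, so the pod is not restarted.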
Document the health check implemented in grafana#3302 (and grafana#935), see grafana#3302 (comment)
@finkr |
@suridaddy : it might be easier to visit the Grafana community forums or the more interactive support channels along with more information to troubleshoot your problem. This issue is for feature / improvement and is closed. |
It's time to monitor the monitoring! It'd be great to have a /status or /health endpoint that returns grafana health data as json.
Things I'd like to get from a status endpoint are:
e.g:
/status
{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }