-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Added /-/healthy and /-/ready endpoints to all thanos components #656
Conversation
98785a9
to
f0fbd10
Compare
Awesome ! It's true that I only added the healthy route to the Querier, that was selfish 😄 ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Ok, this is quite confusing as there are 2 types of healthchecks ready
and liveness
. In my opinion we should add both if we want to NOT confuse users. Similar to what Prometheus did here:
prometheus/pushgateway#135
and this discussion: prometheus/pushgateway#105
So basically we need /-/ready
and /-/healthy
, currently for most components it would be same handling (serve it in http metric server), however for store it needs to be more complex:
- liveness (healthy) needs to run from the beginning
- readiness (ready) needs to be OK only once we synced all meta files and start serving gRPC requests.
What do you think?
Again sorry for delay. Thank you for all the comments. Adding also the Regarding the readiness probe:
If we agree on the correct way to check for the ready state in every component I'd be happy to add it. |
d6e4b37
to
16fd343
Compare
So I refactored it to match (I hope) my suggestions in previous comment. Would something like that be acceptable? I'd add some tests but those |
It feels weird to have ̀registerHealthy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so the direction is very nice, but some suggestions.
The major problem is that we have race issues here. Remember that in Golang almost nothing is atomic (thread safe) out of the box. Not even boolean
value. We need to change our code to be concurrent safe here as we set/read xxxIsReady
from different go routines.
We could wrap it with lock
or sync.Atomic
but actually we can build design something nicer, with the suggestion @adrien-f gave: To have generic method for this everywhere.
Let's focus what is generic. Generic is:
- registration. It is always on the same path but either on mux or router.
- Readiness and Healthiness handling based on
IsHealthy
or IsReady` methods.
This allows us to define struct that everyone will use with following methods:
type Prober struct {
readyMtx sync.Mutex
readiness error
healthyMtx sync.Mutex
healthiness error
}
func NewProbeInRouter(..) *Prober
func NewProbeInMux(..) *Prober
func (p *Prober) IsReady() error {
p.readyMtx.Lock()
defer p.readyMtx.Unlock()
return p.readiness
}
func (p *Prober) Ready() {
p.NotReady(nil)
}
func (p *Prober) NotReady(err error) {
p.readyMtx.Lock()
defer p.readyMtx.Unlock()
p.readiness = err
}
// etc...
What do you think? (:
b103c49
to
f518ced
Compare
Sorry about the delay I cannot find the time to finish this. Thank you so much for all the comments both of you. @bwplotka your suggestions on implementation were great so I implemented it as you suggested. At least I hope I understood you well :) The I'd be glad to discuss more if the points where I'm setting nodes ready and healthy are ok and should be added any more or moved possibly. Thanks for all the advises! |
f518ced
to
32d55d7
Compare
9680def
to
05b7eb2
Compare
ouch.. yep I'll take a look and re-base it so we can finish this off |
05b7eb2
to
d505781
Compare
a8d9a27
to
059759c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, but I am still seeing;
- non resolved comments
- readiness used in places where healthyness should be used? Lot's of inconsistencies IMO
@@ -133,8 +140,9 @@ func runSidecar( | |||
"msg", "failed to fetch initial external labels. Is Prometheus running? Retrying", | |||
"err", err, | |||
) | |||
readinessProber.SetNotReady(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... because we use metricHTTPListenGroup
its Ready, and then suddenly no ready here? I think it's quite nasty race.. As being marked rdy, and then suddenly not, means that container will be restarted, however we have retry here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, in this case it is bit unfortunate that's true.
Being marked not ready does not cause restart of the container that would cause being not healthy. But it could cause requests being sent to the sidecar even when hasn't yet fetched the external labels.
I'll leave just the readinessProber.SetHealthy()
set the readiness outside of the metricHTTPListenGroup
depending on each component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved away setting the ready status from the default http listener
5e9a4c4
@@ -172,32 +180,34 @@ func runSidecar( | |||
if err := m.UpdateLabels(iterCtx, logger); err != nil { | |||
level.Warn(logger).Log("msg", "heartbeat failed", "err", err) | |||
promUp.Set(0) | |||
readinessProber.SetNotReady(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this healthyness
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't say so. You don't want to get restarted when Prometheus just doesn't respond for the external labels query or do you?
} else { | ||
// Update gossip. | ||
peer.SetLabels(m.LabelsPB()) | ||
|
||
promUp.Set(1) | ||
readinessProber.SetReady() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
return errors.Wrap(s.Serve(l), "serve gRPC") | ||
}, func(error) { | ||
}, func(err error) { | ||
readinessProber.SetNotReady(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, setting in one function like this is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also healthyness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand correctly what exactly do you mean by the setting in one function
.
You mean dropping at all changing the ready status because of prom ext labels fetch?
Also not sure about the readiness vs healthyness. The sidecar in this case could be still shipping some buckets to OS so killing it just because the gRPC interface has malfunction could be too harsh?
Let's fix this in later PR. It's not trivial
This is tricky. Why readiness fails? Not liveness? Also wonder if that is not too flaky.. but let's say it's ok |
059759c
to
0de1bba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally this is looking pretty good, but a lot of behavior and it feels easy to miss something, but I think we can move forward with this, but I'd be good if @bwplotka can make a final call.
fixes #532 |
5e9a4c4
to
b51c5a5
Compare
Rebased on master |
b51c5a5
to
55ab8a4
Compare
pkg/prober/prober.go
Outdated
return prober | ||
} | ||
|
||
// HandleInMux registers readiness and liveness probes to mux. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems off. The method is called RegisterInRouter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it was leftover after refactoring.
pkg/prober/prober.go
Outdated
f(w, r) | ||
return | ||
} | ||
p.writeResponse(w, p.IsReady, "ready") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a small error here. By the time you call this. p.IsReady()
might start indicating that it is suddenly ready, right? AFAICT you need to do both of these actions while p.readyMtx
is locked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! The chances are really small but still this is a race. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL if this way it's ok with you
Signed-off-by: Martin Chodur <[email protected]>
Just to be clear, we agreed with @bwplotka that I'll split this PR to multiple smaller ones because this is too big to review and could cause various issues regarding number of changes in behavior. Thanks Bartek for making the call 👍 EDIT: Changing from PR to draft back is not possible unfortunately so added |
0bde2b7
to
b12f2d6
Compare
This was splitted into smaller PRs by @FUSAKLA Thanks for this! I think this means we can close this one? (: |
Yes definitely to avoid confusion, thanks 👍 |
Signed-off-by: Martin Chodur [email protected]
Closes #644
Addition of liveness and readiness endpoints to all components.
Added package
prober
which olds information if the component ishealthy
andready
.It can be registered to
Mux
orRouter
so that it can be used in themetricHTTPListenGroup
in components which does not have own UI or in components that has own routing.store
In the
store
the initial loading of cache was blocking start of the HTTP server thus it couldnt expose the liveness check. Because of that the initial cache update was moved to theg
as an actor and only readiness of the thanos-store is set to true when the cache is updated for the first time.receive
As discussed on slack the receive had two http interfaces which were merged together when adding the prober.
As a side effect it resolves #959
Verification
Tests are passing and it was tested on started every tjanos component type.