WIP: Added /-/healthy and /-/ready endpoints to all thanos components #656

Closed
wants to merge 8 commits into from


@FUSAKLA FUSAKLA commented Dec 3, 2018

Signed-off-by: Martin Chodur [email protected]
Closes #644

Adds liveness and readiness endpoints to all components.
Adds a prober package, which holds the information on whether the component is healthy and ready.
It can be registered on a Mux or a Router, so it can be used in the metricHTTPListenGroup for components that do not have their own UI, as well as in components that have their own routing.

store
In the store, the initial loading of the cache was blocking the start of the HTTP server, so it couldn't expose the liveness check. Because of that, the initial cache update was moved into the g as an actor, and the readiness of thanos-store is set to true only once the cache has been updated for the first time.

receive
As discussed on Slack, receive had two HTTP interfaces, which were merged together when adding the prober.
As a side effect, this resolves #959

Verification

Tests are passing, and it was tested by starting every Thanos component type.

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch from 98785a9 to f0fbd10 Compare December 3, 2018 00:22
@adrien-f

adrien-f commented Dec 3, 2018

Awesome ! It's true that I only added the healthy route to the Querier, that was selfish 😄 !

@bwplotka bwplotka left a comment

Thanks!

Ok, this is quite confusing, as there are 2 types of health checks: readiness and liveness. In my opinion we should add both if we do NOT want to confuse users. Similar to what Prometheus did here:
prometheus/pushgateway#135
and this discussion: prometheus/pushgateway#105

So basically we need /-/ready and /-/healthy. Currently, for most components the handling would be the same (serve both in the HTTP metric server); however, for store it needs to be more complex:

  • liveness (healthy) needs to run from the beginning
  • readiness (ready) needs to be OK only once we synced all meta files and start serving gRPC requests.
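
The split described in the two bullets can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: probeState, newProbeMux and probe are hypothetical names, and atomic.Bool assumes Go 1.19 or newer. Liveness answers as soon as the handler is registered, while readiness returns 503 until a flag is flipped after the (simulated) initial sync.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// probeState is a hypothetical holder for the readiness flag.
// atomic.Bool makes it safe to set and read from different goroutines.
type probeState struct {
	ready atomic.Bool
}

// newProbeMux registers both probe paths on a fresh mux.
func newProbeMux(s *probeState) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: answers as soon as the HTTP server is up.
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: 503 until the initial sync has finished.
	mux.HandleFunc("/-/ready", func(w http.ResponseWriter, _ *http.Request) {
		if !s.ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET against the mux and returns the status code.
func probe(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &probeState{}
	mux := newProbeMux(s)

	// Before the sync: healthy, but not ready.
	fmt.Println(probe(mux, "/-/healthy"), probe(mux, "/-/ready"))

	// Stands in for "meta files synced and gRPC serving".
	s.ready.Store(true)

	// After the sync: healthy and ready.
	fmt.Println(probe(mux, "/-/healthy"), probe(mux, "/-/ready"))
}
```

In a real store the flag would be set by whatever goroutine finishes the initial sync, which is why it needs to be safe for concurrent use.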

What do you think?

@FUSAKLA

FUSAKLA commented Dec 9, 2018

Again sorry for delay. Thank you for all the comments.

Adding also the /-/ready endpoint would be great.
I'll correct the /-/healthy endpoint not to be blocked by any startup operations.

Regarding the readiness probe:

  • Rule: Not aware of any other condition we should wait for before saying it's ready. Maybe gRPC serving as well?
  • Store: Wait until all meta files are synced and gRPC serving has started.
  • Query: Possibly wait for getHealthyStores? But maybe it should be ready as soon as the UI works.
  • Compactor: Not sure it should even have a readiness probe, since it does not have any API.
  • Sidecar: Here we could check promUp for the ready state, since block shipping would continue even when the component is not ready?

If we agree on the correct way to check for the ready state in every component I'd be happy to add it.

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch 2 times, most recently from d6e4b37 to 16fd343 Compare December 10, 2018 23:26
@FUSAKLA

FUSAKLA commented Dec 10, 2018

So I refactored it to match (I hope) my suggestions in the previous comment.
Now every component should have the /-/healthy and /-/ready endpoints, and the readiness probe should hopefully reflect the state in which all API endpoints of the component, HTTP and gRPC (or just one of them if the other is not present), are ready to serve. The liveness (healthy) probe should be served soon after start, before any blocking operations.

Would something like that be acceptable? I'd add some tests, but those run<Component> functions are huge, and to test the instrumentation endpoints without spinning up the whole node I'd have to split out the HTTP server initialization, which would be a bigger change.

@adrien-f

It feels weird to have registerHealthy and registerReady and only use them for the sidecar and the store components while reimplementing them in ruler and querier. Could these not be used everywhere?

@bwplotka bwplotka left a comment

Ok, so the direction is very nice, but some suggestions.

The major problem is that we have race issues here. Remember that in Go almost nothing is atomic (thread safe) out of the box. Not even a boolean value. We need to make our code concurrency-safe here, as we set/read xxxIsReady from different goroutines.

We could wrap it with a lock or use sync/atomic, but actually we can design something nicer, following the suggestion @adrien-f gave: have a generic method for this everywhere.
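
As a rough illustration of the locking option, the sketch below guards a boolean with a sync.RWMutex so that one goroutine can set it while others read it. The readyFlag type and its Set/Get methods are hypothetical names standing in for an xxxIsReady variable; they are not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// readyFlag is a hypothetical stand-in for an xxxIsReady variable.
type readyFlag struct {
	mtx   sync.RWMutex
	ready bool
}

// Set updates the flag under the write lock.
func (f *readyFlag) Set(v bool) {
	f.mtx.Lock()
	defer f.mtx.Unlock()
	f.ready = v
}

// Get reads the flag under the read lock.
func (f *readyFlag) Get() bool {
	f.mtx.RLock()
	defer f.mtx.RUnlock()
	return f.ready
}

func main() {
	var f readyFlag
	var wg sync.WaitGroup

	// One goroutine plays the start-up path that flips the flag.
	wg.Add(1)
	go func() {
		defer wg.Done()
		f.Set(true)
	}()

	// Several goroutines play concurrent probe requests reading the flag.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = f.Get()
		}()
	}

	wg.Wait()
	// The writer has finished by now, so the flag is set.
	fmt.Println(f.Get())
}
```

Replacing the methods with direct reads and writes of a plain bool and running the program with go run -race would be expected to report a data race, which is the problem described above.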

Let's focus on what is generic:

  • Registration. It is always on the same path, but either on a mux or a router.
  • Readiness and healthiness handling based on the IsHealthy or IsReady methods.

This allows us to define a struct that everyone will use, with the following methods:

type Prober struct {
	readyMtx    sync.Mutex // guards readiness
	readiness   error      // nil means ready
	healthyMtx  sync.Mutex // guards healthiness
	healthiness error      // nil means healthy
}

func NewProbeInRouter(..) *Prober
func NewProbeInMux(..) *Prober

func (p *Prober) IsReady() error {
  p.readyMtx.Lock()
  defer p.readyMtx.Unlock()
  return p.readiness
}

func (p *Prober) Ready() {
  p.NotReady(nil)
}

func (p *Prober) NotReady(err error) {
  p.readyMtx.Lock()
  defer p.readyMtx.Unlock()
  p.readiness = err
}

// etc...

What do you think? (:

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch 2 times, most recently from b103c49 to f518ced Compare January 13, 2019 02:12
@FUSAKLA

FUSAKLA commented Jan 13, 2019

Sorry about the delay; I could not find the time to finish this.
(I resolved all the comments without commenting because the code was completely rewritten, sorry for that)

Thank you so much for all the comments both of you. @bwplotka your suggestions on implementation were great so I implemented it as you suggested. At least I hope I understood you well :)

The Prober should be covered by tests, and there is still a test for the basic HTTP endpoints in main_test.go. All are passing, and I tried building and running all the components, which is also OK.
It works nicely, for example, for a query node without configuration, where it is never ready and returns an error when the Prometheus API is called.

I'd be glad to discuss further whether the points where I'm setting nodes ready and healthy are OK, or whether more should be added or some moved.

Thanks for all the advice!

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch from f518ced to 32d55d7 Compare January 13, 2019 02:28
@bwplotka

@FUSAKLA can we get back to this? Rebase and change the title of the PR to reflect the changes? I think this has been hitting us more recently (:

CC @SuperQ

@FUSAKLA

FUSAKLA commented Mar 18, 2019

ouch.. yep I'll take a look and re-base it so we can finish this off

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch from 05b7eb2 to d505781 Compare March 22, 2019 23:25
@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch from a8d9a27 to 059759c Compare April 15, 2019 10:24
@bwplotka bwplotka left a comment

Thanks for this, but I am still seeing:

  • unresolved comments
  • readiness used in places where healthiness should be used? Lots of inconsistencies IMO

@@ -133,8 +140,9 @@ func runSidecar(
"msg", "failed to fetch initial external labels. Is Prometheus running? Retrying",
"err", err,
)
readinessProber.SetNotReady(err)

Member

So... because we use metricHTTPListenGroup it's ready, and then suddenly not ready here? I think it's quite a nasty race. Being marked ready and then suddenly not means that the container will be restarted; however, we have a retry here.

@FUSAKLA FUSAKLA Apr 22, 2019

Hm, in this case it is a bit unfortunate, that's true.
Being marked not ready does not cause a restart of the container; being not healthy would. But being marked ready too early could cause requests to be sent to the sidecar before it has fetched the external labels.

I'll leave just readinessProber.SetHealthy() there and set the readiness outside of the metricHTTPListenGroup, depending on each component.

Member Author

Moved setting the ready status out of the default HTTP listener:
5e9a4c4

@@ -172,32 +180,34 @@ func runSidecar(
if err := m.UpdateLabels(iterCtx, logger); err != nil {
level.Warn(logger).Log("msg", "heartbeat failed", "err", err)
promUp.Set(0)
readinessProber.SetNotReady(err)

Member

Isn't this healthiness?

Member Author

I wouldn't say so. You don't want to get restarted when Prometheus just doesn't respond to the external labels query, do you?

} else {
// Update gossip.
peer.SetLabels(m.LabelsPB())

promUp.Set(1)
readinessProber.SetReady()

Member

ditto

return errors.Wrap(s.Serve(l), "serve gRPC")
}, func(error) {
}, func(err error) {
readinessProber.SetNotReady(err)

Member

I mean, setting in one function like this is enough

Member

also healthiness

Member Author

I'm not sure I understand correctly what exactly you mean by setting it in one function.
Do you mean dropping the ready-status changes caused by the Prometheus external labels fetch altogether?

Also not sure about readiness vs. healthiness. The sidecar in this case could still be shipping some blocks to object storage, so killing it just because the gRPC interface malfunctions could be too harsh?

@bwplotka

bwplotka commented Apr 15, 2019

@FUSAKLA

> I personally don't like the Store liveness blocked by bucket init otherwise I'd say it's ok?
> I'd be glad for any suggestions, thanks!

Let's fix this in a later PR. It's not trivial.

> ready: Once gRPC starts listening (can change same as prom_up metric)

This is tricky. Why does readiness fail, and not liveness?

Also, I wonder if that is not too flaky... but let's say it's OK.

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch from 059759c to 0de1bba Compare April 22, 2019 13:27
@brancz brancz left a comment

Generally this is looking pretty good. It changes a lot of behavior, and it feels easy to miss something, but I think we can move forward with this. It'd be good if @bwplotka can make the final call.

@xjewer

xjewer commented May 10, 2019

fixes #532

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch from 5e9a4c4 to b51c5a5 Compare May 12, 2019 07:18
@FUSAKLA

FUSAKLA commented May 12, 2019

Rebased on master

@FUSAKLA FUSAKLA force-pushed the fus-add-health-endpoint branch from b51c5a5 to 55ab8a4 Compare May 26, 2019 21:06
return prober
}

// HandleInMux registers readiness and liveness probes to mux.

Member

This comment seems off. The method is called RegisterInRouter.

Member Author

Thanks, it was leftover after refactoring.

f(w, r)
return
}
p.writeResponse(w, p.IsReady, "ready")

Member

There's a small error here. By the time you call this, p.IsReady() might suddenly start indicating that it is ready, right? AFAICT you need to do both of these actions while p.readyMtx is locked.

Member Author

Good point! The chances are really small but still this is a race. Thanks!

Member Author

PTAL if this way it's ok with you

@FUSAKLA

FUSAKLA commented Jun 10, 2019

Just to be clear, we agreed with @bwplotka that I'll split this PR into multiple smaller ones, because it is too big to review and could cause various issues given the number of changes in behavior.
I'll change this to a draft or WIP or something like that until I split it up.

Thanks Bartek for making the call 👍

EDIT: Changing a PR back to a draft is not possible, unfortunately, so I added WIP: to the title for now.

@FUSAKLA FUSAKLA changed the title Added /-/healthy and /-/ready endpoints to all thanos components WIP: Added /-/healthy and /-/ready endpoints to all thanos components Jun 10, 2019
@FUSAKLA

FUSAKLA commented Jul 2, 2019

#1297

@bwplotka

This was split into smaller PRs by @FUSAKLA. Thanks for this! I think this means we can close this one? (:

@bwplotka bwplotka closed this Sep 17, 2019
@FUSAKLA

FUSAKLA commented Sep 17, 2019

Yes definitely to avoid confusion, thanks 👍

Successfully merging this pull request may close these issues.

receive: Thanos receive hangs on SIGINT
Add missing /-/healthy endpoint
8 participants