fix(*): add readiness & liveness probes #149
Conversation
rsp.S3Buckets = append(rsp.S3Buckets, convertBucket(buck))
}

nsList, err := nsLister.List(labels.Everything(), fields.Everything())
Why not framework.NewInformer, like here, so we don't hit the apiserver unnecessarily?
I'm unfamiliar with k8s.io/kubernetes/pkg/controller/framework, but it looks like it does not hit the API server, which I do need to do. Did I misunderstand something?
Also, if you know of a cheaper operation than listing namespaces (I'm assuming the common case is that there will not be many), I'd love to hear it!
@arschles I will post a snippet for listing namespaces using controller.framework. The idea behind this abstraction is that it uses a watch and an implicit cache, which reduces round trips to the api-server and transfer size.
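For illustration, here is the kind of snippet being described - a rough sketch only, assuming the controller/framework API of the Kubernetes client vendored at the time; the newNamespaceStore name, the 30-second resync period, and the exact import paths are assumptions, not code from this PR:

import (
	"time"

	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/client/cache"
	client "k8s.io/kubernetes/pkg/client/unversioned"
	"k8s.io/kubernetes/pkg/controller/framework"
	"k8s.io/kubernetes/pkg/fields"
)

// newNamespaceStore starts an informer that watches namespaces and keeps a
// local cache in sync, so the health check can read the cached list instead
// of issuing a LIST against the API server on every probe.
func newNamespaceStore(kubeClient *client.Client, stopCh <-chan struct{}) cache.Store {
	store, controller := framework.NewInformer(
		// list+watch namespaces cluster-wide
		cache.NewListWatchFromClient(kubeClient, "namespaces", api.NamespaceAll, fields.Everything()),
		&api.Namespace{},
		30*time.Second, // resync period
		framework.ResourceEventHandlerFuncs{},
	)
	go controller.Run(stopCh)
	return store
}

The health server would then read store.List() from the cache rather than calling nsLister.List(...) on every request.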
That would be great, thanks. I'd definitely like to reduce general load on the k8s master.
Is this still WIP? The WIP in the title and the "awaiting review" tag are giving me mixed signals.
@krancour sorry about that. Yes, still in progress. I changed the label to indicate so.
Tested myself. To test yourself, spin up a deis-dev cluster and
@krancour this is now ready for review
type Circuit struct {
	bit uint32
}
why not byte or bool?
see below - the atomic package has no facilities for manipulating either one of those types
Okay, I didn't know that; just trying to learn.
Basically, I'm using atomic here because the circuit can "get away" with simple bit flipping, and atomic operations can do that more cheaply than a mutex.
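For concreteness, a minimal sketch of what an atomic-based circuit can look like - the method names and the OpenState/ClosedState constants are illustrative, not necessarily what this PR ships:

import "sync/atomic"

// CircuitState represents the state of a Circuit.
type CircuitState uint32

const (
	// ClosedState means the switch is off.
	ClosedState CircuitState = 0
	// OpenState means the switch is on.
	OpenState CircuitState = 1
)

// Circuit is a concurrency-safe on/off switch. The single state bit lives in
// a uint32 because sync/atomic has no operations on bool or byte.
type Circuit struct {
	bit uint32
}

// Open flips the circuit on.
func (c *Circuit) Open() {
	atomic.StoreUint32(&c.bit, uint32(OpenState))
}

// Close flips the circuit off.
func (c *Circuit) Close() {
	atomic.StoreUint32(&c.bit, uint32(ClosedState))
}

// State reads the current state without taking a lock.
func (c *Circuit) State() CircuitState {
	return CircuitState(atomic.LoadUint32(&c.bit))
}

Because each operation is a single atomic load or store, readers and writers never block each other.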
Why not modify fetcher to serve the health checks? Do we need another HTTP server to handle health serving? Fetcher already has a health check handler.
Anyway, we are thinking of moving away from fetcher, so please ignore my previous comment.
healthSrvCh := make(chan error)
go func() {
	if err := healthsrv.Start(cnf.HealthSrvPort, kubeClient.Namespaces(), s3Client, circ); err != nil {
		healthSrvCh <- err
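Continuing the pattern above, boot.go can then block on both error channels and exit non-zero if either server fails. This is a sketch with hypothetical names (runUntilFailure, sshServerCh), not the literal code from the PR:

import "fmt"

// runUntilFailure blocks until either long-running server reports an error,
// and returns that error so main can log it and exit with a non-zero code.
func runUntilFailure(healthSrvCh, sshServerCh <-chan error) error {
	select {
	case err := <-healthSrvCh:
		return fmt.Errorf("health check server failed (%s)", err)
	case err := <-sshServerCh:
		return fmt.Errorf("SSH server failed (%s)", err)
	}
}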
Do we have to check whether the namespace is present or not? The builder won't start if the namespace is not present or the k8s cluster is not working.
Well, the goal here is to check that the k8s API is accessible. Given that the builder has started, we can assume that the namespace under which it runs (also specified in POD_NAMESPACE) exists. The call that the health server makes is to just get a list of namespaces, and the goal there is only to see if the master is accessible.
I think @aledbf had an idea for a cheaper method of determining whether the API server was accessible, but if you have ideas I'd like to hear them too.
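As a sketch of the check being discussed here - the NamespaceLister interface name is an assumption modeled on the nsLister variable above, with kubeClient.Namespaces() being what would satisfy it:

import (
	"fmt"

	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/fields"
	"k8s.io/kubernetes/pkg/labels"
)

// NamespaceLister is the minimal surface the health server needs from the
// Kubernetes client so it can be mocked in unit tests.
type NamespaceLister interface {
	List(label labels.Selector, field fields.Selector) (*api.NamespaceList, error)
}

// checkK8s returns an error if the API server can't be reached.
func checkK8s(nsLister NamespaceLister) error {
	if _, err := nsLister.List(labels.Everything(), fields.Everything()); err != nil {
		return fmt.Errorf("listing namespaces (%s)", err)
	}
	return nil
}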
@arschles if the goal is just to check the status of the API server, why not just run:
res := c.Get().AbsPath("/healthz").Do()
if res.Error() != nil {
	// not possible to reach the server
}
@aledbf that's much better - thanks!
@aledbf here's the rendered error when that code runs:
couldn't get version/kind; json parse error: invalid character 'o' looking for beginning of value
I'd like to keep this code as-is for now, and put in an issue to switch to the /healthz approach later. Objections?
Objections?
None
If the k8s API is not accessible, how is that the builder's fault? Also, it doesn't make sense to me to check whether the API is accessible, since the builder won't start in the first place if it isn't.
Also, liveness probes shouldn't be checking S3 API accessibility.
before, we ran ‘glide nv’ before *all* commands, but it was only used in the ‘test’ target. this commit only runs it as part of that target
- containerPort: 3000
  name: fetcher
Looks like we missed a spot in #171 :)
Thanks - removed in 958e0b1
The fetcher was removed in deis#171, but the port wasn’t removed from the manifest in that PR. This removes it
)

// CircuitState represents the state of a Circuit
type CircuitState uint32
Maybe I'm not understanding the lower-level synchronization types here, but this is from sync/atomic's docs, which I read while trying to understand this bit of code:
These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package. Share memory by communicating; don't communicate by sharing memory.
Not saying this should be refactored (this looks good to me), but I'm curious about your decision to go this route instead of using a combination of a boolean and mutual exclusion locks. Looking at the code, it looks like we only need two circuit states: open and closed.
Yup, good question.
A bool/sync.RWMutex or channel-based implementation would work just as well here, and both would fit fine into this abstraction. I went the sync/atomic route mainly because there's no need to protect a critical section of logic - just the mutation of a single boolean - and sync/atomic does concurrency-safe bit-twiddling well.
Basically, I originally checked this code in thinking it would be simplest, not because I had any performance or specific sync/atomic benefits in mind. Now that I've gotten a few questions on it, I wonder if it'd be best to refactor it using a mutex, which seems to be more familiar. Thoughts?
Here's some more on sync/atomic if you care:
Atomic integer and pointer manipulation usually lends itself to lower-level systems problems like databases, etc., but the core concept still applies here - atomically manipulating bits. We're just approximating that single bit as a uint32.
Also, on many systems, the sync/atomic approach is more performant in a few different regards, but that's getting into scheduler and OS-level implementation details, which is where I get fuzzy anyway.
Now that I've gotten a few questions on it, I wonder if it'd be best to refactor it using a mutex, which seems to be more familiar. Thoughts?
Honestly it's all bikeshedding, which I hate. I'm quite happy with this implementation, and it serves its purpose as required, so it looks good to me. This was more of an educational exercise to understand why you chose this implementation is all. :)
On the plus side, I just learned about a new package today! 🎉
Alrighty, thanks. Enjoy your new atomicity?
@bacongobbler are you referring to the
Sorry, that comment was about refactoring out pkg/log to use the standard log package, as in this commit. GitHub doesn't really show well that I was commenting on that commit.
Ah, got it. I see the tiny text where it says you commented on the commit now. Filed deis/pkg#19 to track.
Code LGTM
Gonna hold off on merging this until the discussion at #149 (comment) is complete.
FWIW, here is my view on health checking logic:
Neither probe is about establishing "fault"; the purpose of the probes is to indicate that the system believes itself to be in a position to fulfill its service contract. That's it. If it can't reach a database, cache, API endpoint, etc., it should fail its check. That said, I am a little concerned with probes being dependent on the state of Kubernetes itself, so the k8s API service availability is sort of a gray area. Originally, I was opposed to failing healthz based on k8s API server problems, but I've changed my mind. Here's the use case that caused me to change my mind:
In this case, builder SHOULD fail because it can no longer fulfill its service contract. If a policy was changed, k8s itself will restart the pod and it will come up successfully. Otherwise, it will go into crash backoff, which is exactly what ought to happen.
A liveness probe should not fail if k8s is down or s3 is unreachable, and here is why: a liveness probe will cause a pod to restart when those external components fail, which in turn means the user can not even do a git push. As a user (developer) I want to do a git push and be told what's wrong. Builder is alive and can give good information back to the user; making the liveness probe depend on 3rd-party things is bad UX, since the whole builder becomes unavailable at that point. I'm proposing we do the deeper health check during the git push instead. I'd like to hear the opposite side on how we'd handle the scenario of builder being down and telling the developers (and operators) what is going on, and how they would effectively debug things when the builder is effectively left in a "running but not ready" state until s3 / k8s / etc. comes up.
To chime in with my thoughts as well:
This is great for an administrator, but not for a user. In the current implementation the user has no view into why the builder is down. If we were to go down this route, a status page (similar to http://status.docker.com/) would be invaluable for users to see what systems are down, bridging the communication gap between users and administrators.
^^ Another solution proposed offline was to gracefully fail a git push.
Ok, after extensive internal discussion, consensus is as follows:
fix(*): add readiness & liveness probes
Fixes #92
Fixes #89
Ref deis/deis#4809
When this is done, merge deis/charts#93
PR Overview
- Adds a health check endpoint (/healthz), intended to be used by the livenessProbe and readinessProbe.
- The health check server takes two interfaces - one for S3 interaction and one for Kubernetes API interaction - so that those two components can be mocked and the server can be more easily unit tested.
- The /healthz endpoint does all three checks concurrently, and only fails if one or more of them don't complete before a hard-coded timeout, or any of them fail before the timeout (see the sketch after this list).
- Adds a (github.com/deis/builder/pkg/sshd).Circuit type, which acts as a simple, concurrency-safe on/off switch for communication across concurrency barriers. A Circuit is used to allow the SSH server to tell the health check server when it's ready to serve.
- Modifies boot.go to serve the SSH server and the health check server concurrently, but to kill the process with an error code if either fails.
- Adds the probes to the deis-builder-rc.yaml manifest.
- Modifies the Makefile to only look for non-vendored packages when make test is run, which makes all runs faster.
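As referenced in the list above, here is a sketch of the concurrent fan-out with a timeout - the checkAll name, the timeout handling, and the error messages are illustrative, not the PR's exact handler code:

import (
	"errors"
	"time"
)

// checkAll runs every check in its own goroutine and returns nil only if all
// of them succeed before the timeout; otherwise it returns the first error.
func checkAll(timeout time.Duration, checks ...func() error) error {
	results := make(chan error, len(checks)) // buffered so goroutines never leak
	for _, check := range checks {
		go func(check func() error) {
			results <- check() // nil on success
		}(check)
	}
	deadline := time.After(timeout)
	for range checks {
		select {
		case err := <-results:
			if err != nil {
				return err // fail fast on the first failed check
			}
		case <-deadline:
			return errors.New("health checks timed out")
		}
	}
	return nil
}

The /healthz handler would then call something like checkAll(timeout, checkS3, checkK8s, checkCircuit) and respond with http.StatusServiceUnavailable on a non-nil error.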
Still TODO:
- healthZHandler
- TestOpenCloseConcurrent
- make test output: pkg/sshd/server_test.go:106: not enough arguments in call to Serve