Add flag to Wait for node to reach ready state #141
Conversation
Hi @alejandrox1. Thanks for your PR. I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @munnerz
This looks like a great feature to add 😄
I've added a few comments; I think it'd be good to remodel the WaitForControlPlane method to better suit the Node structure it lives on.
/cc @BenTheElder
/kind feature
/ok-to-test
pkg/cluster/nodes/node.go
Outdated
```go
for _, line := range lines {
	// We assume the control plane node contains the "control-plane"
	// substring.
	if strings.Contains(line, "control-plane") {
```
I think we'd be better off plumbing in the node type, or something similar, to determine the type of node.
We could set something here:
kind/pkg/cluster/nodes/create.go
Line 88 in 9d9e612
```go
nameOrID: name,
```
/cc @BenTheElder
I like that idea. How about `isControlPlane bool`?
so we're going to do this with the multi-node work (#130) regarding having different kinds; will think about the short-term fix for a bit
we still need a way to detect this from a running node, for consistency. everything else we use / know about a node is either:
- explicitly passed through in create
- derivable at runtime
I think instead Create() can track the name(s) or *Node handles of control-plane node(s) and wait for these at the end of create
EDIT: Actually using a selector instead is probably the right route https://github.com/kubernetes-sigs/kind/pull/141/files#r237320426
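A minimal sketch of the pre-EDIT idea, for illustration only: Create() remembers control-plane *Node handles as it provisions them, then waits on just those at the end. All names here (createContext fields, trackIfControlPlane, WaitForReady) are illustrative stand-ins, not kind's actual code:

```go
package cluster

import "time"

// Node and WaitForReady stand in for kind's node handle and a per-node
// readiness poll; both are stubs in this sketch.
type Node struct{ nameOrID string }

func (n *Node) WaitForReady(until time.Time) bool {
	// poll the node's Ready condition until the deadline (elided)
	return true
}

type createContext struct {
	// controlPlanes collects handles of control-plane nodes as they are
	// provisioned, so create can wait on exactly those at the end.
	controlPlanes []*Node
}

func (cc *createContext) trackIfControlPlane(n *Node, isControlPlane bool) {
	if isControlPlane {
		cc.controlPlanes = append(cc.controlPlanes, n)
	}
}

// waitForControlPlanes runs at the end of create, after provisioning.
func (cc *createContext) waitForControlPlanes(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for _, n := range cc.controlPlanes {
		if !n.WaitForReady(deadline) {
			return false
		}
	}
	return true
}
```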
pkg/cluster/context.go
Outdated
```diff
@@ -349,6 +351,17 @@ func (cc *createContext) provisionControlPlane(
 	}
 }
 
+	// Wait for control plane node to reach Ready status.
+	if cc.waitForControlPlane {
+		if !node.WaitForControlPlane(time.Now().Add(time.Second * 30)) {
```
30s should be configurable - perhaps `--wait=1m` etc?
yes, let's make this configurable
kubeadm has an initial timeout of 40 seconds for the kubelet to ready up post-init, and in parallel it has a healthz check for the API server that runs for 5 minutes (configurable via TimeoutForControlPlane in v1beta1).
https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm#APIServer
so the control plane is up, but that doesn't mean non-control plane workloads are ready to schedule yet. I think default should be zero (cluster create is snappy locally), and configurable for CI where you can decide yourself how long is worth waiting before just trying to run your tests.
If your tests fail, you need to identify why already, and this way it's up to you how much you want to try to wait.
The wait still won't be a strong guarantee, and I don't think we want to solve that yet; that seems like something that should be solved with a kubeadm command instead of being implemented just for kind.
On manual local usage, you won't be instantly trying to schedule workloads anyhow.
pkg/cluster/nodes/node.go
Outdated
```go
// WaitForControlPlane uses kubectl inside the "node" container to check if
// the control plane node status is "Ready". The control plane node is found by
// looking for the node containing the substring "control-plane" in its name.
func (n *Node) WaitForControlPlane(until time.Time) bool {
```
Logically, I think creating this function as something more like `WaitForReady`, or even `Status`, and then having the call-site (i.e. the `cluster` package) determine which nodes to call this function on makes more sense.
The `kubectl` command can then be updated to `kubectl get node {nodeName}` instead of how we currently list all nodes.
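A rough sketch of that reshaping, assuming a hypothetical runCommand helper that execs a command inside the node container and returns its output lines (kind's real exec plumbing differs, so all names here are illustrative):

```go
package nodes

import (
	"strings"
	"time"
)

// Node is abbreviated here; nameOrID and runCommand are stand-ins for
// kind's actual node handle and exec helper.
type Node struct{ nameOrID string }

func (n *Node) runCommand(args ...string) ([]string, error) {
	// exec inside the node container and capture output lines (elided)
	return nil, nil
}

// WaitForReady polls `kubectl get node <name>` inside the node container
// until the node reports Ready or the deadline passes.
func (n *Node) WaitForReady(until time.Time) bool {
	for time.Now().Before(until) {
		lines, err := n.runCommand(
			"kubectl", "--kubeconfig=/etc/kubernetes/admin.conf",
			"get", "node", n.nameOrID,
			"-o=jsonpath={.status.conditions[?(@.type=='Ready')].status}",
		)
		if err == nil && len(lines) > 0 && strings.Contains(lines[0], "True") {
			return true
		}
		time.Sleep(time.Second) // brief pause between polls
	}
	return false
}
```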
we can use a selector instead, kubeadm applies the label node-role.kubernetes.io/master: ""
with regard to a kubectl based check, i think the ready state on the node object is a safer bet, because the taints and labels are a kubeadm init phase that can be omitted using `kubeadm init --skip-phases=mark-control-plane` in 1.13.
on the other hand, i'm pretty sure i've seen a case where a node can be ready while some pods are failing, so from my experience watching for an expected list of pods is the only truth of a healthy node.
we're not waiting to see if it's healthy, we're waiting for scheduling to likely not fail for test pods in https://testgrid.k8s.io/conformance-kind#kind,%20master%20(dev)%20%5Bnon-serial%5D.
I don't think kind should build its own healthchecking yet, but logically workloads cannot schedule if the nodes are not in the ready state.
for workloads this is OK.
for things like `join --experimental-control-plane` it can fail if the existing etcd cluster is not ready to join yet another etcd node. there are some tricky timeouts around HA, but we shall see.
/remove-kind bug
force-pushed from 9d9e612 to 48ef935
force-pushed from 48ef935 to 11a8a7d
Thank you for the comments. Changes have been made.
I don't think we should pattern match on node names.
Well that's roughly equivalent to listing all nodes then matching the names, just using an actual identifier instead. Waiting for all of them to be ready is the same as waiting on one control plane to be ready when we have a single control plane node. It also means we can use the same flag and same code when we do get multi-node.
force-pushed from 11a8a7d to c66e2f1
@BenTheElder that makes sense, thank you.
force-pushed from c66e2f1 to cf6d7fc
force-pushed from cf6d7fc to 40b9168
/retest
/test pull-kind-conformance-parallel
/retest
thanks for this extension @alejandrox1
added a couple of minor comments.
"--selector=node-role.kubernetes.io/master", | ||
// When the node reaches status ready, the status field will be set | ||
// to true. | ||
"-o=jsonpath='{.items..status.conditions[-1:].status}'", |
like i mentioned in one of the comments, the better approach is to watch an expected list of pods, but this is also OK in most cases.
the e2e tests already do a variant of this, and it doesn't work properly. the pods exist before CNI is ready.
given we are hardcoding weave, we can potentially anticipate it (somehow).
That's also probably temporary though.
```diff
@@ -50,6 +52,7 @@ func NewCommand() *cobra.Command {
 	cmd.Flags().StringVar(&flags.Config, "config", "", "path to a kind config file")
 	cmd.Flags().StringVar(&flags.ImageName, "image", "", "node docker image to use for booting the cluster")
 	cmd.Flags().BoolVar(&flags.Retain, "retain", false, "retain nodes for debugging when cluster creation fails")
+	cmd.Flags().DurationVar(&flags.Wait, "wait", time.Duration(0), "Wait for master node to be ready")
```
`master` -> `control plane`; needs examples such as `30s`, `10m`
Added a default in the command-line flag help message, which indicates the use of units.
Other than that, I documented this in the user guide to avoid overcrowding the help message.
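For illustration, a flag registration with a units hint in the help text might look like the sketch below; `flagpole` and `addWaitFlag` are illustrative names and the exact wording in the PR may differ:

```go
package create

import (
	"time"

	"github.com/spf13/cobra"
)

type flagpole struct {
	Wait time.Duration
}

func addWaitFlag(cmd *cobra.Command, flags *flagpole) {
	// time.Duration flags accept unit suffixes, e.g. --wait=30s or --wait=10m
	cmd.Flags().DurationVar(&flags.Wait, "wait", time.Duration(0),
		"wait for control plane node to be ready (0s means no wait; accepts units, e.g. 30s, 10m)")
}
```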
pkg/cluster/nodes/nodes.go
Outdated
```go
if lines[0] == "'True'" {
	return true
}
```
it might be worth adding a small time.Sleep() duration in the function. 1 second seems OK to me.
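Something like this small helper would capture that suggestion, retrying with a 1-second sleep between attempts; `tryUntil` is an illustrative name, not necessarily the code in this PR:

```go
package nodes

import "time"

// tryUntil retries try() with a short sleep between attempts until it
// succeeds or the deadline passes.
func tryUntil(until time.Time, try func() bool) bool {
	for until.After(time.Now()) {
		if try() {
			return true
		}
		time.Sleep(time.Second) // avoid hammering the API server
	}
	return false
}
```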
force-pushed from 40b9168 to 06c74bd
/retest
/retest
/retest
Add --wait flag which will make kind block until the master node of the cluster reaches a ready state.

Resolves kubernetes-sigs#42

/kind feature

Signed-off-by: Jorge Alarcon Ochoa <[email protected]>
force-pushed from 06c74bd to e73d96f
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alejandrox1, BenTheElder

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.