
Add flag to Wait for node to reach ready state #141

Merged

Conversation

@alejandrox1 (Contributor) commented Nov 28, 2018

Add a --wait flag which will make kind block until the master node of the
cluster reaches a ready state.

Resolves #42

/kind feature

Signed-off-by: Jorge Alarcon Ochoa [email protected]

@k8s-ci-robot added the kind/bug label Nov 28, 2018
@k8s-ci-robot added the size/M and cncf-cla: yes labels Nov 28, 2018
@k8s-ci-robot (Contributor)

Hi @alejandrox1. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label Nov 28, 2018
@alejandrox1 (Contributor, Author)

/assign @munnerz

@k8s-ci-robot added the kind/feature label Nov 29, 2018
@munnerz (Member) left a comment

This looks like a great feature to add 😄

I've added a few comments; I think it'd be good to remodel the WaitForControlPlane method to better suit the Node structure it lives on.

/cc @BenTheElder
/kind feature
/ok-to-test

for _, line := range lines {
    // We assume the control plane node contains the "control-plane"
    // substring.
    if strings.Contains(line, "control-plane") {
Member

I think we'd be better off plumbing in the node type, or something similar, to determine the type of node.

We could set something here:

nameOrID: name,

/cc @BenTheElder

Contributor Author

I like that idea. How about isControlPlane bool?
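
A rough sketch of what that field could look like (illustrative only; the merged PR ends up selecting control-plane nodes by label instead):

package nodes

// Node is a handle to a node container. The isControlPlane field is
// hypothetical, sketching the suggestion above; it is not the code merged
// in this PR.
type Node struct {
    nameOrID       string
    isControlPlane bool
}

// IsControlPlane reports whether this handle refers to a control-plane node.
func (n *Node) IsControlPlane() bool {
    return n.isControlPlane
}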

Member

So we're going to do this with the multi-node work (#130) regarding having different kinds; I'll think about the short-term fix for a bit.

@BenTheElder (Member) Nov 29, 2018

We still need a way to detect this from a running node, for consistency. Everything else we use / know about a node is either:

  • explicitly passed through in create
  • derivable at runtime

I think instead Create() can track the name(s) or *Node handles of control-plane node(s) and wait for these at the end of create

EDIT: Actually using a selector instead is probably the right route https://github.com/kubernetes-sigs/kind/pull/141/files#r237320426

pkg/cluster/nodes/node.go (outdated)
@@ -349,6 +351,17 @@ func (cc *createContext) provisionControlPlane(
        }
    }

    // Wait for control plane node to reach Ready status.
    if cc.waitForControlPlane {
        if !node.WaitForControlPlane(time.Now().Add(time.Second * 30)) {
Member

30s should be configurable - perhaps --wait=1m etc?

Member

yes, let's make this configurable

Member

kubeadm has an initial timeout of 40 seconds for the kubelet to become ready post-init, and in parallel it has a healthz check for the API server that runs for 5 minutes (configurable via TimeoutForControlPlane in v1beta1).
https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm#APIServer

Member

So the control plane is up, but that doesn't mean non-control-plane workloads are ready to schedule yet. I think the default should be zero (cluster create is snappy locally), and configurable for CI, where you can decide yourself how long it is worth waiting before just trying to run your tests.

If your tests fail, you need to identify why already, and this way it's up to you how much you want to try to wait.

The wait still won't be a strong guarantee, and I don't think we want to solve that yet; that seems like something that should be solved with a kubeadm command instead of being implemented just for kind.

On manual local usage, you won't be instantly trying to schedule workloads anyhow.

pkg/cluster/context.go (outdated)
// WaitForControlPlane uses kubectl inside the "node" container to check if
// the control plane node status is "Ready". The control plane node is found by
// looking for the node containing the substring "control-plane" in its name.
func (n *Node) WaitForControlPlane(until time.Time) bool {
Member

Logically, I think creating this function as something more like WaitForReady, or even Status, and then having the call-site (i.e. the cluster package) determine which nodes to call this function on makes more sense.

The kubectl command can then be updated to kubectl get node {nodeName} instead of how we currently list all nodes.

Member

We can use a selector instead; kubeadm applies the label node-role.kubernetes.io/master: "".

Member

With regard to a kubectl-based check, I think the Ready state on the node object is a safer bet, because the taints and labels are applied by a kubeadm init phase that can be omitted using kubeadm init --skip-phases=mark-control-plane in 1.13.

On the other hand, I'm pretty sure I've seen a case where a node can be Ready while some pods are failing, so from my experience watching for an expected list of pods is the only real indicator of a healthy node.

Member

We're not waiting to see if it's healthy; we're waiting for scheduling to likely not fail for test pods in https://testgrid.k8s.io/conformance-kind#kind,%20master%20(dev)%20%5Bnon-serial%5D.

I don't think kind should build its own health checking yet, but logically workloads cannot schedule if the nodes are not in the Ready state.

Member

For workloads this is OK.
For things like join --experimental-control-plane, it can fail if the existing etcd cluster is not ready to join yet another etcd node. There are some tricky timeouts around HA, but we shall see.

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Nov 29, 2018
@munnerz (Member) commented Nov 29, 2018

/remove-kind bug

@k8s-ci-robot removed the kind/bug label Nov 29, 2018
@alejandrox1 force-pushed the issue-42-wait-for-node branch from 9d9e612 to 48ef935 on November 29, 2018 00:30
pkg/cluster/context.go (outdated)
@alejandrox1 force-pushed the issue-42-wait-for-node branch from 48ef935 to 11a8a7d on November 29, 2018 03:26
@alejandrox1 (Contributor, Author)

Thank you for the comments. Changes have been made.
Tell me what y'all think about it. Here I went with formatting kubectl's output, since the call to the function is made inside ProvisionControlPlane, which makes the node's name available.
I feel like using the selector "node-role.kubernetes.io/master" would be better for a method that waits for all master nodes (an HA cluster?).

@BenTheElder (Member)

> Tell me what y'all think about it. Here I went with formatting kubectl's output, since the call to the function is made inside ProvisionControlPlane, which makes the node's name available.

I don't think we should pattern match on node names.

> I feel like using the selector "node-role.kubernetes.io/master" would be better for a method that waits for all master nodes (an HA cluster?).

Well that's roughly equivalent to listing all nodes then matching the names, just using an actual identifier instead. Waiting for all of them to be ready is the same as waiting on one control plane to be ready when we have a single control plane node.

It also means we can use the same flag and same code when we do get multi-node.

@alejandrox1 force-pushed the issue-42-wait-for-node branch from 11a8a7d to c66e2f1 on November 30, 2018 01:36
@alejandrox1 (Contributor, Author)

@BenTheElder that makes sense, thank you.
The change has been made.

@alejandrox1 force-pushed the issue-42-wait-for-node branch from c66e2f1 to cf6d7fc on November 30, 2018 01:38
@alejandrox1 force-pushed the issue-42-wait-for-node branch from cf6d7fc to 40b9168 on November 30, 2018 16:32
@alejandrox1 (Contributor, Author)

/retest

@alejandrox1 (Contributor, Author)

/test pull-kind-conformance-parallel

@neolit123 (Member)

/retest
The -parallel job seems to be failing a lot.

@neolit123 (Member) left a comment

Thanks for this extension, @alejandrox1.
I've added a couple of minor comments.

pkg/cluster/context.go
pkg/cluster/context.go (outdated)
pkg/cluster/nodes/nodes.go (outdated)
"--selector=node-role.kubernetes.io/master",
// When the node reaches status ready, the status field will be set
// to true.
"-o=jsonpath='{.items..status.conditions[-1:].status}'",
Member

Like I mentioned in one of the other comments, the better approach is to watch an expected list of pods, but this is also OK in most cases.

Member

The e2e tests already do a variant of this, and it doesn't work properly: the pods exist before CNI is ready.

Member

Given we are hardcoding Weave, we can potentially anticipate it (somehow).

Member

That's also probably temporary though.

@@ -50,6 +52,7 @@ func NewCommand() *cobra.Command {
    cmd.Flags().StringVar(&flags.Config, "config", "", "path to a kind config file")
    cmd.Flags().StringVar(&flags.ImageName, "image", "", "node docker image to use for booting the cluster")
    cmd.Flags().BoolVar(&flags.Retain, "retain", false, "retain nodes for debugging when cluster creation fails")
    cmd.Flags().DurationVar(&flags.Wait, "wait", time.Duration(0), "Wait for master node to be ready")
@neolit123 (Member) Nov 30, 2018

master -> control plane
needs examples such as 30s, 10m

Contributor Author

I added a default in the command-line flag help message which indicates the use of units.
Other than that, to clarify this further, I added it to the user guide to avoid overcrowding the help message.
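
For reference, the --wait flag is a time.Duration, so it accepts Go duration strings such as 30s or 10m. A small illustrative snippet (not part of the PR) showing how such values parse:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Duration flags use Go's duration syntax, e.g. "30s", "10m", "1h".
    for _, v := range []string{"30s", "10m", "1h"} {
        d, err := time.ParseDuration(v)
        if err != nil {
            fmt.Println("invalid duration:", v, err)
            continue
        }
        fmt.Printf("%s parses to %v\n", v, d)
    }
}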

pkg/cluster/nodes/nodes.go
if lines[0] == "'True'" {
    return true
}

Member

It might be worth adding a small time.Sleep() duration in the function; 1 second seems OK to me.
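
For illustration, a minimal sketch of such a poll loop with a one-second sleep between attempts (runKubectl is a hypothetical helper standing in for kind's node exec plumbing; this is not the exact code merged in this PR):

package nodes

import "time"

// waitForControlPlaneReady polls kubectl inside the node container until the
// control-plane node reports Ready or the deadline passes. runKubectl is a
// hypothetical helper assumed to run kubectl with the cluster's admin
// kubeconfig and return the command's output lines.
func waitForControlPlaneReady(until time.Time, runKubectl func(args ...string) ([]string, error)) bool {
    for time.Now().Before(until) {
        lines, err := runKubectl(
            "get", "nodes",
            // Select control-plane nodes by the label kubeadm applies.
            "--selector=node-role.kubernetes.io/master",
            // The last listed condition is Ready; its status becomes "True"
            // once the node is ready.
            "-o=jsonpath='{.items..status.conditions[-1:].status}'",
        )
        if err == nil && len(lines) > 0 && lines[0] == "'True'" {
            return true
        }
        // Back off briefly between attempts, per the suggestion above.
        time.Sleep(time.Second)
    }
    return false
}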

@alejandrox1 force-pushed the issue-42-wait-for-node branch from 40b9168 to 06c74bd on December 3, 2018 17:39
@alejandrox1 (Contributor, Author)

/retest

2 similar comments
@alejandrox1 (Contributor, Author)

/retest

@alejandrox1 (Contributor, Author)

/retest

@BenTheElder (Member) left a comment

/lgtm
/approve

@k8s-ci-robot added the lgtm label Dec 6, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alejandrox1, BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Dec 6, 2018
@k8s-ci-robot merged commit b887098 into kubernetes-sigs:master on Dec 6, 2018
@alejandrox1 deleted the issue-42-wait-for-node branch on January 22, 2019 18:33
stg-0 pushed a commit to stg-0/kind that referenced this pull request Jun 20, 2023