Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation: treat as error if insufficient nodes #4703

Merged
merged 1 commit into from
Mar 21, 2018

Conversation

justinsb
Copy link
Member

We switch to use the rolling update logic, which gives us nodes by
InstanceGroup.

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 18, 2018
}

// Do not use if we are running gossip
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved into validation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was the logic behind moving this? This seems to be an improvement of the code, not fixing the issue.

Wondering if we should put this in another PR. This needs to be tested on both Mac and Linux, since the OS behaves differently with Mac and caching dns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're moving the code, not changing it AFAIK.

The reason for moving it is that we want this to be part of validation, I believe, including on a rolling-update.

@justinsb justinsb force-pushed the int_validation_fix branch from dd1e8a1 to d7f14ce Compare March 18, 2018 06:51
@justinsb justinsb requested a review from chrislovecnm March 18, 2018 06:51
@justinsb justinsb force-pushed the int_validation_fix branch 2 times, most recently from 5198296 to 1de1541 Compare March 18, 2018 06:54
Copy link
Contributor

@chrislovecnm chrislovecnm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like it. Two big concerns

  1. Lots of changes that are not related to the bug. I would rather have at least two PRs. One that fixes the bug or bugs, another that does improvements
  2. You seem to have changed the schema. I love the new format, but it is a breaking change for the users. Should we move the schema into the api?

}

// Do not use if we are running gossip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was the logic behind moving this? This seems to be an improvement of the code, not fixing the issue.

Wondering if we should put this in another PR. This needs to be tested on both Mac and Linux, since the OS behaves differently with Mac and caching dns.

fmt.Fprint(out, "\nValidation Failed\n")
fmt.Fprintf(out, "Ready Master(s) %d out of %d.\n", len(validationCluster.MastersReadyArray), validationCluster.MastersCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did you move this output to? Again seems to be code improvements / UX changes not dealing with the bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have these changes in another PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the failure list. The problem is that our model - particularly for nodes - is richer than a simple count, so these messages would do things like "Ready nodes 10 out of 6" etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly as an operator that part of the ux was incredibly useful. I would prefer to not remove it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I would like to see it in the instance group grid above. It just isn't well-defined here, sadly.

// It is considered ready if:
// 1) its Ready condition is set to true
// 2) doesn't have NetworkUnavailable condition set to true
func isNodeReady(node *v1.Node) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a lot of this code initially was taken from upstream from the e2e testing. Did you do the same, or did you do a general rewrite?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code that's left is the functions that were used, and then I collapsed the functions a little bit to be more readable. Logic should be unchanged.

func isNodeReady(node *v1.Node) bool {
nodeReadyCondition := findNodeCondition(node, v1.NodeReady)
if nodeReadyCondition == nil {
glog.Warningf("v1.NodeReady condition not set on node %s", node.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do more human friendly message?

How about

Node %q is not ready, no condition is set.

I know what v1.NodeReady is, but I know a lot of non programmers will not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this one is that we don't expect this to be the case: we expect there always to be a NodeReady condition. So it's not that the node is or isn't ready, it's that our assumptions aren't valid. i.e. this isn't a user-facing message, this is an "uh-oh" message for us.

return false
}
} else {
glog.V(4).Infof("v1.NodeNetworkUnavailable condition not set on node %s", node.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above about the log message. You also switched from warning to info? I think all of the messages should be info. Thoughts? A warning to me is some is wrong. A new node can be so new that it does not have a node condition, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is V(4) and just diagnostics, this isn't unexpected, but isn't really user-facing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed that thanks

"k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/kops/pkg/apis/kops"
"k8s.io/kops/pkg/apis/kops/util"
"k8s.io/kops/pkg/cloudinstances"
"k8s.io/kops/pkg/dns"
"k8s.io/kops/upup/pkg/fi/cloudup"
)

// ValidationCluster a cluster to validate.
type ValidationCluster struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a breaking change. I know people are using the yaml and json progrmatically. I wanted to have an api for this, but I think we could not decide on it. Options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me ask a question. Does this change the validation output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The yaml / json output is not currently part of our API. If we want to make it part of our API, I think the structure I'm proposing here is a step towards that - we want an arbitrary list of structured validation failures IMO. I'm not sure if the node list really belongs - it is more informational - we could consider splitting that out.

return validationCluster, fmt.Errorf(validationCluster.ErrorMessage)
}
// TODO: Use instance group role instead...
if n.Role == "master" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don’t we have helped metbods for this check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do if we switch to the instance group, but that would be a bigger behavioural change. The TODO to do so remains though :-)

@mikesplain
Copy link
Contributor

@chrislovecnm asked me to give this a test and this seems to work fine for me, upgrading/rolling-over a cluster created via master.

@justinsb
Copy link
Member Author

I think this is the last PR blocking 1.9.0-alpha.2 I've split it up into smaller PRs at @chrislovecnm 's request.

@justinsb justinsb added this to the 1.9.0 milestone Mar 20, 2018
@chrislovecnm
Copy link
Contributor

/hold

Merging others that are conprised by this PR

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 20, 2018
@justinsb justinsb force-pushed the int_validation_fix branch from 1de1541 to afabe66 Compare March 21, 2018 00:59
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 21, 2018
@justinsb justinsb force-pushed the int_validation_fix branch 2 times, most recently from d185ae9 to af3314f Compare March 21, 2018 01:02
@justinsb justinsb removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2018
We switch to use the rolling update logic, which gives us nodes by
InstanceGroup.
@justinsb justinsb force-pushed the int_validation_fix branch from af3314f to 63e5133 Compare March 21, 2018 03:28
@justinsb
Copy link
Member Author

OK split out all the low hanging fruit. I think this is last PR blocking 1.9.0-alpha.2 PTAL @mikesplain @chrislovecnm

@chrislovecnm
Copy link
Contributor

/hold cancel
/test pull-kops-e2e-kubernetes-aws

@chrislovecnm
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 21, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrislovecnm, justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chrislovecnm,justinsb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 4d2bedc into kubernetes:master Mar 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. blocks-next cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants