Validation: treat as error if insufficient nodes #4703
Conversation
cmd/kops/validate_cluster.go (Outdated)

    }

    // Do not use if we are running gossip
Moved into validation
What was the logic behind moving this? This seems to be an improvement of the code, not a fix for the issue.
Wondering if we should put this in another PR. This needs to be tested on both Mac and Linux, since macOS behaves differently with DNS caching.
We're moving the code, not changing it AFAIK.
The reason for moving it is that we want this to be part of validation, I believe, including on a rolling-update.
(force-pushed from dd1e8a1 to d7f14ce)
(force-pushed from 5198296 to 1de1541)
Overall I like it. Two big concerns:
- Lots of changes that are not related to the bug. I would rather have at least two PRs: one that fixes the bug or bugs, another that does improvements.
- You seem to have changed the schema. I love the new format, but it is a breaking change for users. Should we move the schema into the API?
    fmt.Fprint(out, "\nValidation Failed\n")
    fmt.Fprintf(out, "Ready Master(s) %d out of %d.\n", len(validationCluster.MastersReadyArray), validationCluster.MastersCount)
Where did you move this output to? Again, this seems to be code improvements / UX changes not dealing with the bug.
Should we have these changes in another PR?
It's in the failure list. The problem is that our model - particularly for nodes - is richer than a simple count, so these messages would do things like "Ready nodes 10 out of 6" etc.
Frankly, as an operator, that part of the UX was incredibly useful. I would prefer not to remove it.
Yes - I would like to see it in the instance group grid above. It just isn't well-defined here, sadly.
pkg/validation/node_conditions.go (Outdated)
    // It is considered ready if:
    // 1) its Ready condition is set to true
    // 2) doesn't have NetworkUnavailable condition set to true
    func isNodeReady(node *v1.Node) bool {
So a lot of this code was initially taken from the upstream e2e testing. Did you do the same, or did you do a general rewrite?
The code that's left is the functions that were used, and then I collapsed the functions a little bit to be more readable. Logic should be unchanged.
pkg/validation/node_conditions.go (Outdated)
    func isNodeReady(node *v1.Node) bool {
        nodeReadyCondition := findNodeCondition(node, v1.NodeReady)
        if nodeReadyCondition == nil {
            glog.Warningf("v1.NodeReady condition not set on node %s", node.Name)
Can we make this a more human-friendly message? How about:
Node %q is not ready, no condition is set.
I know what v1.NodeReady is, but a lot of non-programmers will not.
The reason for this one is that we don't expect this to be the case: we expect there always to be a NodeReady condition. So it's not that the node is or isn't ready, it's that our assumptions aren't valid. i.e. this isn't a user-facing message, this is an "uh-oh" message for us.
pkg/validation/node_conditions.go (Outdated)
            return false
        }
    } else {
        glog.V(4).Infof("v1.NodeNetworkUnavailable condition not set on node %s", node.Name)
Same comment as above about the log message. You also switched from warning to info? I think all of the messages should be info. Thoughts? A warning to me means something is wrong. A new node can be so new that it does not have a node condition, correct?
This one is V(4) and just diagnostics, this isn't unexpected, but isn't really user-facing.
Missed that, thanks.
        "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
        "k8s.io/kops/pkg/apis/kops"
        "k8s.io/kops/pkg/apis/kops/util"
        "k8s.io/kops/pkg/cloudinstances"
        "k8s.io/kops/pkg/dns"
        "k8s.io/kops/upup/pkg/fi/cloudup"
    )

    // ValidationCluster a cluster to validate.
    type ValidationCluster struct {
So this is a breaking change. I know people are using the YAML and JSON programmatically. I wanted to have an API for this, but I think we could not decide on it. Options?
Let me ask a question. Does this change the validation output?
The yaml / json output is not currently part of our API. If we want to make it part of our API, I think the structure I'm proposing here is a step towards that - we want an arbitrary list of structured validation failures IMO. I'm not sure if the node list really belongs - it is more informational - we could consider splitting that out.
        return validationCluster, fmt.Errorf(validationCluster.ErrorMessage)
    }
    // TODO: Use instance group role instead...
    if n.Role == "master" {
Don't we have helper methods for this check?
I think we do if we switch to the instance group, but that would be a bigger behavioural change. The TODO to do so remains though :-)
@chrislovecnm asked me to give this a test and this seems to work fine for me, upgrading/rolling-over a cluster created via master.
I think this is the last PR blocking 1.9.0-alpha.2. I've split it up into smaller PRs at @chrislovecnm's request.
/hold Merging others that are comprised by this PR
(force-pushed from 1de1541 to afabe66)
(force-pushed from d185ae9 to af3314f)
We switch to use the rolling update logic, which gives us nodes by InstanceGroup.
(force-pushed from af3314f to 63e5133)
OK, split out all the low-hanging fruit. I think this is the last PR blocking 1.9.0-alpha.2. PTAL @mikesplain @chrislovecnm
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: chrislovecnm, justinsb. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing