-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable locksmithd on CoreOS if UpdatePolicy set #4909
Conversation
9f4a42b
to
843b82b
Compare
/assign @chrislovecnm @justinsb Wasn't sure whether you'd prefer for this to be wrapped in an API flag? I noticed the flag UpdatePolicy, however the default value for that flag (missing) states applying upgrades automatically, so users would need to explicitly set the value to |
843b82b
to
28c20a8
Compare
@@ -2125,6 +2125,7 @@ func autoConvert_v1alpha1_KubeProxyConfig_To_kops_KubeProxyConfig(in *KubeProxyC | |||
out.HostnameOverride = in.HostnameOverride | |||
out.Master = in.Master | |||
out.Enabled = in.Enabled | |||
out.ProxyMode = in.ProxyMode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason we are missing this ... we will need to get this into the 1.9 release as well. I am pushing a pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok great, cheers 👍 It somehow got past the tests and merged into master, but the test run in this PR failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, I work on CoreOS Container Linux and wanted to mention some other options worth considering.
First of all, we're sorry stable broke users here. We do our best to catch issues before they hit users, but the /dev/
path thing wasn't hit in our tests nor reported by users and so slipped through.
We obviously prefer if our users do remain updated and secure, so I think it would be preferable to do something a little less aggressive than masking both update-engine and locksmthid.
The options are basically the following (as discussed in the docs):
- Disable update-engine + locksmithd — Updates are not downloaded or applied at any time
- Disable locksmithd — Updates are downloaded and prepared, but wait for an alternate update-coordinator (such as CLUO) or other reboot (e.g. operator reboot or crash) to apply.
- Configure locksmithd to use the etcd-lock (or off) strategy — Updates will only be applied if the operator further configures locksmithd or the machine is rebooted/crashes.
I think that option 2 may be preferable to option 1 since it makes it easier for a cluster operator to later migrate to using something like CLUO to coordinate updates in a Kubernetes-aware way.
The fact that new machines in the ASG will launch with the old ami will also provide a semi-natural rollback if things do go wrong and the user shuts down misbehaving machines, so breakages can be remediated fairly easily still.
Even though the ASG will still launch older machines, I do think providing an easier migration path to updating them forwards via leaving update-engine there is worthwhile.
As a final note, I think one additional thing that could help make sure updates are less likely to cause issues would be making it easier to run clusters on the beta channel (where users could run a staging / pre-prod cluster and catch issues earlier), or possibly make it possible to mix in some instances running on the beta channel.
I think if that option existed, stable updates would be less likely to cause problems for kops without advance warning since almost all stable changes are promoted through the beta channel.
Let me know if you have questions about the above.
Best,
Euan
nodeup/pkg/model/update_service.go
Outdated
|
||
manifest.Set("Unit", "Before", "update-engine.service locksmithd.service") | ||
manifest.Set("Service", "Type", "oneshot") | ||
manifest.Set("Service", "ExecStartPre", "/bin/sh -c 'systemctl mask --now update-engine.service && systemctl mask --now locksmithd.service'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice this is being run for Google's Container-optimized OS as well. Does it have locksmithd.service? If it doesn't, this would fail there, and I'm not sure it does.
There may need to be two separate services
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes good spot, I'll update and test
Hey @euank, many thanks for the detailed feedback! :) Option 2 does sound like a more sensible approach in this scenario, and we could also add the ability for users to enable the CLUO deployment for their Cluster if they want it (although there's still the case that the ASG would launch an outdated AMI). I'll make the suggested update to this PR, would also be good to hear others thoughts on CLUO and whether we should add that capability to deploy it for the user within Kops here. Thanks! |
28c20a8
to
fb3702b
Compare
We could probably do an e2e test with CoreOS running on the beta channel - I don't think anything is stopping anyone from specifying a beta or alpha AMI, but we didn't learn of this until it was too late. I guess this is correct; I think we probably should tie it into the existing @KashifSaadat would you be OK with tweaking this to key off |
@euank thanks for the review btw! |
fb3702b
to
3d1203f
Compare
@justinsb sure, I've just updated the PR. Users will need to specify the following within their ClusterSpec for this change to take effect (only disabling locksmithd):
|
LGTM |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gambol99, KashifSaadat The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This option doesn't appear to have been documented? |
@jpds it's documented here although it still says that unset means no reboots: https://github.com/kubernetes/kops/blob/master/docs/arguments.md#updatepolicy |
This patch came up in discussion in the "kops-users" Slack channel. |
Fixes #4458
Disable
locksmithd.service
if running CoreOS, and the following is set within your kops ClusterSpec: