Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable locksmithd on CoreOS if UpdatePolicy set #4909

Merged
merged 1 commit into from
Apr 10, 2018

Conversation

KashifSaadat
Copy link
Contributor

@KashifSaadat KashifSaadat commented Apr 4, 2018

Fixes #4458

Disable locksmithd.service if running CoreOS, and the following is set within your kops ClusterSpec:

spec:
  updatePolicy: external

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 4, 2018
@KashifSaadat KashifSaadat force-pushed the coreos-disable-updates branch from 9f4a42b to 843b82b Compare April 4, 2018 16:37
@KashifSaadat
Copy link
Contributor Author

/assign @chrislovecnm @justinsb

Wasn't sure whether you'd prefer for this to be wrapped in an API flag? I noticed the flag UpdatePolicy, however the default value for that flag (missing) states applying upgrades automatically, so users would need to explicitly set the value to external if using CoreOS.

@@ -2125,6 +2125,7 @@ func autoConvert_v1alpha1_KubeProxyConfig_To_kops_KubeProxyConfig(in *KubeProxyC
out.HostnameOverride = in.HostnameOverride
out.Master = in.Master
out.Enabled = in.Enabled
out.ProxyMode = in.ProxyMode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason we are missing this ... we will need to get this into the 1.9 release as well. I am pushing a pr.

Copy link
Contributor Author

@KashifSaadat KashifSaadat Apr 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great, cheers 👍 It somehow got past the tests and merged into master, but the test run in this PR failed.

Copy link

@euank euank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya, I work on CoreOS Container Linux and wanted to mention some other options worth considering.

First of all, we're sorry stable broke users here. We do our best to catch issues before they hit users, but the /dev/ path thing wasn't hit in our tests nor reported by users and so slipped through.

We obviously prefer if our users do remain updated and secure, so I think it would be preferable to do something a little less aggressive than masking both update-engine and locksmthid.

The options are basically the following (as discussed in the docs):

  1. Disable update-engine + locksmithd — Updates are not downloaded or applied at any time
  2. Disable locksmithd — Updates are downloaded and prepared, but wait for an alternate update-coordinator (such as CLUO) or other reboot (e.g. operator reboot or crash) to apply.
  3. Configure locksmithd to use the etcd-lock (or off) strategy — Updates will only be applied if the operator further configures locksmithd or the machine is rebooted/crashes.

I think that option 2 may be preferable to option 1 since it makes it easier for a cluster operator to later migrate to using something like CLUO to coordinate updates in a Kubernetes-aware way.
The fact that new machines in the ASG will launch with the old ami will also provide a semi-natural rollback if things do go wrong and the user shuts down misbehaving machines, so breakages can be remediated fairly easily still.

Even though the ASG will still launch older machines, I do think providing an easier migration path to updating them forwards via leaving update-engine there is worthwhile.

As a final note, I think one additional thing that could help make sure updates are less likely to cause issues would be making it easier to run clusters on the beta channel (where users could run a staging / pre-prod cluster and catch issues earlier), or possibly make it possible to mix in some instances running on the beta channel.
I think if that option existed, stable updates would be less likely to cause problems for kops without advance warning since almost all stable changes are promoted through the beta channel.

Let me know if you have questions about the above.
Best,
Euan


manifest.Set("Unit", "Before", "update-engine.service locksmithd.service")
manifest.Set("Service", "Type", "oneshot")
manifest.Set("Service", "ExecStartPre", "/bin/sh -c 'systemctl mask --now update-engine.service && systemctl mask --now locksmithd.service'")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice this is being run for Google's Container-optimized OS as well. Does it have locksmithd.service? If it doesn't, this would fail there, and I'm not sure it does.

There may need to be two separate services

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes good spot, I'll update and test

@KashifSaadat
Copy link
Contributor Author

Hey @euank, many thanks for the detailed feedback! :)

Option 2 does sound like a more sensible approach in this scenario, and we could also add the ability for users to enable the CLUO deployment for their Cluster if they want it (although there's still the case that the ASG would launch an outdated AMI).

I'll make the suggested update to this PR, would also be good to hear others thoughts on CLUO and whether we should add that capability to deploy it for the user within Kops here.

Thanks!
Kash

@KashifSaadat KashifSaadat force-pushed the coreos-disable-updates branch from 28c20a8 to fb3702b Compare April 9, 2018 08:25
@justinsb
Copy link
Member

justinsb commented Apr 9, 2018

We could probably do an e2e test with CoreOS running on the beta channel - I don't think anything is stopping anyone from specifying a beta or alpha AMI, but we didn't learn of this until it was too late.

I guess this is correct; I think we probably should tie it into the existing UpdatePolicy field on the cluster though - by default we should probably continue to update as best we can, at least until we have a more managed update process. I'm not sure that in-place updates make sense on clouds / virtual infrastructure, but we don't currently have an operator that reflects that, and are still planning bare metal support.

@KashifSaadat would you be OK with tweaking this to key off UpdatePolicy? I don't want to surprise existing users that suddenly find their clusters are not getting updates, even if the current update strategy has caused problems..

@chrislovecnm
Copy link
Contributor

@euank thanks for the review btw!

@KashifSaadat KashifSaadat force-pushed the coreos-disable-updates branch from fb3702b to 3d1203f Compare April 10, 2018 12:05
@KashifSaadat KashifSaadat changed the title Disable update-engine by default on CoreOS Disable locksmithd on CoreOS if UpdatePolicy set Apr 10, 2018
@KashifSaadat
Copy link
Contributor Author

@justinsb sure, I've just updated the PR. Users will need to specify the following within their ClusterSpec for this change to take effect (only disabling locksmithd):

spec:
  updatePolicy: external

@gambol99
Copy link
Contributor

gambol99 commented Apr 10, 2018

LGTM

@gambol99
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 10, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gambol99, KashifSaadat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [KashifSaadat,gambol99]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 7b601e6 into kubernetes:master Apr 10, 2018
@KashifSaadat KashifSaadat deleted the coreos-disable-updates branch April 10, 2018 13:35
@jpds
Copy link
Contributor

jpds commented May 9, 2018

This option doesn't appear to have been documented?

@davidarcher
Copy link
Contributor

@jpds it's documented here although it still says that unset means no reboots:

https://github.com/kubernetes/kops/blob/master/docs/arguments.md#updatepolicy

@seh
Copy link
Contributor

seh commented Mar 31, 2021

This patch came up in discussion in the "kops-users" Slack channel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Default setting for CoreOS automatic updates should be off
9 participants