
CoreOS - locksmithd semaphore on updates on local(non clustered) etcd, rebooting all nodes at once. #5407

Closed
tommij opened this issue Jul 5, 2018 · 4 comments

tommij commented Jul 5, 2018

Thanks for submitting an issue! Please fill in as much of the template below as
you can.

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.

Version 1.9.1 (git-ba77c9ca2)

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.

GitVersion:"v1.9.6"

  3. What cloud provider are you using?

AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?
export NAME=redacted.k8s.some.tld
export KOPS_ZONES="us-west-2a,us-west-2b,us-west-2c"
export KOPS_MASTER_ZONES=${KOPS_ZONES}
export KOPS_STATE_STORE=s3://redacted-kops-state
export ADMIN_IP_RANGES="redacted/cidr-1, redacted/cidr-2"
export ADMIN_NODE_SIZE="t2.small"
export NODE_SIZE="t2.micro"
export TERRAFORM_VPC_TARGET="redacted"
export SSH_PUBKEY=$(mktemp)
export COREOS_VERSION="1745.7.0"
echo kops create cluster \
  --zones=${KOPS_ZONES} \
  --master-zones=${KOPS_MASTER_ZONES} \
  --cloud=aws \
  --ssh-access=${ADMIN_IP_RANGES} \
  --admin-access=${ADMIN_IP_RANGES} \
  --node-size=${NODE_SIZE} \
  --master-size=${ADMIN_NODE_SIZE} \
  --node-count=2 \
  --target=terraform \
  --ssh-public-key=${SSH_PUBKEY} \
  --vpc="$TARGET_VPC) \
  --image=coreos.com/CoreOS-stable-${COREOS_VERSION}-hvm \
  --out=modules/kops/${NAME} \
  $NAME
  5. What happened after the commands executed?

Cluster came up as expected; the problem is elsewhere.

  6. What did you expect to happen?

Cluster came up as expected; the problem is elsewhere.

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -o yaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-06-26T11:42:01Z
  name: redacted.k8s.some.tld
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://redacted
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - redacted/32
  - redacted/32
  kubernetesVersion: 1.9.6
  masterPublicName: api.prod-poc.k8s.p3d.in
  networkCIDR: 10.242.0.0/16
  networkID: vpc-48889e31
  networking:
    kubenet: {}
  nonMasqueradeCIDR: redacted/10
  sshAccess:
  - redacted/32
  - redacted/32
  subnets:
  - cidr: 10.242.32.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  - cidr: 10.242.64.0/19
    name: us-west-2b
    type: Public
    zone: us-west-2b
  - cidr: 10.242.96.0/19
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

Can't do it: the cluster is down, and it's irrelevant anyway, as this has to do with CoreOS configuration.

  9. Anything else we need to know?

From https://coreos.com/os/docs/latest/update-strategies.html:

The overarching goal of Container Linux is to secure the Internet's backend infrastructure. We believe that automatically updating the operating system is one of the best tools to achieve this goal.

STRATEGY     DESCRIPTION
etcd-lock    Reboot after first taking a distributed lock in etcd
reboot       Reboot immediately after an update is applied
off          Do not reboot after updates are applied

The strategy defaults to etcd-lock. However, without additional configuration the semaphore lock is taken in a local, non-clustered etcd, which can cause all nodes to reboot at once on upgrade.

To replicate, run "locksmithctl reboot" or "locksmithctl send-need-reboot" concurrently on all nodes.

Expected behavior: nodes reboot one at a time, coordinated by a semaphore in a clustered etcd.

What happens: all nodes reboot immediately, because each master node takes the semaphore in its own local etcd, and the worker nodes don't have an etcd running at all.
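
A quick way to see where the semaphore lives from a shell (a sketch; the endpoint hostname reuses the redacted names from this report):

# Show the reboot semaphore locksmithd would use. With no
# LOCKSMITHCTL_ENDPOINT configured this talks to the node-local etcd,
# and on workers (which run no etcd) it simply fails:
locksmithctl status

# Pointed at the clustered etcd on the masters instead, every node
# sees the same shared semaphore:
locksmithctl --endpoint "http://etcd-a.internal.redacted.k8s.some.tld:4001" status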

Attempts at adding cloud-config to the template fail, as CoreOS doesn't support Multipart/MIME, and making it coexist with the kops bash script is troublesome at best, because CoreOS Ignition runs in the initramfs.

Attempts to mitigate via the (deprecated) coreos-cloudinit --from-file were somewhat successful:

#cloud-config
write_files:
  - path: "/etc/coreos/update.conf"
    permissions: "0644"
    owner: "root"
    content: |
      GROUP=stable
      REBOOT_STRATEGY=etcd-lock
      LOCKSMITHCTL_ENDPOINT=http://etcd-a.internal.redacted.k8s.some.tld:4001,http://etcd-b.internal.redacted.k8s.some.tld:4001,http://etcd-c.internal.redacted.k8s.some.tld:4001
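
Applying it looked roughly like this (a sketch; the file path is illustrative, and coreos-cloudinit is deprecated as noted):

# Apply the cloud-config above in place on a running node:
sudo coreos-cloudinit --from-file=/tmp/update-conf.yaml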

It could also be done via CoreOS cluster discovery, which is probably the least fiddly approach:
https://coreos.com/os/docs/latest/cluster-discovery.html
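
For reference, the discovery route boils down to a cloud-config along these lines (a sketch based on the CoreOS docs, not something kops renders; the token is a placeholder you generate yourself):

#cloud-config
coreos:
  etcd2:
    # Generate a fresh token for a 3-member cluster with:
    #   curl -w "\n" "https://discovery.etcd.io/new?size=3"
    discovery: "https://discovery.etcd.io/<token>"
    advertise-client-urls: "http://$private_ipv4:2379"
    initial-advertise-peer-urls: "http://$private_ipv4:2380"
    listen-client-urls: "http://0.0.0.0:2379,http://0.0.0.0:4001"
    listen-peer-urls: "http://$private_ipv4:2380"

With a shared etcd formed this way, locksmithd's etcd-lock strategy finds a clustered semaphore on the local 4001 endpoint.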

------------- FEATURE REQUEST TEMPLATE --------------------

  1. Describe IN DETAIL the feature/behavior/change you would like to see.

Have kops set up CoreOS etcd clustering (or point locksmithd at the clustered etcd), so that the default automatic updates of CoreOS can't take down all nodes in a cluster at once.

  2. Feel free to provide a design supporting your feature request.
tommij closed this as completed Jul 5, 2018
tommij changed the title from "CoreOS" to "CoreOS - locksmithd semaphore on updates on local(non clustered) etcd, rebooting all nodes at once." Jul 5, 2018
tommij reopened this Jul 5, 2018

KashifSaadat commented Jul 5, 2018

Hi @tommij, you can set the following in your kops ClusterSpec to disable locksmithd when a CoreOS image is used:

  updatePolicy: external

You should be able to use FileAssets in your ClusterSpec to override the update.conf file, for example:

  fileAssets:
  - name: coreos-update-config
    path: /etc/coreos/update.conf
    content: |
      GROUP=stable
      REBOOT_STRATEGY=etcd-lock
      ...
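
For illustration, one way to fill in the endpoint line is to reuse the internal master DNS records from earlier in this report (a sketch only; verify the etcd member names your cluster actually creates):

  fileAssets:
  - name: coreos-update-config
    path: /etc/coreos/update.conf
    content: |
      GROUP=stable
      REBOOT_STRATEGY=etcd-lock
      # Hostnames below are the redacted examples from this report;
      # substitute your own cluster's etcd member records.
      LOCKSMITHCTL_ENDPOINT=http://etcd-a.internal.redacted.k8s.some.tld:4001,http://etcd-b.internal.redacted.k8s.some.tld:4001,http://etcd-c.internal.redacted.k8s.some.tld:4001

Note that updatePolicy: external disables locksmithd entirely, so the two settings are alternatives: either turn automatic reboots off, or keep etcd-lock and point it at a shared etcd.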

tommij commented Jul 5, 2018

Cheers @KashifSaadat - I realise the file could be managed via a FileAsset, but figured that hard-overwriting a file which may gain required fields in the future could be a bad idea.

Turns out that's what the Ignition config does anyway.

I was unaware of the "external" updatePolicy; wouldn't that be counterproductive to running CoreOS in the first place, given that the whole point of CoreOS is to have nodes updated at all times?

All that aside: since CoreOS is considered production-ready by kops, shouldn't one of these solutions be the default (no updates, discovery, or locksmithctl targets pointing at the internal master DNS records)?

I wouldn't normally expect default settings in something production-ready to be able to make my cluster unavailable.

KashifSaadat commented:

No worries. The addition of updatePolicy for CoreOS was a quick fix for users to avoid automatic reboots, but there's a more detailed response from one of the CoreOS maintainers here, listing a few potential options on the way forward: #4909 (review)

We could potentially look at making this option a default for CoreOS, as it would of course be undesired behaviour for kops users to have their nodes all unexpectedly restart at the same time. If you want to take a look at raising a PR that'd be great. :)

tommij commented Jul 6, 2018

Closing this, as it's already covered elsewhere. I may make a PR with an HA documentation update, and will possibly look into a CoreOS provisioning PR.

Thanks @KashifSaadat

tommij closed this as completed Jul 6, 2018