
CoreOS - locksmithd semaphore on updates on local(non clustered) etcd, rebooting all nodes at once. #5407

Closed
tommij opened this issue Jul 5, 2018 · 4 comments

tommij commented Jul 5, 2018

Thanks for submitting an issue! Please fill in as much of the template below as
you can.

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.

Version 1.9.1 (git-ba77c9ca2)

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.

GitVersion:"v1.9.6"

  3. What cloud provider are you using?

AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?
export NAME=redacted.k8s.some.tld
export KOPS_ZONES="us-west-2a,us-west-2b,us-west-2c"
export KOPS_MASTER_ZONES=${KOPS_ZONES}
export KOPS_STATE_STORE=s3://redacted-kops-state
export ADMIN_IP_RANGES="redacted/cidr-1, redacted/cidr-2"
export ADMIN_NODE_SIZE="t2.small"
export NODE_SIZE="t2.micro"
export TERRAFORM_VPC_TARGET="redacted"
export SSH_PUBKEY=$(mktemp)
export COREOS_VERSION="1745.7.0"
echo kops create cluster \
  --zones=${KOPS_ZONES} \
  --master-zones=${KOPS_MASTER_ZONES} \
  --cloud=aws \
  --ssh-access=${ADMIN_IP_RANGES} \
  --admin-access=${ADMIN_IP_RANGES} \
  --node-size=${NODE_SIZE} \
  --master-size=${ADMIN_NODE_SIZE} \
  --node-count=2 \
  --target=terraform \
  --ssh-public-key=${SSH_PUBKEY} \
  --vpc="$TARGET_VPC) \
  --image=coreos.com/CoreOS-stable-${COREOS_VERSION}-hvm \
  --out=modules/kops/${NAME} \
  $NAME
  5. What happened after the commands executed?

Cluster came up as expected; the problem is elsewhere.

  6. What did you expect to happen?

Cluster came up as expected; the problem is elsewhere.

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -o yaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-06-26T11:42:01Z
  name: redacted.k8s.some.tld
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://redacted
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - redacted/32
  - redacted/32
  kubernetesVersion: 1.9.6
  masterPublicName: api.prod-poc.k8s.p3d.in
  networkCIDR: 10.242.0.0/16
  networkID: vpc-48889e31
  networking:
    kubenet: {}
  nonMasqueradeCIDR: redacted/10
  sshAccess:
  - redacted/32
  - redacted/32
  subnets:
  - cidr: 10.242.32.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  - cidr: 10.242.64.0/19
    name: us-west-2b
    type: Public
    zone: us-west-2b
  - cidr: 10.242.96.0/19
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

Can't do it: the cluster is down, and it's irrelevant anyway, as this has to do with CoreOS configuration.

  9. Anything else we need to know?

From https://coreos.com/os/docs/latest/update-strategies.html:

The overarching goal of Container Linux is to secure the Internet's backend infrastructure. We believe that automatically updating the operating system is one of the best tools to achieve this goal.

STRATEGY     DESCRIPTION
etcd-lock    Reboot after first taking a distributed lock in etcd
reboot       Reboot immediately after an update is applied
off          Do not reboot after updates are applied

The strategy defaults to etcd-lock. However, without additional configuration the semaphore lock is taken in a local, non-clustered etcd, which can cause all nodes to reboot at once on upgrade.

To replicate, run "locksmithctl reboot" or "locksmithctl send-need-reboot" concurrently on all nodes.

Expected behavior: nodes reboot one at a time, coordinated by a semaphore in a clustered etcd.

What happens: all nodes reboot immediately, because each master node takes the semaphore in its own local etcd, and the worker nodes don't have an etcd running at all.
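
A quick way to see where the semaphore lives from a shell (a sketch; the endpoint hostname reuses the redacted names from this report):

# Show the reboot semaphore locksmithd would use. With no
# LOCKSMITHCTL_ENDPOINT configured this talks to the node-local etcd,
# and on workers (which run no etcd) it simply fails:
locksmithctl status

# Pointed at the clustered etcd on the masters instead, every node
# sees the same shared semaphore:
locksmithctl --endpoint "http://etcd-a.internal.redacted.k8s.some.tld:4001" status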

Attempts at adding cloud-config to the template fail, as CoreOS doesn't support Multipart/MIME, and making it coexist with the kops bash script is troublesome at best, because CoreOS Ignition runs in the initramfs.

Attempts to mitigate via the (deprecated) coreos-cloudinit --from-file were somewhat successful:

#cloud-config
write_files:
  - path: "/etc/coreos/update.conf"
    permissions: "0644"
    owner: "root"
    content: |
      GROUP=stable
      REBOOT_STRATEGY=etcd-lock
      LOCKSMITHCTL_ENDPOINT=http://etcd-a.internal.redacted.k8s.some.tld:4001,http://etcd-b.internal.redacted.k8s.some.tld:4001,http://etcd-c.internal.redacted.k8s.some.tld:4001
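
Applying it looked roughly like this (a sketch; the file path is illustrative, and coreos-cloudinit is deprecated as noted):

# Apply the cloud-config above in place on a running node:
sudo coreos-cloudinit --from-file=/tmp/update-conf.yaml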

It could also be done via CoreOS cluster discovery, which is probably the least fiddly approach:
https://coreos.com/os/docs/latest/cluster-discovery.html
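
For reference, the discovery route boils down to a cloud-config along these lines (a sketch based on the CoreOS docs, not something kops renders; the token is a placeholder you generate yourself):

#cloud-config
coreos:
  etcd2:
    # Generate a fresh token for a 3-member cluster with:
    #   curl -w "\n" "https://discovery.etcd.io/new?size=3"
    discovery: "https://discovery.etcd.io/<token>"
    advertise-client-urls: "http://$private_ipv4:2379"
    initial-advertise-peer-urls: "http://$private_ipv4:2380"
    listen-client-urls: "http://0.0.0.0:2379,http://0.0.0.0:4001"
    listen-peer-urls: "http://$private_ipv4:2380"

With a shared etcd formed this way, locksmithd's etcd-lock strategy finds a clustered semaphore on the local 4001 endpoint.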

------------- FEATURE REQUEST TEMPLATE --------------------

  1. Describe IN DETAIL the feature/behavior/change you would like to see.

Have kops set up CoreOS etcd clustering (or point locksmithd at the clustered etcd), so that the default automatic updates of CoreOS can't take down all nodes in a cluster at once.

  2. Feel free to provide a design supporting your feature request.
tommij closed this as completed Jul 5, 2018
tommij changed the title from "CoreOS" to "CoreOS - locksmithd semaphore on updates on local(non clustered) etcd, rebooting all nodes at once." Jul 5, 2018
tommij reopened this Jul 5, 2018

KashifSaadat commented Jul 5, 2018

Hi @tommij, you can set the following in your kops ClusterSpec to disable locksmithd when a CoreOS image is used:

  updatePolicy: external

You should be able to use FileAssets in your ClusterSpec to override the update.conf file, for example:

  fileAssets:
  - name: coreos-update-config
    path: /etc/coreos/update.conf
    content: |
      GROUP=stable
      REBOOT_STRATEGY=etcd-lock
      ...
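
For illustration, one way to fill in the endpoint line is to reuse the internal master DNS records from earlier in this report (a sketch only; verify the etcd member names your cluster actually creates):

  fileAssets:
  - name: coreos-update-config
    path: /etc/coreos/update.conf
    content: |
      GROUP=stable
      REBOOT_STRATEGY=etcd-lock
      # Hostnames below are the redacted examples from this report;
      # substitute your own cluster's etcd member records.
      LOCKSMITHCTL_ENDPOINT=http://etcd-a.internal.redacted.k8s.some.tld:4001,http://etcd-b.internal.redacted.k8s.some.tld:4001,http://etcd-c.internal.redacted.k8s.some.tld:4001

Note that updatePolicy: external disables locksmithd entirely, so the two settings are alternatives: either turn automatic reboots off, or keep etcd-lock and point it at a shared etcd.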

tommij commented Jul 5, 2018

Cheers @KashifSaadat - I realise the file could be managed via a FileAsset, but figured that hard-overwriting a file which may gain required fields in the future could be a bad idea.

Turns out that's what the Ignition config does anyway.

I was unaware of the "external" updatePolicy; wouldn't that be counterproductive to running CoreOS in the first place, given that the whole point of CoreOS is to have nodes updated at all times?

All that aside: since CoreOS is considered production-ready by kops, shouldn't one of these solutions be the default (no updates, discovery, or locksmithctl targets pointing at the internal master DNS records)?

I wouldn't normally expect default settings in something production-ready to be able to make my cluster unavailable.

KashifSaadat commented:

No worries. The addition of updatePolicy for CoreOS was a quick fix for users to avoid automatic reboots, but there's a more detailed response from one of the CoreOS maintainers here, listing a few potential options on the way forward: #4909 (review)

We could potentially look at making this option a default for CoreOS, as it would of course be undesired behaviour for kops users to have their nodes all unexpectedly restart at the same time. If you want to take a look at raising a PR that'd be great. :)

tommij commented Jul 6, 2018

Closing this, as it's already covered elsewhere. I may make a PR with an HA documentation update, and will possibly look into a CoreOS provisioning PR.

Thanks @KashifSaadat

tommij closed this as completed Jul 6, 2018