
AWS authenticator stops master from joining cluster #6154

Closed
phillipj opened this issue Dec 4, 2018 · 49 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


phillipj commented Dec 4, 2018

1. What kops version are you running? The command kops version will display this information.

$ kops version
Version 1.10.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

$ kops edit cluster
# .. added authentication section
$ kops rolling-update cluster --instance-group-roles=Master --force --yes

The authentication section added:

authentication:
  aws: {}

as described in: ./docs/authentication.md#aws-iam-authenticator
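For context, the stanza nests under spec in the cluster manifest edited with kops edit cluster; a minimal sketch (only the surrounding spec key is added here):

spec:
  authentication:
    aws: {}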

5. What happened after the commands executed?

kops drained and stopped the first master node. A new EC2 instance was created by the AWS auto scaling group, but that instance was never able to join the cluster. The cluster validation step therefore failed.

$ kops rolling-update cluster --instance-group-roles=Master --force --yes

I1204 11:13:18.976387   85319 instancegroups.go:157] Draining the node: "ip-172-31-26-101.x.y.z".
node "ip-172-31-26-101.x.y.z" cordoned
node "ip-172-31-26-101.x.y.z" cordoned
node "ip-172-31-26-101.x.y.z" drained
I1204 11:13:34.924664   85319 instancegroups.go:338] Waiting for 1m30s for pods to stabilize after draining.
I1204 11:15:04.911031   85319 instancegroups.go:278] Stopping instance "i-0d7d0c586f9a7", node "ip-172-31-26-101.x.y.z", in group "master-x.masters.k8s.y.z" (this may take a while).
I1204 11:20:06.535982   85319 instancegroups.go:188] Validating the cluster.
I1204 11:20:14.653181   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:20:48.953718   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:21:18.639329   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:21:48.260554   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:22:18.447799   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.

master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duation of \"5m0s\""

Seeing this in the master node's logs:

main.go:142] got error running nodeup (will retry in 30s): error building loader: certificate "aws-iam-authenticator" not found
s3fs.go:219] Reading file "s3://x-kops-state/k8s.x.y.z/pki/issued/aws-iam-authenticator/keyset.yaml"

6. What did you expect to happen?

All the master nodes to be updated with the aws-iam-authenticator enabled & ready for action.

/cc @rdrgmnzs


phillipj commented Dec 4, 2018

Any thoughts on what I might be missing before starting the rolling update would be much appreciated!

As far as I've understood, it's not strictly required to pre-create a certificate, but this error might suggest otherwise?

Refs kubernetes-sigs/aws-iam-authenticator/README.md


phillipj commented Dec 4, 2018

While digging through this, I think I've found the origin of certificate "aws-iam-authenticator" not found in ./nodeup/pkg/model/kube_apiserver.go#L232:

certificate, err := b.NodeupModelContext.KeyStore.FindCert(id)
if err != nil {
  return fmt.Errorf("error fetching %q certificate from keystore: %v", id, err)
}
if certificate == nil {
  return fmt.Errorf("certificate %q not found", id)
}

Should there be a corresponding "create certificate" method invoked, or is that handled somewhere else programmatically (or manually)?


phillipj commented Dec 4, 2018

Oh my, I just realised I've completely missed the Kops usage section in aws-iam-authenticator. There's certainly mention of a couple of certificates there that I'll try to create.

I'll close this issue tomorrow if it turns out to be the fix I've been looking for.


qlikcoe commented Dec 4, 2018

I'm pretty sure back in September 2018 I had this working fine with just adding

authentication:
  aws: {}

section to my cluster manifest.
Just yesterday I decided to try it again, and when I try to bootstrap a cluster with aws: {}, the cluster doesn't even validate.


qlikcoe commented Dec 4, 2018

Also, the kops docs don't say you have to do anything other than adding the aws: {} piece:
https://github.com/kubernetes/kops/blob/master/docs/authentication.md


rdrgmnzs commented Dec 5, 2018

@phillipj as @qlikcoe mentioned, you should not need to create those certs; they should be created for you automatically. The only thing you'd need to create is the aws-iam-authenticator configuration.

Just to make sure we're covering everything: did you also add the required rbac stanza to the kops config, as described in https://github.com/kubernetes/kops/blob/master/docs/authentication.md#aws-iam-authenticator?

authorization:
  rbac: {}


phillipj commented Dec 5, 2018

Thanks for the much needed input, guys!

@rdrgmnzs no I have not enabled RBAC yet. I was hoping to do that as the next step after getting authorisation in place across all our clusters. I'll give that a shot within a couple of hours and see if that has any effect on getting this authenticator on its feet.


phillipj commented Dec 5, 2018

Sadly no difference as far as I can see. The master node's nodeup sequence stops because of the same error:

main.go:142] got error running nodeup (will retry in 30s): error building loader: certificate "aws-iam-authenticator" not found
s3fs.go:219] Reading file "s3://x-kops-state/k8s.x.y.z/pki/issued/aws-iam-authenticator/keyset.yaml"
task.go:98] task *nodetasks.UserTask does not implement HasLifecycle
task.go:98] task *nodetasks.File does not implement HasLifecycle
s3fs.go:219] Reading file "s3://x-kops-state/k8s.x.y.z/pki/issued/ca/keyset.yaml"

Can also confirm s3://x-kops-state/k8s.x.y.z/pki/issued/aws-iam-authenticator does not exist at all in the S3 bucket.

@rdrgmnzs mind sharing where/what is supposed to create the certificate, as I haven't found any hints of that in kops?

I tried manually applying ./upup/models/cloudup/resources/addons/authentication.aws/k8s-1.10.yaml directly to see if the automatic certificate creation inside aws-iam-authenticator would be the answer, but it failed because the process did not have permissions to write:

time="2018-12-05T07:36:08Z" level=info msg="generated a new private key and certificate" certBytes=810 keyBytes=1192
time="2018-12-05T07:36:08Z" level=info msg="saving new key and certificate" certPath=/var/aws-iam-authenticator/cert.pem keyPath=/var/aws-iam-authenticator/key.pem
time="2018-12-05T07:36:08Z" level=fatal msg="could not load/generate a certificate" error="open /var/aws-iam-authenticator/cert.pem: permission denied"


rdrgmnzs commented Dec 5, 2018


phillipj commented Dec 5, 2018

Great, really appreciated!

Thanks again for some really useful hints, especially the one in ./pkg/model/pki.go that I haven't stumbled upon yet.


phillipj commented Dec 5, 2018

Realised it's probably worth emphasising I'm adding the authenticator to already existing clusters, not launching new ones where the authenticator is present from the get-go.


qlikcoe commented Dec 5, 2018

@rdrgmnzs, I do have the rbac piece for authorization.

So, right now if I'm bootstrapping the cluster from scratch, with aws-iam-authenticator enabled, the cluster does not validate.

If I'm enabling this for an existing cluster, all three aws-iam-authenticator pods are failing with this:

time="2018-12-04T18:47:10Z" level=info msg="mapping IAM role" groups="[system:masters]" role="arn:aws:iam::XXXXXXXXX:role/SRE_Admin" username="sre-admin:{{SessionName}}"
time="2018-12-04T18:47:10Z" level=info msg="mapping IAM role" groups="[system:masters]" role="arn:aws:iam::XXXXXXXXX:role:role/KubernetesAdmin-coe-eu-west-1" username="kubernetes-admin:{{SessionName}}"
time="2018-12-04T18:47:10Z" level=info msg="mapping IAM role" groups="[heptio-read-only]" role="arn:aws:iam::XXXXXXXXX:role:role/KubernetesReadOnly-coe-eu-west-1" username="kubernetes-read-only:{{SessionName}}"
time="2018-12-04T18:47:16Z" level=info msg="generated a new private key and certificate" certBytes=810 keyBytes=1190
time="2018-12-04T18:47:16Z" level=info msg="saving new key and certificate" certPath=/var/aws-iam-authenticator/cert.pem keyPath=/var/aws-iam-authenticator/key.pem
time="2018-12-04T18:47:16Z" level=fatal msg="could not load/generate a certificate" error="open /var/aws-iam-authenticator/cert.pem: permission denied"


qlikcoe commented Dec 5, 2018

I just switched back to kops 1.9.2, and k8s 1.9.8 and tried to bootstrap the cluster from scratch with aws authenticator enabled in the cluster manifest. No issues, cluster was up and validated...


qlikcoe commented Dec 5, 2018

I'm getting closer now...
So back to kops 1.10.0 and k8s 1.10.11.
If I create the cluster from scratch, it doesn't ever get validated because aws-iam-authenticator pods are crashing.

Here's the log


  Warning  FailedMount            1m (x10 over 5m)  kubelet, ip-10-241-78-35.eu-west-1.compute.internal  MountVolume.SetUp failed for volume "config" : configmaps "aws-iam-authenticator" not found
  Warning  FailedMount            55s (x2 over 3m)  kubelet, ip-10-241-78-35.eu-west-1.compute.internal  Unable to mount volumes for pod "aws-iam-authenticator-gsrcq_kube-system(f969fa4a-f8af-11e8-9f67-06a4300bcc08)": timeout expired waiting for volumes to attach or mount for pod "kube-system"/"aws-iam-authenticator-gsrcq". list of unmounted volumes=[config]. list of unattached volumes=[config output state default-token-gdq9v]

And obviously the configmap is being created after the cluster is up and running, so there is a race condition.

Imported the configmap manually, restarted the pods and they are running fine now. Just one line in the logs is worrying me

time="2018-12-05T17:17:23Z" level=info msg="reconfigure your apiserver with `--authentication-token-webhook-config-file=/etc/kubernetes/heptio-authenticator-aws/kubeconfig.yaml` to enable (assuming default hostPath mounts)"


phillipj commented Dec 5, 2018

I also got some new findings while doing more trial-n-error today.

I've now got three clusters with the authenticator deployed & working as expected. Got it installed more or less painlessly in two of them, while the third one was a tough nut to crack. After trying a rolling update on that cluster's master nodes at least 5+ times, it suddenly just worked. Meaning the certificate error I've described in this issue, and had seen for a couple of days, went away, and the aws-iam-authenticator pods started showing up and working on the rolled nodes.

The only thing I did differently that last time on the troublesome cluster, was running the rolling-update command with -v 10 in hopes that would provide useful debugging info.

An interesting thing I've observed is that when invoking

$ kops rolling-update cluster $CLUSTER_NAME --instance-group-roles=Master --force --yes

on the clusters it works on, the aws-iam-authenticator pods appear immediately when listing running pods in a separate terminal (while the rolling update is still in progress):

$ kubectl get pods --namespace=kube-system
NAME                           READY   STATUS    RESTARTS   AGE   IP              NODE
aws-iam-authenticator-5wqpp    1/1     Running   0          11d   172.26.30.2     ip-172-26-30-2.x.internal
aws-iam-authenticator-ccbgm    1/1     Running   0          11d   172.26.61.194   ip-172-26-61-194.x.internal
aws-iam-authenticator-rgp6n    1/1     Running   0          11d   172.26.41.187   ip-172-26-41-187.x.internal

...

If those pods do not appear immediately, it's always due to the certificate error I've described earlier, causing the nodeup procedure to fail and thus preventing those master nodes from joining the cluster.

Since I've experienced this somewhat intermittently (it works on some clusters right away, while others might need several attempts), could this be an (async) timing issue or similar? I've got yet another four clusters to go before I'm done, so it'll be interesting to see how often this occurs and if I'm able to find a reproducible fix.

In a rolling update procedure, when is ./pkg/model/pki.go executed vs the code that expects those pki's to have been put in place (./nodeup/pkg/model/kube_apiserver.go)?

P.S. as a follow-up to @qlikcoe's comment about missing ConfigMap, I've already created those in all these clusters before kicking off the rolling updates.


rdrgmnzs commented Dec 6, 2018

Thanks for all the info and debugging. The certs are pulled during the host provisioning process with protokube, which is why I'm baffled by this issue. If you happen to hit this again, could you please take a look at the protokube logs and check if it is failing to pull the certs for some reason?
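For anyone needing to dig those logs out: protokube runs as a Docker container on the masters and nodeup runs under the kops-configuration systemd unit, as far as I know, so roughly:

# on the affected master
docker ps -a | grep protokube           # is the protokube container even there?
docker logs <protokube-container-id>    # protokube output, including keyset/cert pulls
journalctl -u kops-configuration        # nodeup output, where the "certificate not found" error appears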


phillipj commented Dec 6, 2018

Aha, yeah I'll look more into the protokube logs going forward.

I've gotten more suspicious about what makes the certificate creation kick off at the right moment though. Because in the clusters I've had issues with so far, no certificates have been created and put in the S3 kops state bucket before the nodes get rolled. That would also mean there's nothing for protokube to download, unless protokube is also responsible for creating them by running the ./pkg/model/pki.go procedure?
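One way to check that theory before rolling the nodes is to look at the state store directly (bucket and cluster name are placeholders):

aws s3 ls s3://<kops-state-bucket>/<cluster-name>/pki/issued/aws-iam-authenticator/
# if this prefix is empty, nothing has issued the keypair yet and nodeup on the new
# master will keep failing with: certificate "aws-iam-authenticator" not found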


phillipj commented Dec 6, 2018

Ran into this again with a new cluster today. Wanted to get a look inside the protokube logs but sadly wasn't able to:

# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
#

Assuming the nodeup procedure errors out too early in the process, so protokube isn't even started.

Earlier today I was able to fix a troublesome cluster with these steps:

# 1. reverting authentication config in cluster
$ kops edit cluster

# 2. trigger a cloud only rolling update
$ kops rolling-update cluster $CLUSTER_NAME --instance-group-roles=Master --force --yes --cloudonly

# 3. re-introduce authentication config again
$ kops edit cluster

# 4. trigger normal rolling update again
$ kops rolling-update cluster $CLUSTER_NAME --instance-group-roles=Master --force --yes

Ended up doing a cloud-only update in step 2 because 2 of 3 master nodes were not in the cluster due to these certificate errors at that point. I could therefore not run an ordinary rolling update, because neither kops nor kubectl was able to communicate with or validate the cluster at all.

On second thought, with the first cluster I fixed this issue in, I'm pretty certain I also did a revert & cloud-only rolling update to get it on its feet again, just before running rolling-update with -v 10 as described in my comment from yesterday.

@rdrgmnzs is getting hold of the configured cluster authentication data from two different places in ./pkg/model/pki.go#L296 deliberate? Or could this be a source of inconsistent state for some reason? (Yes, I'm wildly guessing here.)

if b.Cluster.Spec.Authentication != nil {
  if b.KopsModelContext.Cluster.Spec.Authentication.Aws != nil {
    ...
  }
}

@phillipj

I've put further work on this on hold until I get some more input on what might be causing this to happen.

Still got three production clusters that need the authenticator. Tried deploying it to two of them; both failed with the certificate error described in this issue originally. As I don't want to cause unnecessary service disruptions in production clusters by doing several rolling update attempts in a row in hopes it suddenly just works, I've put this on hold as mentioned.

@rdrgmnzs any chance you're able to answer my latest question in my comment above?


prageethw commented Jan 20, 2019

@phillipj @rdrgmnzs I can confirm I'm seeing the same issue. My kops config is as below:

  authentication:
    aws: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws

kops logs

WARNING: Ignoring DaemonSet-managed pods: aws-iam-authenticator-r4stk
pod/dns-controller-547884bc7f-v7ks9 evicted
I0121 00:47:37.517693   11370 instancegroups.go:358] Waiting for 1m30s for pods to stabilize after draining.
I0121 00:49:07.518841   11370 instancegroups.go:185] deleting node "ip-172-20-58-7.us-east-2.compute.internal" from kubernetes
I0121 00:49:07.809931   11370 instancegroups.go:299] Stopping instance "i-0388f700d7cfba86e", node "ip-172-20-58-7.us-east-2.compute.internal", in group "master-us-east-2a.masters.prageethw.co.k8s.local" (this may take a while).
I0121 00:49:09.455796   11370 instancegroups.go:198] waiting for 5m0s after terminating instance
I0121 00:54:09.451323   11370 instancegroups.go:209] Validating the cluster.
I0121 00:54:15.928802   11370 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: kube-system pod "aws-iam-authenticator-drknr" is not healthy

iam auth pod log

time="2019-01-20T16:46:05Z" level=info msg="mapping IAM role" groups="[system:masters]" role="arn:aws:iam::326444312331:role/KubernetesAdmin" username=kubernetes-admin
time="2019-01-20T16:46:07Z" level=info msg="generated a new private key and certificate" certBytes=810 keyBytes=1192
time="2019-01-20T16:46:07Z" level=info msg="saving new key and certificate" certPath=/var/aws-iam-authenticator/cert.pem keyPath=/var/aws-iam-authenticator/key.pem
time="2019-01-20T16:46:07Z" level=fatal msg="could not load/generate a certificate" error="open /var/aws-iam-authenticator/cert.pem: permission denied"

kops version

Version 1.11.0

k8s version

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-13T23:15:13Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

@phillipj

Thanks a lot for confirming it's not only me! 😄

I'm still awaiting further rollout to production clusters because of this. I fully understand the original contributor doesn't have time to dig further into this; at the same time I don't know who else to ping.


prageethw commented Jan 21, 2019

Hi @chrislovecnm @justinsb, sorry to bother you guys on this, but as @rdrgmnzs seems to be busy or away, is this something you can help with please? I looked up your names from the git repo.


prageethw commented Jan 21, 2019

@phillipj I think I kind of found a workaround; it seems to be working so far, but it has increased fresh cluster creation time by at least 20 mins. I think we need a permanent solution though.

  # apply config for iam authenticator
    kubectl apply -f iam-config-map.yaml
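(For reference, a minimal sketch of what an iam-config-map.yaml along the lines of the kops authentication docs could look like; the cluster ID, account ID and role are placeholders:)

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-iam-authenticator
  namespace: kube-system
  labels:
    k8s-app: aws-iam-authenticator
data:
  config.yaml: |
    clusterID: <your-cluster-name>
    server:
      mapRoles:
      - roleARN: arn:aws:iam::<account-id>:role/KubernetesAdmin
        username: kubernetes-admin
        groups:
        - system:masters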

edit the kops manifest and save it with:

  authentication:
    aws: {}
  authorization:
    rbac: {}

then run the commands below:

    kops update cluster $NAME --yes
    kops rolling-update cluster ${NAME} --instance-group-roles=Master  --cloudonly --force --yes
    kops validate cluster

The results were:
kubectl get pods -n kube-system | grep iam

aws-iam-authenticator-4dtsh                                            1/1     Running   0          21m
aws-iam-authenticator-6zwt7                                            1/1     Running   0          30m
aws-iam-authenticator-rhr46                                            1/1     Running   0          26m


rdrgmnzs commented Jan 23, 2019

Hi guys, I'm still working on this while also dealing with work and real life. Unfortunately I still have not been able to replicate this issue on either new or existing clusters. If you are able to share pastebins of your kops-configuration logs, protokube logs, kubelet logs and kops configs, it may help me identify what is causing the issue here. If any of you are able to identify what is causing the issue, PRs are always welcome as well.

In the meantime, if this is a blocker for you, remember that you can still manually deploy the authenticator without relying on kops to do so. Chris Hein has a great blog post on how to do so here

@rdrgmnzs

@phillipj pulling it in two different places is done purposefully; that part of the code is not generating anything, only pulling the certs from S3. Even if the code was in the same location, once the tasks are added to the queue there is no guarantee that they would be pulled in sequence. These tasks are also completed before the kubelet is started, and therefore the sequence they are executed in does not matter.

@phillipj

Thanks a lot for taking the time to answer. I've got the same OSS/work/real life challenges, so I respect that deeply, sorry if I've stressed you!

Good reminder that it's still possible to get the authenticator set up and running manually though. I've done that a couple of times before without too much pain.

Here's the kops/cluster configuration for the last cluster I tried, but ended up rolling back again: config.yaml. The other clusters I've tried are more or less identical when it comes to kops and how they're structured.

Full disclosure: within the next couple of weeks, I probably won't be able to give this a shot to get more logs to you. The kops configuration


kplimack commented Feb 23, 2019

I'm also experiencing this

Feb 23 01:25:11 ip-10-78-1-43 nodeup[754]: W0223 01:25:11.489457     754 main.go:142] got error running nodeup (will retry in 30s): error building loader: certificate "aws-iam-authenticator" not found
aws s3 ls s3://state.derp/derp/pki/issued/
                           PRE apiserver-aggregator-ca/
                           PRE apiserver-aggregator/
                           PRE apiserver-proxy-client/
                           PRE ca/
                           PRE kops/
                           PRE kube-controller-manager/
                           PRE kube-proxy/
                           PRE kube-scheduler/
                           PRE kubecfg/
                           PRE kubelet-api/
                           PRE kubelet/
                           PRE master/
 kubectl logs aws-iam-authenticator-srpsz
time="2019-02-23T01:05:07Z" level=info msg="mapping IAM role" groups="[system:masters]" role="arn:aws:iam::123123123:role/eng-opseng" username="aws:admin:{{AccountID}}/{{SessionName}}"
time="2019-02-23T01:05:09Z" level=info msg="generated a new private key and certificate" certBytes=811 keyBytes=1193
time="2019-02-23T01:05:09Z" level=info msg="saving new key and certificate" certPath=/var/aws-iam-authenticator/cert.pem keyPath=/var/aws-iam-authenticator/key.pem
time="2019-02-23T01:05:09Z" level=fatal msg="could not load/generate a certificate" error="open /var/aws-iam-authenticator/cert.pem: permission denied"

kops Version 1.11.0
kube version v1.11.6


xrstf commented Feb 27, 2019

Ran into the same issue with Kops 1.11 and k8s 1.11.6 and instead of just doing a rolling-update to all master nodes, I had to run kops update cluster --yes like @prageethw found out, which told me

I0227 22:10:53.545210   18483 executor.go:103] Tasks: 0 done / 81 total; 35 can run
I0227 22:10:55.053516   18483 executor.go:103] Tasks: 35 done / 81 total; 26 can run
I0227 22:10:55.551559   18483 vfs_castore.go:736] Issuing new certificate: "aws-iam-authenticator"
I0227 22:10:55.821137   18483 executor.go:103] Tasks: 61 done / 81 total; 18 can run
I0227 22:10:57.059139   18483 executor.go:103] Tasks: 79 done / 81 total; 2 can run
I0227 22:10:57.193209   18483 executor.go:103] Tasks: 81 done / 81 total; 0 can run

Afterwards the nodes came up and the authenticator was running as expected.
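So for anyone else stuck here, the sequence that ended up working for me boils down to the below (assuming the aws-iam-authenticator configmap already exists in the cluster):

kops edit cluster                  # add authentication: aws: {}
kops update cluster --yes          # this is the step that issues the "aws-iam-authenticator" certificate
kops rolling-update cluster --instance-group-roles=Master --force --yes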

@kplimack

@xrstf if you tail the iam-authenticator pod logs, are they actually happy? When I tried the same thing the certs were generated with the wrong file mode and the pods couldn't read them.


xrstf commented Feb 28, 2019

At the point when the masters did not join the cluster I could not tail the logs because I had no SSH access to the masters and was more or less blind. It was while I was replacing Kops' SSH key (as per https://github.com/kubernetes/kops/blob/master/docs/security.md#ssh-access) that I accidentally ran the kops update cluster command, saw the output and then suddenly could roll-update the masters and have them working.


flands commented Mar 4, 2019

Same issue and the workaround appears to be running update --yes (you can even do this after rolling-update cluster --instance-group-roles=Master --force --yes and it fails). I am running Kops 1.11 and k8s 1.11.6.

flands added a commit to flands/kops that referenced this issue Mar 4, 2019

kplimack commented Mar 5, 2019

I just brought up a fresh cluster and gave that a go; the rolling update is failing on unhappy authenticator pods:

I0305 12:04:45.289847   49093 vfs_castore.go:736] Issuing new certificate: "aws-iam-authenticator"
I0305 12:16:16.947263   49101 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: kube-system pod "aws-iam-authenticator-4sfdc" is not healthy.
I0305 12:16:47.360898   49101 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: kube-system pod "aws-iam-authenticator-4sfdc" is not healthy.
E0305 12:17:16.282664   49101 instancegroups.go:214] Cluster did not validate within 5m0s
  Warning  FailedMount  3m58s (x19 over 26m)  kubelet, ip-10-78-33-168.us-east-2.compute.internal  MountVolume.SetUp failed for volume "config" : configmaps "aws-iam-authenticator" not found
 kubectl get cm --all-namespaces
NAMESPACE       NAME                                 DATA   AGE
kube-system     calico-config                        3      19h
kube-system     extension-apiserver-authentication   6      19h
kube-system     kube-dns-autoscaler                  1      19h
sysdig-agents   sysdig-agent                         1      1h


rdrgmnzs commented Mar 5, 2019

@kplimack in your case it looks like the issue is that the config map with the Authenticator config has not yet been created. Please create it following the Docs at https://github.com/kubernetes/kops/blob/master/docs/authentication.md

Looking at the wording in the docs, it says to create the config after a cluster rotation; it should actually say to create the config before the cluster rotation. That is my mistake and I'll get it fixed.


flands commented Mar 5, 2019

From testing, the configmap can be added after, but the rolling-update needs the --cloudonly flag to ignore validation.


kplimack commented Mar 5, 2019

@rdrgmnzs I created the configMap and ran another rolling update with --cloudonly and everything came up happy this time. Thanks for your help!

@rdrgmnzs

Is anyone still having issues with this? If so, can you try to follow the updated documentation and let me know if you still see issues?

@tonymills

Hi,
I'm running kops 1.14-alpha2 and bringing up a 1.14 cluster. I followed the doc to the letter, and without the --cloudonly flag it won't validate and doesn't work.

Maybe the doc needs the --cloudonly flag put in it:

  1. apply a configmap
  2. kops edit cluster and add authentication: aws: {}
  3. kops rolling-update cluster ${NAME} --instance-group-roles=Master --force --yes --cloudonly


flands commented May 23, 2019

Hmm, this was fixed: #6575, but then changed again: #6701. The --cloudonly option is required.

@rdrgmnzs

The --cloudonly flag should not be used for this; if it is required, there is most likely a misconfiguration of the configmap preventing the authenticator from starting up properly, or some other issue with the cluster.

@tonymills I've got a few questions for you so I can help debug this and check if there are any code changes required:

  1. When you did a kops edit, did you also add authorization: rbac: {} or was that already there?
  2. Did you also perform a kops update cluster ${CLUSTER_NAME} --yes after adding the authenticator?
  3. Can you do a describe of the authenticator pods & see if there are any errors starting? If they are starting, can you please check their logs for any errors? (Rough commands sketched below.)
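Something along these lines should do it (pod names are placeholders):

kubectl -n kube-system get pods | grep aws-iam-authenticator
kubectl -n kube-system describe pod <aws-iam-authenticator-pod>
kubectl -n kube-system logs <aws-iam-authenticator-pod>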


flands commented May 24, 2019

Did something change between the previous version and this version? The issue in the past was that enabling the authenticator resulted in master nodes not coming up healthy, and thus in a multi-master configuration you could never validate and complete the upgrade.

@rdrgmnzs

@flands no, mostly just a clarification of the process to turn it on.

The issue a lot of folks were seeing before is that they were turning on the AWS authenticator without applying the configmap first. The issue there is that the aws-iam-authenticator lives in the kube-system namespace and requires the configmap to start up and be in a "Running" state. Kops does a check on the clusters to ensure all pods in the kube-system namespace are in a "Running" state when rolling the cluster, and because the configmap was missing, aws-iam-authenticator would not start up properly and would prevent the Kops health check from passing.
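A quick way to spot which kube-system pod is tripping that check (just a sketch, not Kops' exact validation logic):

kubectl -n kube-system get pods --field-selector=status.phase!=Running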

@phillipj

Sorry I haven't had the chance to try the updated procedure, been busy with other tasks lately.

The issue a lot of folks were seeing before is that they were turning on AWS authenticator without applying the configmap first.

I don't doubt that has been the main issue for many, though for the record I'm certain the configmap existed in all the clusters I've tried, before deploying the authenticator with kops.

Ref your earlier comment, are you hinting that rbac has to be enabled first, before the authenticator is enabled as a separate step?


flands commented May 24, 2019

Hmm, when I get the chance I can try again, but I definitely had the configmap beforehand.


jaygorrell commented Jun 10, 2019

I'm hitting a similar issue. What I experienced was this:

  • Add authentication: aws: {} to existing cluster
  • Daemonset was created automatically (no update cluster or rolling-update)
  • Pods failed to create due to no configmap
  • Configmap was created; pods started crashing due to cert problems
  • Rolling update on masters performed
  • Pods became healthy

At this point, I'm getting the following message, though:

time="2019-06-10T17:46:40Z" level=info msg="reconfigure your apiserver with `--authentication-token-webhook-config-file=/etc/kubernetes/heptio-authenticator-aws/kubeconfig.yaml` to enable (assuming default hostPath mounts)"

And I confirmed the flag is not passed to the api-server pod.
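A couple of ways to double-check whether the flag made it onto the apiserver (the manifest path and pod naming are my assumptions about a kops master):

# on a master node, in the static pod manifest kops writes out
grep authentication-token-webhook /etc/kubernetes/manifests/kube-apiserver.manifest

# or via the mirror pod
kubectl -n kube-system get pod kube-apiserver-<master-node-name> -o yaml | grep authentication-token-webhook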

@rory-ye-nv

Same problem here.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Nov 5, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 5, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
