AWS authenticator stops master from joining cluster #6154
Comments
Any thoughts on what I might be missing before starting the rolling update would be much appreciated! As far as I've understood, it's not strictly required to pre-create a certificate, but this error might suggest otherwise? |
While digging through this, I assume I found the origin of the error:

```go
certificate, err := b.NodeupModelContext.KeyStore.FindCert(id)
if err != nil {
	return fmt.Errorf("error fetching %q certificate from keystore: %v", id, err)
}
if certificate == nil {
	return fmt.Errorf("certificate %q not found", id)
}
```

Should there be a corresponding "create certificate" method invoked, or is that handled somewhere else programmatically (or manually)? |
Oh my, I just realised I've completely missed the Kops usage section in aws-iam-authenticator. There's certainly mention of a couple of certificates there that I'll try to create. I'll close this issue tomorrow if it turns out to be the fix I've been looking for. |
I'm pretty sure back in September 2018 I had this working fine with just adding the authentication section to my cluster manifest. |
Also, the kops doc doesn't say you have to do anything else other than adding that section. |
@phillipj as @qlikcoe mentioned, you should not need to create those certs; they should be created automatically for you. The only thing you'd need to create is the aws-iam-authenticator configuration. Just to make sure we're covering everything, did you also add the required rbac stanza to the kops config as described in https://github.com/kubernetes/kops/blob/master/docs/authentication.md#aws-iam-authenticator?
|
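(For reference, the cluster spec stanzas being discussed look roughly like this; a minimal sketch based on the configs shared later in this thread:)

```yaml
# kops cluster spec excerpt: enable the AWS IAM authenticator and RBAC
authentication:
  aws: {}
authorization:
  rbac: {}
```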
Thanks for the much needed input, guys! @rdrgmnzs no, I have not enabled RBAC yet. I was hoping to do that as the next step after getting authentication in place across all our clusters. I'll give that a shot within a couple of hours and see if that has any effect on getting this authenticator on its feet. |
Sadly no difference as far as I can see. The master node's nodeup sequence stops because of the same error:
Can also confirm. @rdrgmnzs, mind sharing where/what is supposed to create the certificate? I haven't found any hints of that in kops. I tried manually applying ./upup/models/cloudup/resources/addons/authentication.aws/k8s-1.10.yaml directly to see if the automatic certificate creation inside aws-iam-authenticator would be the answer, but it failed because the process did not have permissions to write:
|
@phillipj they get created in ./pkg/model/pki.go and loaded at https://github.com/rdrgmnzs/kops/blob/master/nodeup/pkg/model/kube_apiserver.go#L240 and https://github.com/rdrgmnzs/kops/blob/master/nodeup/pkg/model/kube_apiserver.go#L264. I'll try to launch a cluster tomorrow to see if I can reproduce the issue. |
Great, really appreciated! Thanks again for some really useful hints, especially the one in ./pkg/model/pki.go that I haven't stumbled upon yet. |
Realised it's probably worth emphasising I'm adding the authenticator to already existing clusters, not launching new ones where the authenticator is present from the get-go. |
@rdrgmnzs, I do have that stanza in my config. So right now, if I'm bootstrapping the cluster from scratch with aws-iam-authenticator enabled, the cluster does not validate. If I'm enabling this for an existing cluster, all three aws-iam-authenticator pods are failing with this:
|
I just switched back to |
I'm getting closer now... Here's the log
And obviously the configmap is being created after the cluster is up and running, so there is a race condition. I imported the configmap manually and restarted the pods, and they are running fine now. Just one line in the logs is worrying me:
|
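(For anyone else hitting the race: the ConfigMap in question is the aws-iam-authenticator config in kube-system. A minimal sketch; the name, label and example role ARN below are placeholders/assumptions based on the kops and aws-iam-authenticator docs:)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-iam-authenticator
  namespace: kube-system
  labels:
    k8s-app: aws-iam-authenticator
data:
  config.yaml: |
    clusterID: my-cluster.example.com        # placeholder cluster name
    server:
      mapRoles:
        - roleARN: arn:aws:iam::000000000000:role/KubernetesAdmin   # placeholder role
          username: kubernetes-admin
          groups:
            - system:masters
```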
I also got some new findings while doing more trial-and-error today. I've now got three clusters with the authenticator deployed & working as expected. Got it installed more or less painlessly in two of them, while the third one was a tough nut to crack. After trying a rolling update on that cluster's master nodes at least 5+ times, it suddenly just worked. Meaning the certificate error I've described in this issue, and had seen for a couple of days, went away and the aws-iam-authenticator pods started showing up and working on the rolled nodes. The only thing I did differently that last time on the troublesome cluster was how I ran the rolling-update command.

An interesting thing I've observed is that when invoking the rolling update on the clusters it works on, the aws-iam-authenticator pods appear immediately when listing running pods in a separate terminal (while the rolling update is still in progress):

If those pods do not appear immediately, it's always due to the certificate error I've described earlier, causing the nodeup procedure to fail and thus preventing those master nodes from joining the cluster. Since I've experienced this somewhat intermittently (it works on some clusters right away, while others might need several attempts), could this be an (async) timing issue or similar? I've got yet another four clusters to go before I'm done, so it'll be interesting to see how often this occurs and whether I'm able to find a reproducible fix. In a rolling-update procedure, when is ./pkg/model/pki.go executed vs. the code that expects those PKI assets to have been put in place (./nodeup/pkg/model/kube_apiserver.go)? P.S. as a follow-up to @qlikcoe's comment about missing |
Thanks for all the info and debugging. The certs are pulled during the host provisioning process with protokube, which is why I'm baffled by this issue. If you happen to hit this again, could you please take a look at the protokube logs and check if it is failing to pull the certs for some reason. |
Aha, yeah I'll look more into the protokube logs going forward. I've gotten more suspicious about what makes the certificate creation kick off at the right moment, though. Because in the clusters I've had issues with so far, no certificates have been created and put in the S3 kops state bucket before the nodes get rolled. That would also mean there's nothing for protokube to download, as long as it isn't responsible for creating them by running the ./pkg/model/pki.go procedure as well? |
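(If it helps anyone else debugging, one rough way to check whether the certificate ever made it into the state store; the pki/issued layout and the aws-iam-authenticator keypair name are assumptions about how kops organises its state bucket:)

```sh
# List issued certs for the authenticator keypair in the kops state store.
# KOPS_STATE_STORE is expected to be an s3:// URL, e.g. s3://my-kops-state
aws s3 ls "${KOPS_STATE_STORE}/${CLUSTER_NAME}/pki/issued/aws-iam-authenticator/"
```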
Ran into this again with a new cluster today. Wanted to get a look inside protokube logs but sadly wasn't able to;
Assuming the nodeup procedure errors out too early in the process, so protokube isn't even started. Earlier today I was able to fix a troublesome cluster with these steps:
Ended up doing a cloud-only update in step 2 because 2 of 3 master nodes were not in the cluster due to these certificate errors at that point, and I could therefore not run an ordinary rolling update.

On second thought, with the first cluster I fixed this issue in, I'm pretty certain I also did a revert & cloud-only rolling update to get it on its feet again, just before the rolling-update that finally worked.

@rdrgmnzs, is getting hold of the configured cluster authentication data from two different places in ./pkg/model/pki.go#L296 deliberately done? Or could this be a source of inconsistent state for some reason? (Yes, I'm wildly guessing here.)

```go
if b.Cluster.Spec.Authentication != nil {
	if b.KopsModelContext.Cluster.Spec.Authentication.Aws != nil {
		...
	}
}
```
|
I've put further work on this on hold until I get some more input on what might be causing this to happen. Still got three production clusters that need the authenticator. Tried deploying it to two of them; both failed with the certificate error described in this issue originally. As I don't want to cause unnecessary service disruptions in production clusters by doing several rolling-update attempts in a row in the hope it suddenly just works, I've put this on hold as mentioned. @rdrgmnzs any chance you're able to answer the latest question in my comment above? |
@phillipj @rdrgmnzs I can confirm I'm seeing the same issue. My kops config is as below:

```yaml
authentication:
  aws: {}
authorization:
  rbac: {}
channel: stable
cloudProvider: aws
```

kops logs:

```
WARNING: Ignoring DaemonSet-managed pods: aws-iam-authenticator-r4stk
pod/dns-controller-547884bc7f-v7ks9 evicted
I0121 00:47:37.517693 11370 instancegroups.go:358] Waiting for 1m30s for pods to stabilize after draining.
I0121 00:49:07.518841 11370 instancegroups.go:185] deleting node "ip-172-20-58-7.us-east-2.compute.internal" from kubernetes
I0121 00:49:07.809931 11370 instancegroups.go:299] Stopping instance "i-0388f700d7cfba86e", node "ip-172-20-58-7.us-east-2.compute.internal", in group "master-us-east-2a.masters.prageethw.co.k8s.local" (this may take a while).
I0121 00:49:09.455796 11370 instancegroups.go:198] waiting for 5m0s after terminating instance
I0121 00:54:09.451323 11370 instancegroups.go:209] Validating the cluster.
I0121 00:54:15.928802 11370 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: kube-system pod "aws-iam-authenticator-drknr" is not healthy
```

iam auth pod log:

```
time="2019-01-20T16:46:05Z" level=info msg="mapping IAM role" groups="[system:masters]" role="arn:aws:iam::326444312331:role/KubernetesAdmin" username=kubernetes-admin
time="2019-01-20T16:46:07Z" level=info msg="generated a new private key and certificate" certBytes=810 keyBytes=1192
time="2019-01-20T16:46:07Z" level=info msg="saving new key and certificate" certPath=/var/aws-iam-authenticator/cert.pem keyPath=/var/aws-iam-authenticator/key.pem
time="2019-01-20T16:46:07Z" level=fatal msg="could not load/generate a certificate" error="open /var/aws-iam-authenticator/cert.pem: permission denied"
```

kops version:

```
Version 1.11.0
```

k8s version:

```
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-13T23:15:13Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
```
|
Thanks a lot for confirming it's not only me! 😄 I'm still awaiting further rollout to production clusters because of this. I fully understand the original contributor doesn't have time to dig further into this; at the same time, I don't know who else to ping. |
Hi @chrislovecnm @justinsb, sorry to bother you guys on this. As @rdrgmnzs seems to be busy or away, is this something you can help with, please? I looked up your names from the git repo. |
@phillipj I think I kind of found a workaround; it seems to be working so far, but it has increased fresh cluster creation delay by at least 20 mins. I think we need a permanent solution though.

```sh
# apply config for iam authenticator
kubectl apply -f iam-config-map.yaml
```

Edit the kops manifest and save with:

```yaml
authentication:
  aws: {}
authorization:
  rbac: {}
```

Then apply the commands below:

```sh
kops update cluster $NAME --yes
kops rolling-update cluster ${NAME} --instance-group-roles=Master --cloudonly --force --yes
kops validate cluster
```

Results were:

```
aws-iam-authenticator-4dtsh   1/1   Running   0   21m
aws-iam-authenticator-6zwt7   1/1   Running   0   30m
aws-iam-authenticator-rhr46   1/1   Running   0   26m
```
|
Hi guys, I'm still working on this while also dealing with work and real life. Unfortunately I still have not been able to replicate this issue on either new or existing clusters. If you are able to share pastebins of your kops-configuration logs, protokube logs, kubelet logs and kops configs, it may help me identify what is causing the issue here. If any of you is able to identify what is causing the issue, PRs are always welcome as well. In the meantime, if this is a blocker for you, remember that you can still manually deploy the authenticator without relying on kops to do so. Chris Hein has a great blog post on how to do so here. |
@phillipj pulling it in two different places is done purposefully; that part of the code is not generating anything, only pulling the certs from S3. Even if the code were in the same location, once the tasks are added to the queue there is no guarantee that they would be pulled in sequence. These tasks are also completed before the kubelet is started, and therefore the sequence they are executed in does not matter. |
Thanks a lot for taking the time to answer. I've got the same OSS/work/real life challenges, so I respect that deeply; sorry if I've stressed you! Good reminder that it's still possible to get the authenticator setup up-n-running manually though. I've done that a couple of times before without too much pain. Here's the kops/cluster configuration for the last cluster I tried, but ended up rolling back again: config.yaml. The other clusters I've tried are more or less identical when it comes to kops and how they're structured. Full disclosure: within the next couple of weeks, I probably won't be able to give this another shot to get more logs to you. The kops configuration |
I'm also experiencing this
kops Version 1.11.0 |
Ran into the same issue with Kops 1.11 and k8s 1.11.6 and instead of just doing a rolling-update to all master nodes, I had to run
Afterwards the nodes came up and the authenticator was running as expected. |
@xrstf if you tail the iam-authenticator pod logs, are they actually happy? When I tried the same thing, the certs were generated with the wrong file mode and the pods couldn't read them. |
At the point when the masters did not join the cluster I could not tail the logs because I had no SSH access to the masters and was more or less blind. It was while I was replacing Kops' SSH key (as per https://github.com/kubernetes/kops/blob/master/docs/security.md#ssh-access) that I accidentally ran the |
Same issue and the workaround appears to be running |
I just brought up a fresh cluster and gave that a go; the rolling update is failing on unhappy authenticator pods:
|
@kplimack in your case it looks like the issue is that the config map with the authenticator config has not yet been created. Please create it following the docs at https://github.com/kubernetes/kops/blob/master/docs/authentication.md. Looking at the wording in the docs, it says to create the config after a cluster rotation; the config should actually be created before the cluster rotation. That is my mistake and I'll get it fixed. |
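(In other words, the ordering that seems to work, sketched from this thread; the ConfigMap filename below is just a placeholder:)

```sh
# 1. Create the authenticator ConfigMap first (placeholder filename)
kubectl apply -f aws-iam-authenticator-config.yaml

# 2. Enable the authenticator in the cluster spec (add the authentication/authorization stanzas)
kops edit cluster ${NAME}

# 3. Apply the change and roll the masters only after the ConfigMap exists
kops update cluster ${NAME} --yes
kops rolling-update cluster ${NAME} --instance-group-roles=Master --yes
```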
From testing, the configmap can be added after, but the |
@rdrgmnzs I created the |
Is anyone still having issues with this? If so can you try to follow the updated documentation and let me know if you still see issues. |
Hi, maybe the doc needs the --cloudonly flag put in it
|
@tonymills I've got a few questions for you so I can help debug this and check if there are any code changes required:
|
Did something change between the previous version and this version? The issue in the past was that enabling the authenticator resulted in master nodes not coming up healthy, and thus in a multi-master configuration you could never validate and complete the upgrade. |
@flands no, mostly just a clarification of the process to turn it on. The issue a lot of folks were seeing before is that they were turning on the AWS authenticator without applying the configmap first. The issue there is that aws-iam-authenticator lives in the kube-system namespace and requires the configmap to start up and be in a "Running" state. Kops does a check on the cluster to ensure all pods in the kube-system namespace are in a "Running" state when rolling the cluster, and because the configmap was missing, aws-iam-authenticator would not start up properly and would prevent the kops health check from passing. |
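(A quick way to verify both preconditions before rolling, assuming the addon uses the aws-iam-authenticator ConfigMap name and the k8s-app label from the kops addon manifest:)

```sh
# Confirm the authenticator config exists in kube-system before the rolling update
kubectl -n kube-system get configmap aws-iam-authenticator

# Watch the authenticator pods during the roll; they must reach Running for validation to pass
kubectl -n kube-system get pods -l k8s-app=aws-iam-authenticator -w
```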
Sorry I haven't had the chance to try the updated procedure, been busy with other tasks lately.
I don't doubt that has been the main issue for many, though for the record I'm certain the configmap has existed in all the clusters I've tried before deploying the authenticator with kops. Ref your earlier comment, are you hinting that rbac has to be enabled first, before the authenticator is enabled as a separate step? |
Hmm when I get the chance I can try again, but definitely had the configmap beforehand. |
I'm hitting a similar issue. What I experienced was this:
At this point, I'm getting the following message, though:
And I confirmed the flag is not passed to the api-server pod. |
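(If it's useful to others debugging the same thing, one way to check on a master node whether the webhook flag actually reached the apiserver; the flag name and manifest path are assumptions about how kops wires the authenticator into kube-apiserver:)

```sh
# Inspect the static pod manifest kops writes for the apiserver
grep "authentication-token-webhook-config-file" /etc/kubernetes/manifests/kube-apiserver.manifest

# Or check the flags of the running process
ps aux | grep "[k]ube-apiserver" | tr ' ' '\n' | grep "authentication-token-webhook"
```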
same problem here. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1. What kops version are you running? The command kops version will display this information.

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
The authentication section added, as described in ./docs/authentication.md#aws-iam-authenticator

5. What happened after the commands executed?
kops drained and stopped the first master node. A new EC2 instance was created by the AWS auto scaling group, but that instance was never able to join the cluster. The cluster validation step therefore failed.
Seeing this in the master nodes' logs:

6. What did you expect to happen?
All the master nodes to be updated with the aws-iam-authenticator enabled & ready for action.

/cc @rdrgmnzs