Deploying External Cloud Provider with Helm Controller - Chicken:Egg problem #1807

Closed
jgreat opened this issue May 18, 2020 · 12 comments · Fixed by k3s-io/helm-controller#57

jgreat commented May 18, 2020

Version:
v1.18.2+k3s1

K3s arguments:

provider_id="$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

k3s server \
  --disable-cloud-controller \
  --disable servicelb \
  --disable local-storage \
  --disable traefik \
  --datastore-endpoint=${k3s_db_endpoint} \
  --token="${k3s_server_token}" \
  --agent-token="${k3s_agent_token}" \
  --node-name="$(hostname -f)" \
  --kubelet-arg="cloud-provider=external" \
  --kubelet-arg="provider-id=aws:///$provider_id" \
  --write-kubeconfig-mode=644 \
  --tls-san=${elb_dns}

Describe the bug
I would like to install the AWS CCM (external cloud provider) with a helm chart via the helm chart controller, by placing a HelmChart manifest in /var/lib/rancher/k3s/server/manifests, but the helm chart controller's install pod won't schedule on a new cluster because all of the nodes are tainted with node.cloudprovider.kubernetes.io/uninitialized: true. So I need a cloud provider to install my cloud provider.

I can remove the taint on one of the nodes, but then that node doesn't get properly initialized.
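
For reference, removing the taint by hand looks roughly like this (the node name is a placeholder):

# Clear the uninitialized taint from a single node so pods can schedule there
kubectl taint nodes <node-name> node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-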

Is there a way to modify the helm chart controller with a toleration so it can run on an uninitialized node?

To Reproduce
Launch a new cluster configured for an external cloud provider and try to install that cloud provider with a helm chart.

/var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  chart: aws-cloud-controller-manager
  repo: http://charts.jgreat.me
  version: 0.0.0-20200508.T071542
  targetNamespace: kube-system

Expected behavior
The helm controller would launch the included chart.

Actual behavior
The helm controller pods didn't tolerate the node.cloudprovider.kubernetes.io/uninitialized: true taint.

Additional context / logs

➜ kubectl -n kube-system describe pods helm-install-aws-cloud-controller-manager-nd5cm
Name:         helm-install-aws-cloud-controller-manager-nd5cm
Namespace:    kube-system
Priority:     0
Node:         ip-172-20-61-193.us-east-2.compute.internal/172.20.61.193
Start Time:   Mon, 18 May 2020 15:51:20 -0500
Labels:       controller-uid=0e7af8c6-7bab-4a40-bb49-f97f782a1dd1
              helmcharts.helm.cattle.io/chart=aws-cloud-controller-manager
              job-name=helm-install-aws-cloud-controller-manager
Annotations:  <none>
Status:       Running
IP:           10.42.1.2
IPs:
  IP:           10.42.1.2
Controlled By:  Job/helm-install-aws-cloud-controller-manager
Containers:
  helm:
    Container ID:  containerd://25af9567c5dc8692f6f812530b2dfcb2602e57600bfcee72284e23b9a8adb9e6
    Image:         rancher/klipper-helm:v0.2.5
    Image ID:      docker.io/rancher/klipper-helm@sha256:b694f931ffb70c4e0b6aedf69171936cad98e79a5df49372f0e553d7d610062d
    Port:          <none>
    Host Port:     <none>
    Args:
      install
      --namespace
      kube-system
      --repo
      http://charts.jgreat.me
      --version
      0.0.0-20200508.T071542
    State:          Running
      Started:      Mon, 18 May 2020 15:51:24 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      NAME:             aws-cloud-controller-manager
      VERSION:          0.0.0-20200508.T071542
      REPO:             http://charts.jgreat.me
      VALUES_HASH:      e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
      HELM_DRIVER:      secret
      CHART_NAMESPACE:  kube-system
      CHART:            aws-cloud-controller-manager
      HELM_VERSION:     
      NO_PROXY:         ,10.42.0.0/16,10.43.0.0/16
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from helm-aws-cloud-controller-manager-token-rcgx9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  helm-aws-cloud-controller-manager-token-rcgx9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  helm-aws-cloud-controller-manager-token-rcgx9
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From                                                  Message
  ----     ------            ----       ----                                                  -------
  Warning  FailedScheduling  <unknown>  default-scheduler                                     0/2 nodes are available: 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                     0/2 nodes are available: 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.

gz#11440

@brandond (Member)

Are you using the integrated helm controller, or have you deployed your own?

jgreat (Author) commented May 18, 2020

Integrated

brandond (Member) commented May 18, 2020

FWIW, the helm-install pods come from jobs that are generated by the helm controller which lives in a different repo:
https://github.com/rancher/helm-controller/blob/master/pkg/helm/controller.go#L158

I don't see any way to inject tolerations into the job spec. Your best bet at the moment would probably be to simply patch the helm-install pod spec and add an unschedulable toleration.

Either that or you could save yourself some heartache and bypass the helm controller entirely. Use helm template to render the chart locally and then drop the output from that into the k3s manifests directory.
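
For illustration, a rough sketch of those two approaches (assuming Helm 3; the pod name is taken from the describe output above, and the chart/repo are the ones from this issue):

# Approach 1: add a toleration for the uninitialized taint to the pending helm-install pod
# (the Kubernetes API allows adding tolerations to an existing pod, but not removing them)
kubectl -n kube-system patch pod helm-install-aws-cloud-controller-manager-nd5cm \
  --type=json \
  -p='[{"op":"add","path":"/spec/tolerations/-","value":{"key":"node.cloudprovider.kubernetes.io/uninitialized","operator":"Exists","effect":"NoSchedule"}}]'

# Approach 2: bypass the helm controller entirely; render the chart locally
# and drop the result into the k3s manifests directory as a static manifest
helm template aws-cloud-controller-manager aws-cloud-controller-manager \
  --repo http://charts.jgreat.me \
  --version 0.0.0-20200508.T071542 \
  --namespace kube-system \
  > /var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml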

jgreat (Author) commented May 19, 2020

Ok, that's good information. Thanks for digging into that. I was hoping there was an easy config change on the k3s side, but it sounds like the helm controller is creating the pods and would have to be updated to add the toleration. I'll put in an issue over there and maybe take a crack at a PR.

Can you think of any reason why it would be a bad idea to add the toleration to the helm jobs/pods?

My workaround for the moment is to just use the static manifest, but the bytes start to add up when you put the manifests into cloud-init to bootstrap your instances, and I don't think static manifests are a good long-term solution for maintaining an app.

I can use helm remotely to apply the chart, but Terraform Cloud and GitLab's CI/CD won't have direct access to the kube-api endpoints (not exposed to the public internet), so even if I could get the kubeconfig off the cluster, that's out.

I could install helm on my master node and apply from there as part of the instance build, but that seems kinda silly if helm chart functionality is built into k3s. Still might be better than maintaining static manifests 🤷

The goal of all this is to bootstrap k3s with the AWS cloud provider and then manage the rest of the config with Rancher. Unfortunately, the Rancher agents won't run until the node.cloudprovider taints are removed, so I can't manage the cloud provider through that interface.

@brandond (Member)

You could always stick the rendered manifest on S3 somewhere and then just curl it down from the instance bootstrap script? I agree it's not great, but my experience with the embedded helm controller is that it's kind of fragile and doesn't work well for much more than a one-time initial load of fairly simple charts.
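
For example, something along these lines in the instance bootstrap script (bucket and object name are placeholders, and the object needs to be readable from the instance):

# Pull a pre-rendered manifest down from S3 into the k3s manifests directory
curl -sfL https://<your-bucket>.s3.amazonaws.com/aws-ccm-rendered.yaml \
  -o /var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml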

@cjellick (Contributor)

We could address this by utilizing the "bootstrap" flag that is now in helm-controller.
Here is the PR that introduced that flag: https://github.com/rancher/helm-controller/pull/48/files
But it needs to be modified to tolerate the uninitialized taint as well.

Here is an example of its usage:
https://github.com/rancher/rke2/blob/master/manifests/canal.yml#L8

@mkoperator

While not directly related to this issue, you may run into another problem after solving the one above, either with the workaround or after the fix. When adding worker nodes that will be registered automatically as targets by the ELB, each node must have a valid ProviderID. This can be added to the script that joins each node to the cluster. More info in this issue: #2083
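
For example, mirroring the kubelet flags from the server command in this issue, the agent join script could look roughly like this (the server URL and token are placeholders):

provider_id="$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

k3s agent \
  --server https://<server-or-elb-dns>:6443 \
  --token "<agent-token>" \
  --node-name="$(hostname -f)" \
  --kubelet-arg="cloud-provider=external" \
  --kubelet-arg="provider-id=aws:///$provider_id"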

@cjellick (Contributor)

This got accidentally auto-closed. Reopening.

@rancher-max (Contributor)

Validated using Commit ID: da3e26a624e5b95fca7970d1b209326f9474b073

  • The helm-install-aws-cloud-controller-manager pod finishes and is successfully in Completed state. See steps below.
  • Note the addition of the bootstrap: true flag in the HelmChart config. We probably need to document this somewhere. The other changes from the steps mentioned above are just for simplification and a single-node install.
  • Also, after creating the instance in AWS, you need to attach an IAM role to it that is valid for using cloud providers, and add the tag kubernetes.io/cluster/<clusterId> = <owned | shared>, where clusterId is either given by Rancher if you imported this cluster there, or can be anything you desire. (I chose the tag kubernetes.io/cluster/maxk3s=owned as an example.)
provider_id="$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=da3e26a624e5b95fca7970d1b209326f9474b073 INSTALL_K3S_EXEC="server \
  --disable-cloud-controller \
  --disable servicelb \
  --disable traefik \
  --node-name="$(hostname -f)" \
  --kubelet-arg="cloud-provider=external" \
  --kubelet-arg="provider-id=aws:///$provider_id" \
  --write-kubeconfig-mode=644" sh -

sudo bash -c "cat > /var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml << EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  chart: aws-cloud-controller-manager
  repo: http://charts.jgreat.me
  version: 0.0.0-20200508.T071542
  targetNamespace: kube-system
  bootstrap: true
EOF
"

If you don't do the necessary cloud-provider steps on the AWS instance itself, this issue is still fixed, but you will notice a CrashLoopBackOff on the resulting aws-cloud-controller-manager pod, and the other pods will be stuck in Pending until that is resolved.

  • After doing all the steps mentioned above, I validated that the cloud provider is working as expected by deploying the following and verifying that an ELB was created in my AWS region with the tags kubernetes.io/service-name=default/hello and kubernetes.io/cluster/maxk3s=owned:
kind: Service
apiVersion: v1
metadata:
  name: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
    - name: http
      protocol: TCP
      # ELB's port
      port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: hello
          image: nginx

@cjellick (Contributor)

@davidnuzik the Milestone on this issue needs attention. This is currently only fixed on master, which means it would only be in 1.19, not 1.18.

@rancher-max (Contributor)

We will potentially need to verify this with the different releases once they are available.

@davidnuzik (Contributor)

Yes, see #2140 to track the backport into the v1.18 release branch.
