Deploying External Cloud Provider with Helm Controller - Chicken:Egg problem #1807

Closed
jgreat opened this issue May 18, 2020 · 12 comments · Fixed by k3s-io/helm-controller#57

jgreat commented May 18, 2020

Version:
v1.18.2+k3s1

K3s arguments:

provider_id="$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

k3s server \
  --disable-cloud-controller \
  --disable servicelb \
  --disable local-storage \
  --disable traefik \
  --datastore-endpoint=${k3s_db_endpoint} \
  --token="${k3s_server_token}" \
  --agent-token="${k3s_agent_token}" \
  --node-name="$(hostname -f)" \
  --kubelet-arg="cloud-provider=external" \
  --kubelet-arg="provider-id=aws:///$provider_id" \
  --write-kubeconfig-mode=644 \
  --tls-san=${elb_dns}

Describe the bug
I would like to install the AWS CCM (external cloud provider) with a helm chart via the helm chart controller, by placing a HelmChart manifest in /var/lib/rancher/k3s/server/manifests, but the helm chart controller's install pod won't schedule on a new cluster because all of the nodes are tainted with node.cloudprovider.kubernetes.io/uninitialized: true. So I need a cloud provider to install my cloud provider.

I can remove the taint on one of the nodes, but then that node doesn't get properly initialized.
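
For reference, removing the taint by hand looks roughly like this (the node name is a placeholder):

# Clear the uninitialized taint from a single node so pods can schedule there
kubectl taint nodes <node-name> node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-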

Is there a way to modify the helm chart controller with a toleration so it can run on an uninitialized node?

To Reproduce
Launch a new cluster configured for an external cloud provider and try to install that cloud provider with a helm chart.

/var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  chart: aws-cloud-controller-manager
  repo: http://charts.jgreat.me
  version: 0.0.0-20200508.T071542
  targetNamespace: kube-system

Expected behavior
The helm controller would launch the included chart.

Actual behavior
The helm controller pods didn't tolerate the node.cloudprovider.kubernetes.io/uninitialized: true taint.

Additional context / logs

➜ kubectl -n kube-system describe pods helm-install-aws-cloud-controller-manager-nd5cm
Name:         helm-install-aws-cloud-controller-manager-nd5cm
Namespace:    kube-system
Priority:     0
Node:         ip-172-20-61-193.us-east-2.compute.internal/172.20.61.193
Start Time:   Mon, 18 May 2020 15:51:20 -0500
Labels:       controller-uid=0e7af8c6-7bab-4a40-bb49-f97f782a1dd1
              helmcharts.helm.cattle.io/chart=aws-cloud-controller-manager
              job-name=helm-install-aws-cloud-controller-manager
Annotations:  <none>
Status:       Running
IP:           10.42.1.2
IPs:
  IP:           10.42.1.2
Controlled By:  Job/helm-install-aws-cloud-controller-manager
Containers:
  helm:
    Container ID:  containerd://25af9567c5dc8692f6f812530b2dfcb2602e57600bfcee72284e23b9a8adb9e6
    Image:         rancher/klipper-helm:v0.2.5
    Image ID:      docker.io/rancher/klipper-helm@sha256:b694f931ffb70c4e0b6aedf69171936cad98e79a5df49372f0e553d7d610062d
    Port:          <none>
    Host Port:     <none>
    Args:
      install
      --namespace
      kube-system
      --repo
      http://charts.jgreat.me
      --version
      0.0.0-20200508.T071542
    State:          Running
      Started:      Mon, 18 May 2020 15:51:24 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      NAME:             aws-cloud-controller-manager
      VERSION:          0.0.0-20200508.T071542
      REPO:             http://charts.jgreat.me
      VALUES_HASH:      e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
      HELM_DRIVER:      secret
      CHART_NAMESPACE:  kube-system
      CHART:            aws-cloud-controller-manager
      HELM_VERSION:     
      NO_PROXY:         ,10.42.0.0/16,10.43.0.0/16
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from helm-aws-cloud-controller-manager-token-rcgx9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  helm-aws-cloud-controller-manager-token-rcgx9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  helm-aws-cloud-controller-manager-token-rcgx9
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From                                                  Message
  ----     ------            ----       ----                                                  -------
  Warning  FailedScheduling  <unknown>  default-scheduler                                     0/2 nodes are available: 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                     0/2 nodes are available: 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.

gz#11440

@brandond (Member)

Are you using the integrated helm controller, or have you deployed your own?

jgreat (Author) commented May 18, 2020

Integrated

brandond (Member) commented May 18, 2020

FWIW, the helm-install pods come from jobs that are generated by the helm controller which lives in a different repo:
https://github.com/rancher/helm-controller/blob/master/pkg/helm/controller.go#L158

I don't see any way to inject tolerations into the job spec. Your best bet at the moment would probably be to simply patch the helm-install pod spec and add an unschedulable toleration.

Either that or you could save yourself some heartache and bypass the helm controller entirely. Use helm template to render the chart locally and then drop the output from that into the k3s manifests directory.
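
For illustration, a rough sketch of those two approaches (assuming Helm 3; the pod name is taken from the describe output above, and the chart/repo are the ones from this issue):

# Approach 1: add a toleration for the uninitialized taint to the pending helm-install pod
# (the Kubernetes API allows adding tolerations to an existing pod, but not removing them)
kubectl -n kube-system patch pod helm-install-aws-cloud-controller-manager-nd5cm \
  --type=json \
  -p='[{"op":"add","path":"/spec/tolerations/-","value":{"key":"node.cloudprovider.kubernetes.io/uninitialized","operator":"Exists","effect":"NoSchedule"}}]'

# Approach 2: bypass the helm controller entirely; render the chart locally
# and drop the result into the k3s manifests directory as a static manifest
helm template aws-cloud-controller-manager aws-cloud-controller-manager \
  --repo http://charts.jgreat.me \
  --version 0.0.0-20200508.T071542 \
  --namespace kube-system \
  > /var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml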

jgreat (Author) commented May 19, 2020

Ok, that's good information. Thanks for digging into that. I was hoping there was an easy config change on the k3s side, but it sounds like the helm controller is creating the pods and would have to be updated to add the toleration. I'll put in an issue over there and maybe take a crack at a PR.

Can you think of any reason why it would be a bad idea to add the toleration to the helm jobs/pods?

My workaround for the moment is to just use the static manifest, but the bytes start to add up when you put the manifests into cloud-init to bootstrap your instances, and I don't think static manifests are a good long-term solution for maintaining an app.

I can use helm remotely to apply the chart, but Terraform Cloud and GitLab's CI/CD won't have direct access to the kube-api endpoints (not exposed to the public internet), so even if I could get the kubeconfig off the cluster, that's out.

I could install helm on my master node and apply from there as part of the instance build, but that seems kinda silly if helm chart functionality is built into k3s. Still might be better than maintaining static manifests 🤷

The goal of all this is to bootstrap k3s with the AWS cloud provider and then manage the rest of the config with Rancher. Unfortunately, the Rancher agents won't run until the node.cloudprovider taints are removed, so I can't manage the cloud provider through that interface.

@brandond (Member)

You could always stick the rendered manifest on S3 somewhere and then just curl it down from the instance bootstrap script? I agree it's not great, but my experience with the embedded helm controller is that it's kind of fragile and doesn't work well for much more than a one-time initial load of fairly simple charts.
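
For example, something along these lines in the instance bootstrap script (bucket and object name are placeholders, and the object needs to be readable from the instance):

# Pull a pre-rendered manifest down from S3 into the k3s manifests directory
curl -sfL https://<your-bucket>.s3.amazonaws.com/aws-ccm-rendered.yaml \
  -o /var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml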

@cjellick (Contributor)

We could address this by utilizing the "bootstrap" flag that is now in helm-controller.
Here is the PR that introduced that flag: https://github.com/rancher/helm-controller/pull/48/files
But it needs to be modified to tolerate the uninitialized taint as well.

Here is an example of its usage:
https://github.com/rancher/rke2/blob/master/manifests/canal.yml#L8

@mkoperator

While not directly related to this issue, you may run into another problem after solving the one above, either with the workaround or after the fix. When adding worker nodes that will be registered automatically as targets by the ELB, each node must have a valid ProviderID. This can be added to the script that joins each node to the cluster. More info in this issue: #2083
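
For example, mirroring the kubelet flags from the server command in this issue, the agent join script could look roughly like this (the server URL and token are placeholders):

provider_id="$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

k3s agent \
  --server https://<server-or-elb-dns>:6443 \
  --token "<agent-token>" \
  --node-name="$(hostname -f)" \
  --kubelet-arg="cloud-provider=external" \
  --kubelet-arg="provider-id=aws:///$provider_id"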

@cjellick (Contributor)

This got accidentally auto-closed. Reopening.

@rancher-max (Contributor)

Validated using Commit ID: da3e26a624e5b95fca7970d1b209326f9474b073

  • The helm-install-aws-cloud-controller-manager pod finishes and is successfully in Completed state. See steps below.
  • Note the addition of the bootstrap: true flag in the HelmChart config. We probably need to document this somewhere. The other changes from the steps mentioned above are just for simplification and a single-node install.
  • Also, after creating the instance in AWS, you need to attach an IAM role to it that is valid for using cloud providers, and add the tag kubernetes.io/cluster/<clusterId> = <owned | shared>, where clusterId is either given by Rancher if you imported this cluster there, or can be anything you desire. (I chose the tag kubernetes.io/cluster/maxk3s=owned as an example.)
provider_id="$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=da3e26a624e5b95fca7970d1b209326f9474b073 INSTALL_K3S_EXEC="server \
  --disable-cloud-controller \
  --disable servicelb \
  --disable traefik \
  --node-name="$(hostname -f)" \
  --kubelet-arg="cloud-provider=external" \
  --kubelet-arg="provider-id=aws:///$provider_id" \
  --write-kubeconfig-mode=644" sh -

sudo bash -c "cat > /var/lib/rancher/k3s/server/manifests/00-aws-ccm.yaml << EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  chart: aws-cloud-controller-manager
  repo: http://charts.jgreat.me
  version: 0.0.0-20200508.T071542
  targetNamespace: kube-system
  bootstrap: true
EOF
"

If you don't do the necessary cloud-provider steps on the AWS instance itself, this issue is still fixed, but you will notice a CrashLoopBackOff on the resulting aws-cloud-controller-manager pod, and the other pods will be stuck in Pending until that is resolved.

  • After doing all the steps mentioned above, I validated that the cloud provider is working as expected by deploying the following and verifying that an ELB was created in my AWS region with the tags kubernetes.io/service-name=default/hello and kubernetes.io/cluster/maxk3s=owned:
kind: Service
apiVersion: v1
metadata:
  name: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
    - name: http
      protocol: TCP
      # ELB's port
      port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: hello
          image: nginx

@cjellick (Contributor)

@davidnuzik the Milestone on this issue needs attention. This is currently only fixed on master, which means it would only be in 1.19, not 1.18.

@rancher-max (Contributor)

We will potentially need to verify this with the different releases once they are available.

@davidnuzik (Contributor)

Yes, see #2140 to track the backport into the v1.18 release branch.
