
[aws-eks] k8s resources cannot be updated with EndpointAccess.PRIVATE. #10036

Closed
chiyiliao opened this issue Aug 28, 2020 · 10 comments
Labels: @aws-cdk/aws-eks · closed-for-staleness · guidance · response-requested

Comments

@chiyiliao

Kubernetes resources created by add_manifest() are not updated after the CDK code is changed.
This only happens when endpoint_access is set to EndpointAccess.PRIVATE.

Reproduction Steps

  1. Create a cluster as below (in a VPC without a NAT gateway; sg, vpc and subnet_selection are defined elsewhere):

     cluster = eks.Cluster(self, "test-eks",
         version=eks.KubernetesVersion.V1_16,
         endpoint_access=eks.EndpointAccess.PRIVATE,
         security_group=sg, 
         vpc=vpc, 
         vpc_subnets=[subnet_selection])
    
     cluster.add_manifest("role",
         {   
             "apiVersion": "rbac.authorization.k8s.io/v1",
             "kind": "ClusterRole",
             "metadata": {
                 "name": "test-role",
                 "namespace": "default"
             },
             "rules": [
                 {   
                     "apiGroups": [ "" ],
                     "resources": [ "pods", "jobs"],
                     "verbs": [ "get", "list", "watch", "delete"]
                 }, {
                     "apiGroups": [ "batch" ],
                     "resources": [ "pods", "jobs"],
                     "verbs": [ "get", "list", "watch", "delete"]
                 }
             ]
         })
    
  2. Deploy it

  3. Modify the add_manifest call as below (just remove the "delete" verb from the rules):

     cluster.add_manifest("role",
         {   
             "apiVersion": "rbac.authorization.k8s.io/v1",
             "kind": "ClusterRole",
             "metadata": {
                 "name": "test-role",
                 "namespace": "default"
             },
             "rules": [
                 {   
                     "apiGroups": [ "" ],
                     "resources": [ "pods", "jobs"],
                     "verbs": [ "get", "list", "watch"]
                 }, {
                     "apiGroups": [ "batch" ],
                     "resources": [ "pods", "jobs"],
                     "verbs": [ "get", "list", "watch"]
                 }
             ]
         })
    
  4. Deploy it

What did you expect to happen?

"kubectl get ClusterRole test-role -o yaml" command should return verbs without "delete"

What actually happened?

"kubectl get ClusterRole test-role -o yaml" command return verbs with "delete"

But "delete" does not exist in cloudformation, it seems like cloudformation updated successful, but the action of kubectl was not successful.

Environment

  • CLI Version: 1.59.0
  • Framework Version: 1.59.0
  • Node.js Version: v12.14.0
  • OS: Darwin 18.7.0
  • Language (Version): Python 3.7.5

Other

It seems the lambda function xxxxxxx-awscdkawseksKubectlPr-Handlerxxxxxxx does not run inside the VPC,
so it cannot connect to the API server endpoint.
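
A minimal boto3 sketch for checking this (the function name below is just the placeholder from above): if VpcConfig comes back empty, the handler is not attached to the VPC.

    # Minimal sketch: check whether the kubectl handler is attached to the VPC.
    # The function name is a placeholder; substitute the real handler name.
    import boto3

    client = boto3.client("lambda")
    config = client.get_function_configuration(
        FunctionName="xxxxxxx-awscdkawseksKubectlPr-Handlerxxxxxxx")
    print(config.get("VpcConfig", {}))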


This is a 🐛 Bug Report

@iliapolo
Contributor

Hi @chiyiliao - thanks for reaching out. Unfortunately I am unable to reproduce this.

You mentioned:

It seems the lambda function xxxxxxx-awscdkawseksKubectlPr-Handlerxxxxxxx does not run inside the VPC,
so it cannot connect to the API server endpoint.

Are you sure the lambda isn't connected to the VPC? We have added explicit logic to connect the handler to the VPC in case EndpointAccess.PRIVATE is used.

Also, if the handler really weren't connected and couldn't access the cluster endpoint, your initial deployment should have failed with a timeout error as well, as would any subsequent ones.

Can you share exactly how you are running kubectl get? If the endpoint is private, are you running it inside the VPC (from a lambda of some sort)?

Could you please attach the handler logs? You can find them by navigating, in the console, to the lambda whose description is "onEvent handler for EKS kubectl resource provider" and clicking "View logs in CloudWatch" in the Monitoring tab.
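
If pulling the logs through the console is awkward, a boto3 sketch along these lines (the log group name is an assumption derived from the function name in the report) should fetch the latest events:

    # Minimal sketch: print the most recent events from the handler's log group.
    # The log group name is an assumption; substitute the real handler name.
    import boto3

    logs = boto3.client("logs")
    group = "/aws/lambda/xxxxxxx-awscdkawseksKubectlPr-Handlerxxxxxxx"
    streams = logs.describe_log_streams(
        logGroupName=group, orderBy="LastEventTime", descending=True, limit=1)
    for stream in streams["logStreams"]:
        events = logs.get_log_events(
            logGroupName=group, logStreamName=stream["logStreamName"], limit=50)
        for event in events["events"]:
            print(event["message"])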

Thanks!

@chiyiliao
Author

I might have confused some steps and CDK versions during my testing.
You are right, the initial deployment failed with the latest CDK version; I can see the following event in the CloudFormation console:

Failed to create resource. Error: Command '['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::xxxxxxxxxxxxxxx:role/test-eks', '--name', 'test-eks', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255. Logs: /aws/lambda/test-eks-awscdkawseksKubectlPr-Handler886CB40B-23193CKFEATKS at invokeUserFunction (/var/task/framework.js:95:19) at process._tickCallback (internal/process/next_tick.js:68:7)

I did another test.
I created the stack with EndpointAccess.PUBLIC_AND_PRIVATE, then modified the code to use EndpointAccess.PRIVATE and updated it.
After the stack was updated, I changed the ClusterRole manifest (removed the 'delete' verb) and updated the stack again; it failed with the error below:

9:43:45 | UPDATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | test-eks...g/Resource/Default
Failed to update resource. Error: Command '['aws', 'eks', 'update-kubeconfig',
9:43:45 | UPDATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | dev01eksenvcluster...olebinding6E92C2AA
Failed to update resource. Error: Command '['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::xxxxxxxxxxxx:role/test-eks-CreationRole', '--name', 'test-eks', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.

There is no NAT in the VPC, so the lambda function cannot reach internet IPs. Could this limitation be causing the problem?

@iliapolo
Contributor

iliapolo commented Sep 1, 2020

@chiyiliao Just to make sure I understand, you are saying that:

  1. When you deploy the cluster with PRIVATE - it fails.
  2. When you deploy the cluster with PUBLIC_AND_PRIVATE - it works.
  3. When you update the cluster to use PRIVATE - it fails again.

Is that correct?

@chiyiliao
Author

My scenario is as below:

  1. When I deploy the cluster with PRIVATE - it fails.
  2. When I deploy the cluster with PUBLIC_AND_PRIVATE - it works.
  3. When I just update the cluster to use PRIVATE - the stack updates successfully.
  4. When I change the manifest of the ClusterRole - it fails.

@iliapolo
Contributor

iliapolo commented Sep 2, 2020

@chiyiliao This definitely seems to be related to the fact that your VPC doesn't have a NAT, since the KubectlProvider lambda function that runs kubectl needs internet access to reach the EKS service API.

However, that doesn't explain why this works when you use PUBLIC_AND_PRIVATE, so I'm still not sure exactly what's happening, but I'm investigating.

Is your VPC created with CDK? Mind sharing the code you use to create it? Are you using a VpcEndpoint?
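
For comparison, a minimal sketch (assuming it lives inside a Stack; construct ids, CIDR and subnet layout are illustrative) of a CDK-defined VPC whose private subnets do route out through a NAT gateway, which is what the kubectl handler needs here:

    # Minimal sketch of a VPC with a NAT gateway so that lambdas placed in the
    # private subnets can reach the EKS service API (ids/CIDRs are illustrative).
    from aws_cdk import aws_ec2 as ec2

    vpc = ec2.Vpc(self, "eks-vpc",
        cidr="10.0.0.0/16",
        nat_gateways=1,
        subnet_configuration=[
            ec2.SubnetConfiguration(
                name="public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24),
            ec2.SubnetConfiguration(
                name="private", subnet_type=ec2.SubnetType.PRIVATE, cidr_mask=24),
        ])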

@chiyiliao
Author

No, my VPC was not created by CDK.
Yes, there are some VPC endpoints in the VPC, but no EKS VPC endpoint, since one was not available in AWS.

@include

include commented Nov 29, 2020

Hi,

I am seeing weird behavior as well, but reading the comment above... is a NAT mandatory?
My cluster must use endpointAccess: eks.EndpointAccess.PRIVATE, and I have tagged my subnets with internal-elb only.
My network architecture uses a transit gateway to which all traffic is sent (to a separate shared account where traffic from all my accounts is filtered, etc.).

Cheers,
F

@iliapolo
Contributor

To those interested in this issue, please have a look at the issue I've created that explains the internet requirements of the cluster VPC. If needed, let's continue the discussion over there.

This issue will be closed soon.

@github-actions

github-actions bot commented Jan 6, 2021

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@solovievv

solovievv commented Apr 28, 2021

A more detailed description of the issue.

It is not possible to define the kubectl subnets and security group directly in the eks.Cluster constructor. However, the custom resource lambda function requires the subnets and security group:
https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-eks/lib/kubectl-provider.ts

    const handler = new lambda.Function(this, 'Handler', {
      code: lambda.Code.fromAsset(path.join(__dirname, 'kubectl-handler')),
      runtime: lambda.Runtime.PYTHON_3_7,
      handler: 'index.handler',
      timeout: Duration.minutes(15),
      description: 'onEvent handler for EKS kubectl resource provider',
      memorySize,
      environment: cluster.kubectlEnvironment,

      // defined only when using private access
      vpc: cluster.kubectlPrivateSubnets ? cluster.vpc : undefined,
      securityGroups: cluster.kubectlSecurityGroup ? [cluster.kubectlSecurityGroup] : undefined,
      vpcSubnets: cluster.kubectlPrivateSubnets ? { subnets: cluster.kubectlPrivateSubnets } : undefined,
    });

Solution.

The kubectl subnets and security group can be defined using eks.Cluster.from_cluster_attributes(...):

    eks_cluster = eks.Cluster.from_cluster_attributes(
        self, "eks-cluster",
        cluster_name=eks_cluster_name,
        vpc=vpc,
        kubectl_role_arn=settings['role_eks_cluster'].role_arn,
        kubectl_security_group_id=sg_eks_id,
        kubectl_private_subnet_ids=subnets_ids
    )
    eks_cluster.add_manifest(...)
