
Helm deployment on Amazon EKS: values.yml server.serviceaccount.annotations AWS IAM Role not working as expected for KMS auto-unseal #9576

Closed
gargana opened this issue Jul 22, 2020 · 8 comments
Labels
bug, core/seal

Comments


gargana commented Jul 22, 2020

Describe the bug
I am deploying Vault using the Helm chart and Helm v3.
To allow Vault to access the AWS KMS key required for auto-unseal, I give the service account an IAM role via annotations in values.yml.
e.g.:

server:
  serviceAccount:
    annotations: |
       eks.amazonaws.com/role-arn: "arn:aws:iam::8888XXXXXXXX:role/VaultIAMRole"

This appears on the vault pods:

$ kubectl get pods/vault-0 -o yaml   # see below for more complete output
...
...
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::8888XXXXXXXX:role/VaultIAMRole
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
...
... 

The vault-X pods fail with the following:

$ kubectl logs pod/vault-0
Error parsing Seal configuration: error fetching AWS KMS wrapping key information: 
AccessDeniedException: User:
 arn:aws:sts::8888XXXXXXXX:assumed-role/tCaT-quickstart-amazon-eks-gargan-NodeInstanceRole-XXXXXXXXXXX/i-083XXXXXXXXXXX
 is not authorized to perform: kms:DescribeKey on resource: 
arn:aws:kms:eu-west-1:8888XXXXXXXX:key/e4384c07-5f11-4372-bce0-XXXXXXXX
	status code: 400, request id: 74f9397a-0925-4f98-bf5c-XXXXXXXXXX

What is happening here is that Vault is using the Instance Profile IAM Role and NOT the one provided to the serviceaccount via the values.yaml.
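
One quick way to confirm which identity made the failing call is to pull the principal out of the error text. The following is a minimal sketch, assuming the message contains an STS assumed-role ARN in the usual `arn:aws:sts::<account>:assumed-role/<role>/<session>` form. The helper names `principal_from_error` and `used_expected_role` are hypothetical and are not part of Vault or any AWS tool.

```python
import re

# Matches the assumed-role ARN that AWS reports in an AccessDeniedException message.
ASSUMED_ROLE_RE = re.compile(
    r"arn:aws:sts::(?P<account>[^:]+):assumed-role/(?P<role>[^/\s]+)/(?P<session>\S+)"
)


def principal_from_error(message):
    """Return (role_name, session_name) of the caller named in the error, or None."""
    match = ASSUMED_ROLE_RE.search(message)
    if match is None:
        return None
    return match.group("role"), match.group("session")


def used_expected_role(message, expected_role_arn):
    """True if the caller's role name equals the role name in expected_role_arn."""
    principal = principal_from_error(message)
    if principal is None:
        return False
    # An IAM role ARN ends in ".../<role-name>"; compare only the name part.
    expected_name = expected_role_arn.rsplit("/", 1)[-1]
    return principal[0] == expected_name
```

Fed the log output above, this would report the NodeInstanceRole and an instance ID as the session name, which is the signature of instance-profile credentials rather than the annotated role.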

When looking at the code for vault here:

roleARN := os.Getenv("AWS_ROLE_ARN")
tokenPath := os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE")
sessionName := os.Getenv("AWS_ROLE_SESSION_NAME")
if roleARN != "" && tokenPath != "" {
	// this session is only created to create the WebIdentityRoleProvider, as the env variables are already there
	// this automatically assumes the role, but the provider needs to be added to the chain
	sess, err := session.NewSession()
	if err != nil {
		return nil, errors.Wrap(err, "error creating a new session to create a WebIdentityRoleProvider")
	}
	// Add the web identity role credential provider
	providers = append(providers, stscreds.NewWebIdentityRoleProvider(sts.New(sess), roleARN, sessionName, tokenPath))
}

It appears that it should add this credential to the Credential Chain.

To isolate the issue further I removed access to the metadata of the underlying instance so the instance role could not be reached by following the instructions here: https://docs.aws.amazon.com/eks/latest/userguide/restrict-ec2-credential-access.html

This resulted in an error stating that no credentials could be loaded.

To Reproduce

  1. Create an AWS KMS key.
  2. Create an AWS IAM role and an AWS IAM policy, linked to that role, which provide access to the KMS key created in step 1.
  3. Add the annotation for server.serviceAccount in values.yaml:
server:
  serviceAccount:
    annotations: |
      eks.amazonaws.com/role-arn: "arn:aws:iam::XXXXXXXXX:role/VaultIAMRole-XXXXXXXX"
  4. Install the chart with the modified values.yaml file created in step 3.
  5. Run kubectl logs vault-0

Expected behavior
Vault should use the IAM Role provided by the annotation to perform the KMS auto-unseal and not use the Instance IAM Role

Environment

  • Kubernetes version: 1.16
    • Distribution or cloud vendor: Amazon EKS
  • vault-helm version: latest

Chart values:

USER-SUPPLIED VALUES:
server:
  extraEnvironmentVars:
    AWS_ROLE_SESSION_NAME: some_name
  ha:
    enabled: true
    nodes: 5
    raft:
      config: |
        ui = true

        listener "tcp" {
          tls_disable = 1
          address = "[::]:8200"
          cluster_address = "[::]:8201"
        }

        storage "raft" {
          path    = "/vault/data"
        }

        service_registration "kubernetes" {}

        seal "awskms" {
          region     = "eu-west-1"
          kms_key_id = "e2cbe661-4624-4278-XXXXXXXXXXXXX"
        }
      enabled: true
      setNodeId: true
  image:
    repository: vault
    tag: 1.4.3
  logLevel: debug
  serviceAccount:
    annotations: |
      eks.amazonaws.com/role-arn: "arn:aws:iam::8888XXXXXXXX:role/VaultIAMRole-XXXXXXXXX"
ui:
  enabled: true

Note: I tested with the 1.4.3 and 1.5.0 images. I included AWS_ROLE_SESSION_NAME: some_name so as not to run afoul of this issue: #8844

kubectl get pods vault-0 -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2020-07-22T22:44:02Z"
  generateName: vault-
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/name: vault
    component: server
    controller-revision-hash: vault-7d6fd78588
    helm.sh/chart: vault-0.6.0
    statefulset.kubernetes.io/pod-name: vault-0
  name: vault-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: vault
    uid: afbf95db-a3a4-4bc2-ac66-12786a780b5e
  resourceVersion: "394358"
  selfLink: /api/v1/namespaces/default/pods/vault-0
  uid: 1d657ddf-1e5c-4364-a001-54f14ed501ca
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: vault
            app.kubernetes.io/name: vault
            component: server
        topologyKey: kubernetes.io/hostname
  containers:
  - args:
    - "sed -E \"s/HOST_IP/${HOST_IP?}/g\" /vault/config/extraconfig-from-values.hcl
      > /tmp/storageconfig.hcl;\nsed -Ei \"s/POD_IP/${POD_IP?}/g\" /tmp/storageconfig.hcl;\n/usr/local/bin/docker-entrypoint.sh
      vault server -config=/tmp/storageconfig.hcl \n"
    command:
    - /bin/sh
    - -ec
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: VAULT_K8S_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: VAULT_K8S_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: VAULT_ADDR
      value: http://127.0.0.1:8200
    - name: VAULT_API_ADDR
      value: http://$(POD_IP):8200
    - name: SKIP_CHOWN
      value: "true"
    - name: SKIP_SETCAP
      value: "true"
    - name: HOSTNAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: VAULT_CLUSTER_ADDR
      value: https://$(HOSTNAME).vault-internal:8201
    - name: VAULT_RAFT_NODE_ID
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: HOME
      value: /home/vault
    - name: AWS_ROLE_SESSION_NAME
      value: some_name
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::8888XXXXXXXX:role/VaultIAMRole-XXXXXXXXXX
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: vault:1.4.3
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - sleep 5 && kill -SIGTERM $(pidof vault)
    name: vault
    ports:
    - containerPort: 8200
      name: http
      protocol: TCP
    - containerPort: 8201
      name: https-internal
      protocol: TCP
    - containerPort: 8202
      name: http-rep
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - vault status -tls-skip-verify
      failureThreshold: 2
      initialDelaySeconds: 5
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 5
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /vault/data
      name: data
    - mountPath: /vault/config
      name: config
    - mountPath: /home/vault
      name: home
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: vault-token-5p8fw
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: vault-0
  nodeName: ip-10-0-62-242.eu-west-1.compute.internal
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsNonRoot: true
    runAsUser: 100
  serviceAccount: vault
  serviceAccountName: vault
  subdomain: vault-internal
  terminationGracePeriodSeconds: 10
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - name: data
    persistentVolumeClaim:
      claimName: data-vault-0
  - configMap:
      defaultMode: 420
      name: vault-config
    name: config
  - emptyDir: {}
    name: home
  - name: vault-token-5p8fw
    secret:
      defaultMode: 420
      secretName: vault-token-5p8fw
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-07-22T22:44:02Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-07-22T22:44:02Z"
    message: 'containers with unready status: [vault]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-07-22T22:44:02Z"
    message: 'containers with unready status: [vault]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-07-22T22:44:02Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://1e77c80a77104f5776c2b6dd13668530c82e8fd1e8f7dc7f83da21cc54043b23
    image: vault:1.4.3
    imageID: docker-pullable://vault@sha256:be3a63905b4d78d6e7c24fc9aacb30621d7a8a1ad8e8baff286a4ba110db16e8
    lastState:
      terminated:
        containerID: docker://dba473a91e8832975f3ece268cd2504d635897f922dabe5d4413e841c0123faa
        exitCode: 1
        finishedAt: "2020-07-22T22:56:03Z"
        reason: Error
        startedAt: "2020-07-22T22:53:58Z"
    name: vault
    ready: false
    restartCount: 5
    started: true
    state:
      running:
        startedAt: "2020-07-22T22:57:35Z"
  hostIP: 10.0.62.242
  phase: Running
  podIP: 10.0.51.106
  podIPs:
  - ip: 10.0.51.106
  qosClass: BestEffort
  startTime: "2020-07-22T22:44:02Z"

Additional context
Please let me know if there is more info you would like me to present

@gargana gargana changed the title from "Helm deployment on EKS: values.yml server.serviceaccount.annotations AWS IAM Role not working as expected" to "Helm deployment on Amazon EKS: values.yml server.serviceaccount.annotations AWS IAM Role not working as expected for KMS auto-unseal" Jul 22, 2020
@tvoran tvoran transferred this issue from hashicorp/vault-helm Jul 23, 2020
@tvoran tvoran added the core/seal and bug labels Jul 23, 2020
@damscott

Are you using the 1.5.0-rc image? I encountered a similar problem (EKS 1.16, vault-helm 0.6.0, IAM Roles for Service Accounts) where Vault was using the instance profile role. I was using the 1.5.0-rc image because the 1.5.0 image wasn't available on Docker Hub yet. When I switched to the 1.5.0 image Vault was able to use the role from the ServiceAccount annotation.

Unrelated Quality of Life Improvement Tip: You can use a KMS alias instead of putting the Key ID in your config:

seal "awskms" {
  region     = "eu-west-1"
  kms_key_id = "alias/vault-seal"
}

@gargana
Author

gargana commented Jul 23, 2020

Are you using the 1.5.0-rc image? I encountered a similar problem (EKS 1.16, vault-helm 0.6.0, IAM Roles for Service Accounts) where Vault was using the instance profile role. I was using the 1.5.0-rc image because the 1.5.0 image wasn't available on Docker Hub yet. When I switched to the 1.5.0 image Vault was able to use the role from the ServiceAccount annotation.

Unrelated Quality of Life Improvement Tip: You can use a KMS alias instead of putting the Key ID in your config:

seal "awskms" {
  region     = "eu-west-1"
  kms_key_id = "alias/vault-seal"
}

I tested using the 1.5.0 and 1.4.3 images specifically.

Thanks for the tip on the QOL key alias.

I did test the same role on a service account on the same cluster and was able to get the intended role
eg:

$ aws sts get-caller-identity
 {
    "Account": "8888XXXXXXXX", 
    "UserId": "AROA453FGXXXXXXXXXX:botocore-session-XXXXXXXXX", 
    "Arn": "arn:aws:sts::8888XXXXXXXX:assumed-role/VaultIAMRole-XXXXXXXX/botocore-session-XXXXXXXXX"
}

@hashicorp hashicorp deleted a comment Jul 23, 2020
@tvoran
Member

tvoran commented Jul 24, 2020

Are there some steps missing in your setup? I ask because following these instructions works for me with 1.5 and 1.4.3 (w/setting AWS_ROLE_SESSION_NAME): https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/

Specifically I think the eksctl command for adding an iam oidc provider and using eksctl to create the iam role might be setting up the missing pieces. For instance, I wasn't able to run aws sts assume-role-with-web-identity in a pod until I had done those:

aws sts assume-role-with-web-identity \
  --role-arn $AWS_ROLE_ARN \
  --web-identity-token file://$AWS_WEB_IDENTITY_TOKEN_FILE \
  --role-session-name test-session
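
For readers who want to see what that CLI call amounts to, here is a minimal sketch of the request it sends, assuming the STS query API with `Action=AssumeRoleWithWebIdentity` and API version `2011-06-15`. The call is authenticated by the web identity token itself rather than by AWS credentials. The helper names `irsa_settings` and `build_web_identity_request` are hypothetical, and the sketch only builds the request body; it does not send it.

```python
from urllib.parse import urlencode

# Global STS endpoint; regional endpoints also exist.
STS_ENDPOINT = "https://sts.amazonaws.com/"


def irsa_settings(environ):
    """Return (role_arn, token_path, session_name), or None if the variables are missing."""
    role_arn = environ.get("AWS_ROLE_ARN", "")
    token_path = environ.get("AWS_WEB_IDENTITY_TOKEN_FILE", "")
    if not role_arn or not token_path:
        return None
    # The default session name here is an arbitrary placeholder.
    return role_arn, token_path, environ.get("AWS_ROLE_SESSION_NAME", "test-session")


def build_web_identity_request(role_arn, token, session_name):
    """Return the form-encoded body for an AssumeRoleWithWebIdentity call."""
    params = {
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": token,
    }
    return urlencode(params)
```

In a pod, `irsa_settings(os.environ)` returning `None` would suggest the service account annotation was never picked up, whereas a rejected request with valid settings points at the role's trust policy or the OIDC provider.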

@biswa-r-singh

biswa-r-singh commented Jul 26, 2020

So annoyed. Seriously, this role switch does not work no matter what I try. I spent the whole day doing exactly what was suggested, and more.

@gargana
Author

gargana commented Jul 27, 2020

Are there some steps missing in your setup? I ask because following these instructions works for me with 1.5 and 1.4.3 (w/setting AWS_ROLE_SESSION_NAME): https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/

Specifically I think the eksctl command for adding an iam oidc provider and using eksctl to create the iam role might be setting up the missing pieces. For instance, I wasn't able to run aws sts assume-role-with-web-identity in a pod until I had done those:

aws sts assume-role-with-web-identity \
  --role-arn $AWS_ROLE_ARN \
  --web-identity-token file://$AWS_WEB_IDENTITY_TOKEN_FILE \
  --role-session-name test-session

Hi, I already have all of the setup in place, including the OIDC provider and role setup. Using awscli in the cluster with a service account configured in the same fashion allows me to use the intended role, as mentioned in my previous response.

eg:

$ aws sts get-caller-identity
 {
    "Account": "8888XXXXXXXX", 
    "UserId": "AROA453FGXXXXXXXXXX:botocore-session-XXXXXXXXX", 
    "Arn": "arn:aws:sts::8888XXXXXXXX:assumed-role/VaultIAMRole-XXXXXXXX/botocore-session-XXXXXXXXX"
}

As you can see, the expected role is used and not the underlying instance role. Again, this is on the same EKS cluster and EKS nodes.

This is the same role as you can see in the "server.serviceAccount.annotations" passed into the values.yaml and you can see the appropriate values are set on the "stateful set" pod.

From the information I posted above in the pod manifest:

    - name: AWS_ROLE_SESSION_NAME
      value: some_name
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::8888XXXXXXXX:role/VaultIAMRole-XXXXXXXXXX
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: vault:1.4.3

The error in the pod logs is:

$ kubectl logs pod/vault-0
Error parsing Seal configuration: error fetching AWS KMS wrapping key information: 
AccessDeniedException: User:
 arn:aws:sts::8888XXXXXXXX:assumed-role/tCaT-quickstart-amazon-eks-gargan-NodeInstanceRole-XXXXXXXXXXX/i-083XXXXXXXXXXX
 is not authorized to perform: kms:DescribeKey on resource: 
arn:aws:kms:eu-west-1:8888XXXXXXXX:key/e4384c07-5f11-4372-bce0-XXXXXXXX
	status code: 400, request id: 74f9397a-0925-4f98-bf5c-XXXXXXXXXX

As you can see the underlying instance role is being used ... not the role specified in the annotations.

Is there further detailed information I should explore that can help pinpoint this issue?

One thing that seems odd to me is that the token does not appear to be mounted by the pods:
eg:

/var/run/secrets/eks.amazonaws.com/serviceaccount/token

This does not appear in the "mounts:" section. Is this potentially a Helm Chart problem?
Edit: I wasn't looking properly; this is actually mounted as expected.

@tvoran
Member

tvoran commented Jul 27, 2020

Hi there, so I don't believe aws sts get-caller-identity is sufficient to check whether the IAM Role for the k8s service account is configured correctly. Vault needs to be able to run assume-role-with-web-identity (see my previous comment for an awscli example) in order to get credentials for interacting with the kms. At the very least, running aws sts assume-role-with-web-identity in a pod in your setup should give you some clue as to why it's failing in vault.

I've also noticed that when I create an iamserviceaccount with eksctl, the role that is created has specific trust relationships setup such that using the role's ARN only works for auth when it's used with the k8s serviceaccount with the same name and namespace that I entered as parameters.

So if I annotate a k8s service account with a different name or namespace, an eks token is still projected into the pod, and the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE vars are set correctly, and aws sts get-caller-identity also returns successfully. But aws sts assume-role-with-web-identity fails to assume the role and fetch credentials.
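
To check which identity the projected token actually carries, its payload can be decoded and the `sub` claim compared with what the role's trust relationship expects. This is a minimal sketch, assuming the token is a standard three-part JWT whose `sub` claim has the form `system:serviceaccount:<namespace>:<name>`. The helper names are hypothetical, and the signature is deliberately not verified, so this is only suitable for inspection.

```python
import base64
import json


def jwt_claims(token):
    """Decode the unverified payload of a JWT such as the projected EKS token."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def expected_subject(namespace, service_account):
    """Build the subject string for a Kubernetes service account."""
    return f"system:serviceaccount:{namespace}:{service_account}"
```

Reading the file named by AWS_WEB_IDENTITY_TOKEN_FILE and comparing `jwt_claims(token)["sub"]` with `expected_subject("default", "vault")` would show whether the pod's identity is the one the role was created for.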

And yes, the error coming back from vault only mentions the node role, but in this case I believe the AWS auth code in vault tries AssumeRoleWithWebIdentity first, and then eventually falls back to auth'ing with the node role before giving up. I've verified this by adding some tracing to the AWS client setup, but haven't been able to get more info logged from vault in this part of the process.
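
The fallback behaviour described above can be modelled with a toy credential chain. This is a simplified sketch and not the actual Vault or AWS SDK code: the provider functions, their error text, and the `resolve_credentials` helper are all hypothetical. It only illustrates why a failure in an early provider can go unreported when a later provider succeeds.

```python
class ProviderError(Exception):
    """Raised by a provider that cannot supply credentials."""


def resolve_credentials(providers):
    """Return (name, credentials, skipped) from the first provider that succeeds.

    Errors from earlier providers are collected in `skipped` but never raised,
    so a later API error names only the provider that finally supplied credentials.
    """
    skipped = []
    for name, retrieve in providers:
        try:
            return name, retrieve(), skipped
        except ProviderError as err:
            skipped.append((name, str(err)))
    raise ProviderError(f"no provider in the chain succeeded: {skipped}")


def web_identity():
    # Hypothetical failure: the role's trust policy rejects this service account.
    raise ProviderError("AccessDenied: not authorized to perform sts:AssumeRoleWithWebIdentity")


def instance_profile():
    # Hypothetical success: the node role is reachable unless metadata access is blocked.
    return {"role": "NodeInstanceRole"}


CHAIN = [("web-identity", web_identity), ("instance-profile", instance_profile)]
```

With `CHAIN`, the resolver returns the instance-profile credentials and the web identity failure survives only in `skipped`. With the instance-profile entry removed, which resembles blocking instance metadata, the resolver raises instead, consistent with the "no credentials could be loaded" error reported earlier in the thread.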

@gargana
Author

gargana commented Jul 28, 2020

Hi

I am using the following to test this now:

kubectl run --serviceaccount=vault --rm -i --tty --attach amazonlinux --image=amazonlinux -- /bin/bash -c "yum update -y && yum install awscli -y && aws sts get-caller-identity && aws sts assume-role-with-web-identity --role-session-name test --role-arn arn:aws:iam::888XXXXXXXXX:role/kubetest --web-identity-token file:///var/run/secrets/eks.amazonaws.com/serviceaccount/token"

I believe my mistake was the trust policy on the IAM Role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::888XXXXXXXXX:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/CXXXXXXXXXXXXXXXX9DF412"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
            "oidc.eks.eu-west-1.amazonaws.com/id/CXXXXXXXXXXXXXXXXX9DF412:sub": "system:serviceaccount:default:booter-script-vault"
        }
      }
    }
  ]
}

Previously, I only had the "booter-script-vault" service account mapped, as I was running a Kubernetes job to automate some of the Vault bootstrapping tasks.

I have now added an additional IAM role whose trust policy allows the "vault" service account.
eg:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::888XXXXXXXXX:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/CXXXXXXXXXXXXXXXX9DF412"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
            "oidc.eks.eu-west-1.amazonaws.com/id/CXXXXXXXXXXXXXXXXX9DF412:sub": "system:serviceaccount:default:vault"
        }
      }
    }
  ]
}

I was testing with the "booter-script-vault" service account which worked and I presumed since I had annotated the same IAM Role in the Helm chart that it should just work.

I think the lack of an error from the vault image around the failed "AssumeRoleWithWebIdentity" led me to believe the credential logic never fired somehow.
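
A trust policy mismatch like this one can be caught before deploying by checking the policy document against the service account Vault will run as. The following is a minimal sketch, assuming the policy uses a `StringEquals` condition on a key ending in `:sub`, as in the policies above. The helper names are hypothetical, and other operators such as `StringLike` are not handled.

```python
def subjects_allowed(trust_policy):
    """Collect the subjects a trust policy accepts for AssumeRoleWithWebIdentity."""
    allowed = set()
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue
        conditions = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in conditions.items():
            if key.endswith(":sub"):
                # A condition value may be a single string or a list of strings.
                allowed.update([value] if isinstance(value, str) else value)
    return allowed


def allows_service_account(trust_policy, namespace, name):
    """True if the policy accepts system:serviceaccount:<namespace>:<name> exactly."""
    return f"system:serviceaccount:{namespace}:{name}" in subjects_allowed(trust_policy)
```

Loading the first policy with `json.loads` and calling `allows_service_account(policy, "default", "vault")` would return `False`, while the same call for "booter-script-vault" would return `True`. The comparison is exact, so stray whitespace in the condition value also counts as a mismatch.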

@gargana gargana closed this as completed Jul 28, 2020
@biswa-r-singh

Thanks a lot @gargana, finally it worked!
