
Default workload identity bindings for deployments that need GCP permissions #61

Bobgy opened this issue Jun 23, 2020 · 19 comments

Bobgy (Contributor) commented Jun 23, 2020

I've found that quite a few deployments don't work properly due to a lack of GCP permissions, including:

  • profiles manager
2020-06-23T08:51:18.266Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "profile", "request": "/default-profile", "error": "googleapi: Error 403: Permission iam.serviceAccounts.getIamPolicy is required to perform this operation on service account projects/gongyuan-pipeline-test/serviceAccounts/[email protected]., forbidden"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
  • pipelines ui, visualization server deployments
  • default-editor (and default-viewer, maybe later) KSA in user namespace (needed by pipeline runs, tensorboard instances)

/cc @jlewi

I think we should set up workload identity bindings for these KSAs by default.

Proposal

Add workload identity binding for

  • profiles manager - admin GSA
  • pipelines ui, visualization server - user GSA

default-editor will be automatically bound to the user GSA once the profiles manager works properly.
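For reference, a KCC-managed workload identity binding could be sketched roughly as below. This is a hypothetical sketch, not the actual blueprint manifests: the project ID (MY_PROJECT), the GSA name (kf-admin), and the assumption that the admin GSA is a KCC-managed IAMServiceAccount are all placeholders.

```yaml
# Hypothetical sketch: let the profiles-manager KSA impersonate the
# admin GSA via workload identity. Names and project are placeholders.
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: profiles-manager-workload-identity
spec:
  # Member format: serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]
  member: serviceAccount:MY_PROJECT.svc.id.goog[kubeflow/profiles-controller-service-account]
  role: roles/iam.workloadIdentityUser
  resourceRef:
    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMServiceAccount
    name: kf-admin  # assumed KCC-managed admin GSA
---
# The KSA also needs the annotation that tells GKE which GSA it maps to.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: profiles-controller-service-account
  namespace: kubeflow
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
```

Both pieces are needed: the IAM binding authorizes the impersonation, and the annotation tells the GKE metadata server which GSA to mint tokens for.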

@jlewi I can implement this, thoughts?

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug (probability 0.82).

Bobgy (Contributor, Author) commented Jun 23, 2020

/assign @Bobgy

jlewi (Contributor) commented Jun 23, 2020

@Bobgy Isn't the profiles-manager already bound to a GSA?

jlewi (Contributor) commented Jun 23, 2020

I looked at one of the auto deployments.

It looks like the profiles manager is using the KSA serviceAccount: profiles-controller-service-account, which doesn't have a GSA bound:

kubectl -n kubeflow get serviceAccount -o yaml profiles-controller-service-account
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"kustomize.component":"profiles"},"name":"profiles-controller-service-account","namespace":"kubeflow"}}
  creationTimestamp: "2020-06-23T11:45:21Z"
  labels:
    kustomize.component: profiles
  name: profiles-controller-service-account
  namespace: kubeflow
  resourceVersion: "3510"
  selfLink: /api/v1/namespaces/kubeflow/serviceaccounts/profiles-controller-service-account
  uid: 04de24e6-b547-11ea-8696-42010a800149
secrets:
- name: profiles-controller-service-account-token-bvcxp

I think this is a bug. If the profiles manager isn't bound to any GSA, it won't be able to attach GSAs to KSAs in newly provisioned namespaces.

Looking at a v1 deployment, it looks like it is bound to the admin account:

kubectl --context=kf-v1-0210 -n kubeflow  get serviceaccount -o yaml profiles-controller-service-account
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"profiles","app.kubernetes.io/instance":"profiles-v1.0.0","app.kubernetes.io/managed-by":"kfctl","app.kubernetes.io/name":"profiles","app.kubernetes.io/part-of":"kubeflow","app.kubernetes.io/version":"v1.0.0","kustomize.component":"profiles"},"name":"profiles-controller-service-account","namespace":"kubeflow"}}
  creationTimestamp: "2020-02-10T19:12:36Z"
  labels:
    app.kubernetes.io/component: profiles
    app.kubernetes.io/instance: profiles-v1.0.0
    app.kubernetes.io/managed-by: kfctl
    app.kubernetes.io/name: profiles
    app.kubernetes.io/part-of: kubeflow
    app.kubernetes.io/version: v1.0.0
    kustomize.component: profiles
  name: profiles-controller-service-account
  namespace: kubeflow
  resourceVersion: "5691"
  selfLink: /api/v1/namespaces/kubeflow/serviceaccounts/profiles-controller-service-account
  uid: 4c172377-4c39-11ea-86b4-42010a8e01a3
secrets:
- name: profiles-controller-service-account-token-rxm4l

Bobgy (Contributor, Author) commented Jun 23, 2020

That's true, it was originally bound by kfctl.

We need to bind it via KCC now.

jlewi (Contributor) commented Jun 23, 2020

Thanks @Bobgy, that makes sense; thanks for reporting and fixing this.

jlewi pushed a commit to jlewi/testing that referenced this issue Jun 24, 2020
Notebook tests should build a docker image to run the notebook in.

* kubeflow#613 currently the way we run notebook tests is
  by firing off a K8s job on the KF cluster which runs the notebook.

  * The K8s job uses init containers to pull in source code and install
    dependencies like papermill.

  * This is a bit brittle.

* To fix this we will instead use Tekton to build a docker image that
  takes the notebook image and then adds the notebook code to it.

  * Dockerfile.notebook_runner dockerfile to build the test image.

The pipeline to run the notebook consists of two tasks

  1. A Tekton Task to build a docker image to run the notebook in

  1. A tekton task that fires off a K8s job to run the notebook on the Kubeflow cluster.

Here's a list of changes to make this work

* tekton_client should provide methods to upload artifacts but not parse
  junits

* Add a tekton_client method to construct the full image URL based on
  the digest returned from kaniko

* Copy over the code for running the notebook tests from kubeflow/examples
and start modifying it.

* Create a simple CLI to wait for nomos to sync resources to the cluster
  * This is used in some syntactic sugar make rules to aid the dev-test loop

The mnist test isn't completing successfully yet because GoogleCloudPlatform/kubeflow-distribution#61 means the KF
deployments don't have proper GSA's to write to GCS.

Related to: kubeflow#613
jlewi (Contributor) commented Jun 26, 2020

@Bobgy Let's not close this issue until:

  1. fix(gcp): default workload identity bindings kubeflow/manifests#1317 has been cherry-picked onto the 1.1 branch
  2. We have verified in the auto-deployments that the issue is fixed.

jlewi (Contributor) commented Jun 26, 2020

@issue-label-bot

Issue-Label Bot is automatically applying the label area/kfctl (probability 0.64).

jlewi pushed a commit to jlewi/testing that referenced this issue Jun 26, 2020
Bobgy (Contributor, Author) commented Jun 29, 2020

The most recent deploy "kf-vbp-0629-571" runs pipelines successfully, but the workload identity binding is still not properly set up.

$ k get sa pipeline-runner -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
  name: pipeline-runner
  namespace: kubeflow
  ...

Bobgy (Contributor, Author) commented Jun 29, 2020

I'm guessing the change kubeflow/testing#705 didn't make it into the auto deploy. Is there a way we can trigger a manual deploy?

Bobgy (Contributor, Author) commented Jun 29, 2020

I looked at the Tekton task https://kf-ci-v1.endpoints.kubeflow-ci.cloud.goog/tekton/#/namespaces/auto-deploy/pipelineruns/deploy-kf-master-p5sb2; it's using:

params:
  - name: revision
    value: gcp_blueprints
  - name: url
    value: 'https://github.com/jlewi/testing.git'
type: git

@jlewi Can you fix this?

k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jun 29, 2020
* Revamp how Tekton pipelines to run notebooks work.

* tekton_client.py can't use format strings yet because we are still running under python2.

* Remove f-style strings.

* Fix typo.

* Address PR comments.

* * copy-buckets should not abort on error as this prevents artifacts
  from being copied and thus the results from showing up in testgrid
  see #703
jlewi pushed a commit to jlewi/testing that referenced this issue Jun 29, 2020
* Use ACM to deploy the auto-deploy infrastructure; previously ACM was
  only being used to deploy the Tekton tasks; we weren't including the
  other manifests in the hydration.

* Related to: GoogleCloudPlatform/kubeflow-distribution#61
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jun 29, 2020
* Use ACM to deploy the auto-deploy infrastructure; previously ACM was
  only being used to deploy the Tekton tasks; we weren't including the
  other manifests in the hydration.

* Related to: GoogleCloudPlatform/kubeflow-distribution#61
jlewi (Contributor) commented Jun 30, 2020

The current auto-deployment was deployed before kubeflow/testing#707 was merged.

gcp-blueprint-master	kf-vbp-0629-3f5	8:28:38.534253	deploy-kf-master-rpgt4	7cfd7bc		gcloud --project=kubeflow-ci-deployment container clusters get-credentials --region=us-central1 kf-vbp-0629-3f5

The next auto-deployment will hopefully have the fix.

Bobgy (Contributor, Author) commented Jun 30, 2020

Thanks for the heads up! I'll wait for the next one then

Bobgy (Contributor, Author) commented Jun 30, 2020

This is working properly now. I verified in https://kf-vbp-0630-8f5.endpoints.kubeflow-ci-deployment.cloud.goog/_/pipeline/#/runs/details/c7af3e03-87ab-4a6b-b30c-20d8b9a5a8ea, the tfx taxi sample ran successfully.

Bobgy closed this as completed Jun 30, 2020
Bobgy (Contributor, Author) commented Jul 6, 2020

I found a problem: the default-editor KSA in user namespaces didn't get the correct workload identity binding.

Root cause: I used IAMPolicy for the workload identity bindings on the kf-user GSA. However, in multi-user mode, the profile controller also tries to add additional workload identity bindings to the kf-user GSA. As a result, the profile controller and the KCC controller periodically overwrite each other's IAM policies.

We should use IAMPolicyMember instead.
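The distinction, as I understand it: IAMPolicy is authoritative, so KCC replaces the GSA's entire IAM policy with the declared bindings, wiping out whatever the profile controller added; IAMPolicyMember manages only a single binding additively. A rough sketch, with the project ID (MY_PROJECT) and resource names as placeholders:

```yaml
# Hypothetical sketch: an additive binding managed by KCC. Unlike an
# authoritative IAMPolicy, this leaves bindings created by the profile
# controller on the same GSA untouched.
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: kf-user-pipeline-runner-workload-identity
spec:
  member: serviceAccount:MY_PROJECT.svc.id.goog[kubeflow/pipeline-runner]
  role: roles/iam.workloadIdentityUser
  resourceRef:
    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMServiceAccount
    name: kf-user  # assumed KCC-managed user GSA
```

One IAMPolicyMember per (member, role) pair means each controller owns only its own bindings, so there's nothing to fight over.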

jlewi (Contributor) commented Jul 6, 2020

@Bobgy Thanks; we will need to cherry-pick this into the 1.0 branch of manifests.

Bobgy (Contributor, Author) commented Jul 6, 2020

We should verify that notebook permissions are correct in the auto-deploys before closing this.
