
Default workload identity bindings for deployments that need GCP permissions #61

Bobgy opened this issue Jun 23, 2020 · 19 comments

Bobgy (Contributor) commented Jun 23, 2020

I've found that quite a few deployments don't work properly due to a lack of GCP permissions, including:

  • profiles manager
2020-06-23T08:51:18.266Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "profile", "request": "/default-profile", "error": "googleapi: Error 403: Permission iam.serviceAccounts.getIamPolicy is required to perform this operation on service account projects/gongyuan-pipeline-test/serviceAccounts/[email protected]., forbidden"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
  • pipelines ui, visualization server deployments
  • default-editor (and default-viewer, maybe later) KSA in user namespace (needed by pipeline runs, tensorboard instances)

/cc @jlewi

I think we should set up workload identity bindings for these KSAs by default.

Proposal

Add workload identity binding for

  • profiles manager - admin GSA
  • pipelines ui, visualization server - user GSA

default-editor will be automatically bound to the user GSA once the profiles manager works properly.
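For reference, a KCC-managed workload identity binding could be sketched roughly as below. This is a hypothetical sketch, not the actual blueprint manifests: the project ID (MY_PROJECT), the GSA name (kf-admin), and the assumption that the admin GSA is a KCC-managed IAMServiceAccount are all placeholders.

```yaml
# Hypothetical sketch: let the profiles-manager KSA impersonate the
# admin GSA via workload identity. Names and project are placeholders.
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: profiles-manager-workload-identity
spec:
  # Member format: serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]
  member: serviceAccount:MY_PROJECT.svc.id.goog[kubeflow/profiles-controller-service-account]
  role: roles/iam.workloadIdentityUser
  resourceRef:
    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMServiceAccount
    name: kf-admin  # assumed KCC-managed admin GSA
---
# The KSA also needs the annotation that tells GKE which GSA it maps to.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: profiles-controller-service-account
  namespace: kubeflow
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
```

Both pieces are needed: the IAM binding authorizes the impersonation, and the annotation tells the GKE metadata server which GSA to mint tokens for.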

@jlewi I can implement this, thoughts?

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug (probability 0.82).

Bobgy (Contributor, Author) commented Jun 23, 2020

/assign @Bobgy

jlewi (Contributor) commented Jun 23, 2020

@Bobgy Isn't the profiles-manager already bound to a GSA?

jlewi (Contributor) commented Jun 23, 2020

I looked at one of the auto deployments.

It looks like the profiles manager is using the KSA serviceAccount: profiles-controller-service-account, which doesn't have a GSA bound:

kubectl -n kubeflow get serviceAccount -o yaml profiles-controller-service-account
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"kustomize.component":"profiles"},"name":"profiles-controller-service-account","namespace":"kubeflow"}}
  creationTimestamp: "2020-06-23T11:45:21Z"
  labels:
    kustomize.component: profiles
  name: profiles-controller-service-account
  namespace: kubeflow
  resourceVersion: "3510"
  selfLink: /api/v1/namespaces/kubeflow/serviceaccounts/profiles-controller-service-account
  uid: 04de24e6-b547-11ea-8696-42010a800149
secrets:
- name: profiles-controller-service-account-token-bvcxp

I think this is a bug. If the profiles manager isn't bound to any GSA, it won't be able to attach GSAs to KSAs in newly provisioned namespaces.

Looking at a v1 deployment, it looks like it is bound to the admin account:

kubectl --context=kf-v1-0210 -n kubeflow  get serviceaccount -o yaml profiles-controller-service-account
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"profiles","app.kubernetes.io/instance":"profiles-v1.0.0","app.kubernetes.io/managed-by":"kfctl","app.kubernetes.io/name":"profiles","app.kubernetes.io/part-of":"kubeflow","app.kubernetes.io/version":"v1.0.0","kustomize.component":"profiles"},"name":"profiles-controller-service-account","namespace":"kubeflow"}}
  creationTimestamp: "2020-02-10T19:12:36Z"
  labels:
    app.kubernetes.io/component: profiles
    app.kubernetes.io/instance: profiles-v1.0.0
    app.kubernetes.io/managed-by: kfctl
    app.kubernetes.io/name: profiles
    app.kubernetes.io/part-of: kubeflow
    app.kubernetes.io/version: v1.0.0
    kustomize.component: profiles
  name: profiles-controller-service-account
  namespace: kubeflow
  resourceVersion: "5691"
  selfLink: /api/v1/namespaces/kubeflow/serviceaccounts/profiles-controller-service-account
  uid: 4c172377-4c39-11ea-86b4-42010a8e01a3
secrets:
- name: profiles-controller-service-account-token-rxm4l

Bobgy (Contributor, Author) commented Jun 23, 2020

That's true, it was originally bound by kfctl.

We need to bind it via KCC now.

jlewi (Contributor) commented Jun 23, 2020

Thanks @Bobgy, that makes sense; thanks for reporting and fixing this.

jlewi pushed a commit to jlewi/testing that referenced this issue Jun 24, 2020
Notebook tests should build a docker image to run the notebook in.

* kubeflow#613 currently the way we run notebook tests is
  by firing off a K8s job on the KF cluster which runs the notebook.

  * The K8s job uses init containers to pull in source code and install
    dependencies like papermill.

  * This is a bit brittle.

* To fix this we will instead use Tekton to build a docker image that
  takes the notebook image and then adds the notebook code to it.

  * Dockerfile.notebook_runner dockerfile to build the test image.

The pipeline to run the notebook consists of two tasks

  1. A Tekton Task to build a docker image to run the notebook in

  1. A tekton task that fires off a K8s job to run the notebook on the Kubeflow cluster.

Here's a list of changes to make this work

* tekton_client should provide methods to upload artifacts but not parse
  junits

* Add a tekton_client method to construct the full image URL based on
  the digest returned from kaniko

* Copy over the code for running the notebook tests from kubeflow/examples
and start modifying it.

* Create a simple CLI to wait for nomos to sync resources to the cluster
  * This is used in some syntactic sugar make rules to aid the dev-test loop

The mnist test isn't completing successfully yet because GoogleCloudPlatform/kubeflow-distribution#61 means the KF
deployments don't have proper GSA's to write to GCS.

Related to: kubeflow#613
jlewi (Contributor) commented Jun 26, 2020

@Bobgy Let's not close this issue until:

  1. fix(gcp): default workload identity bindings kubeflow/manifests#1317 has been cherry-picked onto the 1.1 branch
  2. We have verified in the auto-deployments that the issue is fixed.

jlewi (Contributor) commented Jun 26, 2020

@issue-label-bot

Issue-Label Bot is automatically applying the label area/kfctl (probability 0.64).

jlewi pushed a commit to jlewi/testing that referenced this issue Jun 26, 2020
Bobgy (Contributor, Author) commented Jun 29, 2020

The most recent deploy "kf-vbp-0629-571" runs pipelines successfully, but the workload identity binding is still not properly set up.

$ k get sa pipeline-runner -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
  name: pipeline-runner
  namespace: kubeflow
  ...

Bobgy (Contributor, Author) commented Jun 29, 2020

I'm guessing the change kubeflow/testing#705 didn't make it into the auto deploy. Is there a way we can trigger a manual deploy?

Bobgy (Contributor, Author) commented Jun 29, 2020

I looked at the Tekton task https://kf-ci-v1.endpoints.kubeflow-ci.cloud.goog/tekton/#/namespaces/auto-deploy/pipelineruns/deploy-kf-master-p5sb2; it's using:

params:
  - name: revision
    value: gcp_blueprints
  - name: url
    value: 'https://github.com/jlewi/testing.git'
type: git

@jlewi Can you fix this?

k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jun 29, 2020
* Revamp how Tekton pipelines to run notebooks work.

* tekton_client.py can't use format strings yet because we are still running under python2.

* Remove f-style strings.

* Fix typo.

* Address PR comments.

* * copy-buckets should not abort on error as this prevents artifacts
  from being copied and thus the results from showing up in testgrid
  see #703
jlewi pushed a commit to jlewi/testing that referenced this issue Jun 29, 2020
* Use ACM to deploy the auto-deploy infrastructure; previously ACM was
  only being used to deploy the Tekton tasks; we weren't including the
  other manifests in the hydration.

* Related to: GoogleCloudPlatform/kubeflow-distribution#61
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jun 29, 2020
* Use ACM to deploy the auto-deploy infrastructure; previously ACM was
  only being used to deploy the Tekton tasks; we weren't including the
  other manifests in the hydration.

* Related to: GoogleCloudPlatform/kubeflow-distribution#61
jlewi (Contributor) commented Jun 30, 2020

The current auto-deployment was deployed before kubeflow/testing#707 was merged.

gcp-blueprint-master	kf-vbp-0629-3f5	8:28:38.534253	deploy-kf-master-rpgt4	7cfd7bc		gcloud --project=kubeflow-ci-deployment container clusters get-credentials --region=us-central1 kf-vbp-0629-3f5

The next auto-deployment will hopefully have the fix.

Bobgy (Contributor, Author) commented Jun 30, 2020

Thanks for the heads up! I'll wait for the next one then

Bobgy (Contributor, Author) commented Jun 30, 2020

This is working properly now. I verified in https://kf-vbp-0630-8f5.endpoints.kubeflow-ci-deployment.cloud.goog/_/pipeline/#/runs/details/c7af3e03-87ab-4a6b-b30c-20d8b9a5a8ea, the tfx taxi sample ran successfully.

Bobgy closed this as completed Jun 30, 2020
Bobgy (Contributor, Author) commented Jul 6, 2020

I found a problem: the default-editor KSA in user namespaces didn't get the correct workload identity binding.

Root cause: I used IAMPolicy for the workload identity bindings on the kf-user GSA. However, in multi-user mode, the profile controller also tries to add additional workload identity bindings to the kf-user GSA. As a result, the profile controller and the KCC controller periodically overwrite each other's IAM policies.

We should use IAMPolicyMember instead.
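The distinction, as I understand it: IAMPolicy is authoritative, so KCC replaces the GSA's entire IAM policy with the declared bindings, wiping out whatever the profile controller added; IAMPolicyMember manages only a single binding additively. A rough sketch, with the project ID (MY_PROJECT) and resource names as placeholders:

```yaml
# Hypothetical sketch: an additive binding managed by KCC. Unlike an
# authoritative IAMPolicy, this leaves bindings created by the profile
# controller on the same GSA untouched.
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: kf-user-pipeline-runner-workload-identity
spec:
  member: serviceAccount:MY_PROJECT.svc.id.goog[kubeflow/pipeline-runner]
  role: roles/iam.workloadIdentityUser
  resourceRef:
    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMServiceAccount
    name: kf-user  # assumed KCC-managed user GSA
```

One IAMPolicyMember per (member, role) pair means each controller owns only its own bindings, so there's nothing to fight over.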

jlewi (Contributor) commented Jul 6, 2020

@Bobgy Thanks; we will need to cherry-pick this into the 1.0 branch of manifests.

Bobgy (Contributor, Author) commented Jul 6, 2020

We should verify that notebook permissions are correct in the auto-deploys before closing this.
