GCS storage - failing to fetch credentials in GKE using workload identity #4396

Open
nmdanny opened this issue Jan 2, 2025 · 8 comments

nmdanny commented Jan 2, 2025

Describe the bug

Dragonfly with GCS storage fails to initialize when running in a GKE (Kubernetes) pod configured to authenticate via Workload Identity.

I20250102 14:47:52.847270    14 gcs.cc:46] Could not find ~/.config/gcloud
E20250102 14:47:52.847338     1 server_family.cc:895] Failed to initialize GCS snapshot storage: No such file or directory

Looking at the relevant code
https://github.com/romange/helio/blob/493804db4110cf1631f787dd14484efc57f9575d/util/cloud/gcp/gcs.cc#L203C1-L227C1

The folder ~/.config/gcloud (or the file ~/.config/gcloud/gce) doesn't exist when running on GKE (and, I assume, in other containerized environments such as Cloud Run).

IMO, a solution here is to assume is_cloud_env = True if ~/.config/gcloud doesn't exist.

Furthermore, mounting a file containing True at /home/dfly/.config/gcloud/gce still yields the same error, even after creating the entire /home/dfly folder and chowning it to dfly.
Adding --reset-env to the exec setpriv command in entrypoint.sh fixes the problem.

To Reproduce

(GKE instructions, I assume Cloud Run, or using VMs might be simpler?)

  1. Create a GCP service account, and K8S service account (or use the default one in a namespace)
  2. Grant that GCP SA 'Storage Admin' permissions on a bucket
  3. Link the GCP SA to K8S via the following guide
  4. Create the following K8S manifest (adjust gs://my-dragonfly-bucket and my-service-account accordingly):
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dragonflydb
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: dragonflydb
  template:
    metadata:
      labels:
        app: dragonflydb
    spec:
      terminationGracePeriodSeconds: 5
      serviceAccountName: my-service-account
      containers:
      - name: dragonflydb
        image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.26.0
        args:
          - "--dir"
          - "gs://my-dragonfly-bucket"
          - "--v=1"
          - "--logtostderr"
        imagePullPolicy: Always
        ports:
        - containerPort: 6379

Expected behavior

Dragonfly should fetch the credentials via the metadata endpoint even when ~/.config/gcloud does not exist.

Environment (please complete the following information):

  • OS: Same as the image (Ubuntu 22.04)
  • Kernel: 6.1.100+
  • Containerized?: Kubernetes (GKE)
  • Dragonfly Version: v1.26.0
nmdanny added the bug label on Jan 2, 2025
Collaborator

romange commented Jan 2, 2025

Yeah, I realize now that the logic there is not very good. It seems that we can also identify cloud environments by checking for /etc/google_instance_id and /etc/cloud/. Can you please check whether the Dragonfly container on GKE has access to these files?

Author

nmdanny commented Jan 2, 2025

I don't see anything Google-specific. AFAIK, GKE doesn't mount anything Google-specific into pods (at least, nothing beyond what Kubernetes does).

There is this file which I think is from the container itself?

cat /etc/cloud/build.info
serial: 20240911.1

Collaborator

romange commented Jan 2, 2025

Ah, I see now that /etc/cloud/ is a local folder in the container image that we build, so it's not possible to tell whether Dragonfly runs in the cloud via filesystem checks. OK, can you please confirm that you can access metadata.google.internal from the container?

Author

nmdanny commented Jan 2, 2025

Yes - the feature works fine once I apply the workaround (of creating ~/.config/gcloud/gce)

(I don't have curl/wget in the container to test this directly.)
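
For anyone who wants to check the metadata server directly, here is a minimal sketch of the token request. It assumes a Python interpreter is available somewhere that shares the pod's Kubernetes service account, which is not the case in the stock Dragonfly image. The function names are made up for illustration; the URL path and the Metadata-Flavor header are the ones the GCP metadata server documents for fetching the default service account token.

```python
import json
import urllib.request

# Documented metadata-server path for the default service account's token.
TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request(url=TOKEN_URL):
    """Build the GET request; the metadata server rejects requests
    that lack the Metadata-Flavor: Google header."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def fetch_access_token(timeout=2.0):
    """Fetch a token. Only works from inside GCP (GCE, GKE, Cloud Run, ...)."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        return json.load(resp)["access_token"]
```

Calling fetch_access_token() from outside GCP should fail with a name-resolution or timeout error, which is itself a cheap "am I in the cloud?" probe.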

IMO there's no need to explicitly check for the existence of the metadata server. Per the GCP docs, it is always last in priority, so if GOOGLE_APPLICATION_CREDENTIALS is not set and the local gcloud CLI config doesn't exist, you can immediately try to fetch a token from the metadata server.

By the way, I haven't seen any reference to GOOGLE_APPLICATION_CREDENTIALS in the code. That is also a popular way of passing credentials.
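
To make the priority order concrete, here is a rough sketch of the lookup I have in mind. It is not what helio does today; the function name and return values are hypothetical, and the application_default_credentials.json filename is the location the gcloud CLI uses for Application Default Credentials, which I am assuming is the file of interest here.

```python
import os
from pathlib import Path


def pick_credential_source(env, home):
    """Return the credential source to try, in ADC priority order.

    `env` is a mapping such as os.environ; `home` is the user's home directory.
    """
    # 1. An explicit key file named by GOOGLE_APPLICATION_CREDENTIALS wins.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "env_file"

    # 2. Credentials written by the local gcloud CLI.
    adc = Path(home) / ".config" / "gcloud" / "application_default_credentials.json"
    if adc.is_file():
        return "gcloud_config"

    # 3. Otherwise go straight to the metadata server. A missing
    #    ~/.config/gcloud is not an error, just "nothing local configured".
    return "metadata_server"
```

The point is the last branch: a missing config directory falls through to the metadata server instead of aborting initialization.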

Collaborator

romange commented Jan 3, 2025

Thanks for providing the reference to the correct spec, Daniel 👏🏼
I see you are/were a student at HUJI and have C++ experience. Would you like to fix the problem yourself?

Collaborator

romange commented Jan 3, 2025

btw, you mentioned --reset-env with setpriv. Do you know why this would affect Dragonfly's behaviour in this context?

Author

nmdanny commented Jan 5, 2025

Sure, I can try fixing (though my C++ is a bit rusty)

Regarding setpriv - by default this command does not modify your env vars, so $HOME still points to /root (which the dfly user has no access to), and globbing ~/.config/gcloud fails. (Though the fact that Dragonfly says no such file exists seems misleading. Maybe it has something to do with how the glob function works?)

This can be demonstrated by opening a shell into the container (root) and running:

setpriv --reuid=dfly --regid=dfly --clear-groups  -- bash
bash: /root/.bashrc: Permission denied

dfly@dragonflydb-79cfb45bff-29d5t:/data$ echo $HOME
/root
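
One way to sidestep the stale $HOME is to take the home directory from the passwd database for the effective uid rather than from the environment. Below is a small Python sketch of that idea, just to illustrate the lookup; the helper names are hypothetical, and the real fix would be in C++ (getpwuid is the equivalent libc call).

```python
import os
import pwd


def passwd_home(uid=None):
    """Home directory recorded in the passwd database for `uid`
    (defaults to the effective uid), or None if the uid has no entry."""
    uid = os.geteuid() if uid is None else uid
    try:
        return pwd.getpwuid(uid).pw_dir
    except KeyError:
        return None


def resolve_home(env):
    """Prefer the passwd entry; fall back to $HOME only when none exists."""
    return passwd_home() or env.get("HOME")


if __name__ == "__main__":
    # After `setpriv --reuid=dfly ...` without --reset-env these two can differ.
    print("HOME from env:   ", os.environ.get("HOME"))
    print("HOME from passwd:", passwd_home())
```

Run under the setpriv invocation above, the first line should still print /root while the second prints whatever home directory the dfly passwd entry records.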

Collaborator

romange commented Jan 5, 2025

Thank you for the explanation. For now, I will not proceed with a fix and will wait for you to check. Please let me know if you prefer a different course of action. As you saw, all the relevant code is located in the "helio" repo. I typically use "gcs_demo" for manually checking issues related to Google Cloud Storage access.

Please let me know if you need any help with building helio.
