Metadata agent not working: Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request #25
In order for the metadata agent to work, your project must be signed up for Stackdriver, and have the Stackdriver API enabled in the Google Cloud console. Can you please confirm that you've done both of these?
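For reference, a minimal sketch of checking and enabling the Stackdriver API with the gcloud CLI (this assumes gcloud is already authenticated against the project in question):
```
# Check whether the Stackdriver API is already enabled in the current project.
gcloud services list --enabled | grep stackdriver

# Enable it if it is missing (requires the appropriate IAM permissions).
gcloud services enable stackdriver.googleapis.com
```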
Oh, I just realized you're saying it used to work and no longer works after upgrading. Did you just swap out the image, or did you re-apply the entire agents.yaml?
I re-applied the whole agents.yaml. This is my current config:
Could it be that the new agent is using some beta API that one needs to get whitelisted for first? (Just wondering what the v1beta3 refers to in the PR that seemingly introduced this bug.)
Can you please update the agent's log level and share the resulting output?
Hi, after increasing the log level, this is the error I'm getting:
This is an oversight in the documentation on our part: the metadata agent no longer uses the GCE metadata server to automatically discover the cluster's name and location. I confirmed this by looking at the "name" property in the payload you sent.
Please set the values for the cluster name and cluster location in the metadata agent's ConfigMap.
After updating the ConfigMap, you will need to delete the metadata agent pods; the new pods that get spawned will load the updated ConfigMap. You can do this by running the commands below.
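A minimal sketch of those commands, reusing the stackdriver-agents namespace from later in this thread; the ConfigMap name, key names, and label selector are assumptions and should be checked against your agents.yaml:
```
# Create (or recreate) the ConfigMap holding the cluster identity.
# "metadata-agent-config", "cluster_name" and "cluster_location" are assumed names.
kubectl create configmap metadata-agent-config \
  --namespace=stackdriver-agents \
  --from-literal=cluster_name=YOUR_CLUSTER_NAME \
  --from-literal=cluster_location=YOUR_CLUSTER_LOCATION

# Delete the metadata agent pods; the controller recreates them and the new pods
# pick up the updated ConfigMap. The label selector here is an assumption as well.
kubectl delete pods \
  --namespace=stackdriver-agents \
  --selector=app=stackdriver-metadata-agent
```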
What does the location have to be set to? The cluster's zone or region?
It could be either; for example, a zone such as us-east1-d, or a region such as us-east1.
Can you confirm that this issue has been fixed for you?
Hi, I just checked again. The error message actually disappeared after setting those values via the ConfigMap. However, none of the metadata (namespaces, pods, etc.) shows up anymore in the Stackdriver Kubernetes Monitoring UI. This is the only log output from the metadata agent (no further output appears, even after hours):
That amount of logs is expected. Does the project ID that was in the logs before match the project in which you're attempting to pull up the Kubernetes Monitoring UI? The project ID is extracted from the credentials.json file that is mounted into the container.
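One way to verify that project ID without entering the container is to inspect the object that backs the credentials mount (a sketch only; the ConfigMap name below is an assumption, so list the ConfigMaps first if it differs):
```
# List the ConfigMaps in the agents' namespace to find the one holding credentials.json.
kubectl get configmaps --namespace=stackdriver-agents

# Inspect the assumed "google-cloud-credentials" ConfigMap and pull out the project ID.
kubectl get configmap google-cloud-credentials \
  --namespace=stackdriver-agents -o yaml | grep project_id
```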
Hey, not sure if that's what you were asking for, but here are a few things:
Just in case it is actually required to mount service account credentials into the pod, I would also like to raise the concern that the pod spec refers to a ConfigMap instead of a Secret, which would be the safer choice in this instance.
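For illustration, a sketch of the Secret-based alternative suggested above (the Secret name and the source file path are placeholders):
```
# Store the service account key in a Secret rather than a ConfigMap.
kubectl create secret generic google-cloud-credentials \
  --namespace=stackdriver-agents \
  --from-file=credentials.json=/path/to/credentials.json
```
The pod spec's volume would then reference the Secret via secretName instead of a configMap entry.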
Hi,
In my case I'm using the default Stackdriver configs provided:
Is this a known issue?
@tadeuszwojcik Have you enabled the Stackdriver API in your project?
@igorpeshansky thanks, I thought I had, but I must have missed enabling it for the new project. It's working great with the API enabled, thanks!
I am running on GKE. The Stackdriver API is enabled. Logs are enabled; they are just missing the metadata. Based on API metrics, the service account for GKE is getting a 100% HTTP 400 error rate on the PublishResourceMetadata call. I am not sure where to find more logs to help with debugging. Please let me know.
Are you installing the metadata agent via GKE, or installing it manually yourself using these configs? If the latter, you'll need to configure the cluster name and cluster location shown in #25 (comment).
GKE. It was installed when the cluster was created, I assume. It used to work before, as older logs had the metadata.
Have you enabled the Stackdriver API, as mentioned in #25 (comment)?
If you have enabled the Stackdriver API, have you signed up for Stackdriver in the project you've installed the cluster into? You can trigger this by clicking "Monitoring" in the menu inside of the Google Cloud Console.
Yes, I have. I can see a 100% error rate on the google.cloud.stackdriver.v1beta3.ResourceService.PublishResourceMetadata API call.
Can you please run this script, and paste the output?
Line 81 of your script is truncated.
I think I have managed to fix the truncated parts and this is the output:
Looking at the logs of the
To clear up an assumption: metadata is no longer being written to logs via the metadata labels; the metadata properties are now embedded directly into the log entry. So you should still be able to query by pod labels with values that exist in the log entry. Did you ensure that a Stackdriver account exists by clicking "Monitoring" inside of your Google Cloud project?
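As an illustration of querying by the embedded Kubernetes fields, a sketch using gcloud (the resource type assumes the new Stackdriver Kubernetes Monitoring resource model, and the pod name is a placeholder):
```
# Read recent entries for a specific pod using the Kubernetes resource labels
# that are now attached directly to each log entry.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.pod_name="my-pod-abc123"' \
  --limit=10
```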
Correct. It's just that when you click on the
Yes. I have created the necessary workspace too.
At this point your best avenue is to submit a ticket directly to Cloud Support so that we can triage your clusters directly. It seems as though everything is configured properly, yet you're still getting errors. My only remaining guess is that you have a custom network associated with your cluster instead of the default network provided by GKE clusters.
@lawliet89 what is the location for your cluster? Is it a GCP zone (such as us-east1-d) or a GCP region (such as us-east1)?
Alright. It happened to two GKE clusters at the same time, which is bizarre. It's a regional cluster (ap-southeast1).
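For reference, one way to confirm whether a cluster is zonal or regional (a zonal cluster reports a zone in the location column, a regional cluster reports a region):
```
# List clusters with their locations.
gcloud container clusters list --format="table(name,location)"
```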
Alright, the bug is confirmed: this is impacting regional clusters due to a bug in how we're sending their identity to the backend. We have a fix in place, but it may take a few weeks to actually roll out. The side effect of this is log spam from the metadata agent, as you've seen, but the Kubernetes Monitoring UI should still be visible, as this data is collected from two separate sources.
Hi, I have enabled the Stackdriver API and am using a custom service account with the proper privileges for my Kubernetes cluster, but I still get the error. I have fixed and run your script; it had the following output:
What could be the problem? EDIT: I am seeing a 100% error rate on the PublishResourceMetadata call. I have a Stackdriver workspace connected to my project. I have added the following roles to my Kubernetes service account:
My Kubernetes cluster is zonal, private (with master enabled), on a custom VPC network.
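To double-check which roles the custom service account actually holds, a sketch using gcloud (PROJECT_ID and the service account email are placeholders):
```
# List every role bound to the given service account in the project.
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```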
Your custom service account also needs to have the "Stackdriver Resource Metadata Writer" role (roles/stackdriver.resourceMetadata.writer) assigned.
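A sketch of granting that role with gcloud (PROJECT_ID and the service account email are placeholders; roles/stackdriver.resourceMetadata.writer is the role ID corresponding to "Stackdriver Resource Metadata Writer"):
```
# Grant the Stackdriver Resource Metadata Writer role to the custom service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/stackdriver.resourceMetadata.writer"
```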
@igorpeshansky Thanks a lot! I have been debugging this for ages!
There is an undocumented Role needed (see Stackdriver/kubernetes-configs#25 (comment)) to use the new Stackdriver for Kubernetes, else this happens:
```
W0627 11:44:41.308933 1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:41.408116 1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:41.635170 1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:42.107927 1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:42.308177 1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W06
```
We're running a few GKE clusters which have Stackdriver Monitoring manually installed using the configs from this repo (the reason for the manual install is mainly to add a few custom log parsing rules to the config).
After upgrading to the latest version of the configs, which seems to include some big changes to the metadata agent, the metadata agent no longer works and metadata disappears from the Kubernetes Dashboard in Stackdriver Monitoring.
The metadata agent prints the following errors:
obtained via:
kubectl logs -n stackdriver-agents stackdriver-metadata-agent-cluster-level-78599b584-wkprj
The config was obtained from this URL: https://raw.githubusercontent.com/Stackdriver/kubernetes-configs/stable/agents.yaml
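For completeness, a sketch of how such a manual install or re-apply might look (it assumes any custom log parsing rules are edited into the file before applying):
```
# Download the stable manifests from this repo.
curl -sSL -o agents.yaml \
  https://raw.githubusercontent.com/Stackdriver/kubernetes-configs/stable/agents.yaml

# Edit custom log parsing rules as needed, then apply the manifests.
kubectl apply -f agents.yaml
```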
The logging agent continues to work.
The issue seems to have been introduced by #20.