Metadata agent not working: Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request #25

Closed
geekflyer opened this issue Mar 15, 2019 · 33 comments

Comments

@geekflyer

geekflyer commented Mar 15, 2019

We're running a few GKE clusters that have Stackdriver Monitoring installed manually using the configs from this repo (the reason for the manual install is mainly to add a few custom log parsing rules to the config).

After upgrading to the latest version of the configs, which seems to include some big changes to the metadata agent, the metadata agent no longer works and metadata disappears from the Kubernetes dashboard in Stackdriver Monitoring.

The metadata agent prints the following errors:
obtained via: kubectl logs -n stackdriver-agents stackdriver-metadata-agent-cluster-level-78599b584-wkprj

W0315 01:48:28.766316       1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.783876       1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.934092       1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request

The config was obtained from this url: https://raw.githubusercontent.com/Stackdriver/kubernetes-configs/stable/agents.yaml

The logging agent continues to work.

The issue seems to have been introduced by #20.

@bmoyles0117
Contributor

In order for the metadata agent to work, your project must be signed up for Stackdriver, and have the Stackdriver API enabled in the Google Cloud console. Can you please confirm that you've done both of these?
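For reference, something like the following should show whether the API is enabled, and enable it if not (the project ID is a placeholder):

gcloud services list --enabled --project [PROJECT_ID] | grep stackdriver
gcloud services enable stackdriver.googleapis.com --project [PROJECT_ID]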

@bmoyles0117
Contributor

Oh, I just realized you're saying it used to work and now no longer works after upgrading. Did you just swap out the image, or did you re-apply the entire agents.yaml?

@geekflyer
Author

I re-applied the whole agents.yaml.

This is my current config:

# THIS FILE IS AUTO-GENERATED DO NOT EDIT
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: stackdriver-heapster
    version: v1.5.3
  name: heapster
  namespace: stackdriver-agents
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: stackdriver-heapster
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: stackdriver-heapster
        version: v1.5.3
    spec:
      containers:
      - env:
        - name: CLUSTER_NAME
          valueFrom:
            configMapKeyRef:
              name: cluster-config
              key: cluster_name
        - name: CLUSTER_LOCATION
          valueFrom:
            configMapKeyRef:
              name: cluster-config
              key: cluster_location
        - name: GOOGLE_APPLICATION_CREDENTIALS
          valueFrom:
            configMapKeyRef:
              name: google-cloud-config
              key: credentials_path
        command:
        - /heapster
        - --source=kubernetes.summary_api:''
        - --sink=stackdriver:?cluster_name=$(CLUSTER_NAME)&cluster_location=$(CLUSTER_LOCATION)&zone=$(CLUSTER_LOCATION)&use_old_resources=false&use_new_resources=true&min_interval_sec=100&batch_export_timeout_sec=110
        image: gcr.io/stackdriver-agents/stackdriver-heapster-amd64:v1.5.0-beta.3
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8082
            scheme: HTTP
          initialDelaySeconds: 180
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: heapster
        resources:
          limits:
            cpu: 88m
            memory: 204Mi
          requests:
            cpu: 88m
            memory: 204Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/google-cloud/
          name: google-cloud-config
      - command:
        - /pod_nanny
        - --cpu=80m
        - --extra-cpu=0.5m
        - --memory=140Mi
        - --extra-memory=4Mi
        - --threshold=5
        - --deployment=heapster
        - --container=heapster
        - --poll-period=300000
        - --estimator=exponential
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: gcr.io/google_containers/addon-resizer:1.7
        imagePullPolicy: IfNotPresent
        name: heapster-nanny
        resources:
          limits:
            cpu: 50m
            memory: 112360Ki
          requests:
            cpu: 50m
            memory: 112360Ki
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: heapster
      serviceAccountName: heapster
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: google-cloud-config
        name: google-cloud-config

---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: stackdriver-logging-agent
  name: stackdriver-logging-agent
  namespace: stackdriver-agents
spec:
  selector:
    matchLabels:
      app: stackdriver-logging-agent
  template:
    metadata:
      labels:
        app: stackdriver-logging-agent
    spec:
      containers:
      - env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: GOOGLE_APPLICATION_CREDENTIALS
          valueFrom:
            configMapKeyRef:
              name: google-cloud-config
              key: credentials_path
        - name: CLUSTER_NAME
          valueFrom:
            configMapKeyRef:
              name: cluster-config
              key: cluster_name
        - name: CLUSTER_LOCATION
          valueFrom:
            configMapKeyRef:
              name: cluster-config
              key: cluster_location
        image: gcr.io/stackdriver-agents/stackdriver-logging-agent:1.6.4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}; STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900}; if [ ! -e /var/log/k8s-fluentd-buffers ]; then
                exit 1;
              fi; touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck; if [[ -z "$(find /var/log/k8s-fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit)" ]]; then
                rm -rf /var/log/fluentd-buffers;
                exit 1;
              fi; touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness; if [[ -z "$(find /var/log/k8s-fluentd-buffers -type f -newer /tmp/marker-liveness -print -quit)" ]]; then
                exit 1;
              fi;
          failureThreshold: 3
          initialDelaySeconds: 600
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 1
        name: logging-agent
        resources:
          limits:
            cpu: "1"
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 200Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /etc/google-fluentd/config.d
          name: config-volume
        - mountPath: /etc/google-cloud/
          name: google-cloud-config
      serviceAccount: logging-agent
      serviceAccountName: logging-agent
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: logging-agent-config
        name: config-volume
      - configMap:
          defaultMode: 420
          name: google-cloud-config
        name: google-cloud-config
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
---
apiVersion: v1
data:
  containers.input.conf: |-
    # This configuration file for Fluentd is used
    # to watch changes to Docker log files that live in the
    # directory /var/lib/docker/containers/ and are symbolically
    # linked to from the /var/log/containers directory using names that capture the
    # pod name and container name. These logs are then submitted to
    # Google Cloud Logging which assumes the installation of the cloud-logging plug-in.
    #
    # Example
    # =======
    # A line in the Docker log file might look like this JSON:
    #
    # {"log":"2014/09/25 21:15:03 Got request with path wombat\\n",
    #  "stream":"stderr",
    #   "time":"2014-09-25T21:15:03.499185026Z"}
    #
    # The record reformer is used to write the tag to focus on the pod name
    # and the Kubernetes container name. For example a Docker container's logs
    # might be in the directory:
    #  /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b
    # and in the file:
    #  997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
    # where 997599971ee6... is the Docker ID of the running container.
    # The Kubernetes kubelet makes a symbolic link to this file on the host machine
    # in the /var/log/containers directory which includes the pod name and the Kubernetes
    # container name:
    #    synthetic-logger-0.25lps-pod_default-synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #    ->
    #    /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
    # The /var/log directory on the host is mapped to the /var/log directory in the container
    # running this instance of Fluentd and we end up collecting the file:
    #   /var/log/containers/synthetic-logger-0.25lps-pod_default-synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    # This results in the tag:
    #  var.log.containers.synthetic-logger-0.25lps-pod_default-synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    # The record reformer is used is discard the var.log.containers prefix and
    # the Docker container ID suffix and "kubernetes." is pre-pended giving the tag:
    #   kubernetes.synthetic-logger-0.25lps-pod_default-synth-lgr
    # Tag is then parsed by google_cloud plugin and translated to the metadata,
    # visible in the log viewer

    # Example:
    # {"log":"[info:2016-02-16T16:04:05.930-08:00] Some log text here\n","stream":"stdout","time":"2016-02-17T00:04:05.931087621Z"}
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/k8s-gcp-containers.log.pos
      tag reform.*
      read_from_head true
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>

    <filter reform.**>
      @type parser
      format /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<log>.*)/
      reserve_data true
      suppress_parse_error_log true
      emit_invalid_record_to_error false
      key_name log
    </filter>

    <filter reform.**>
      # This plugin uses environment variables KUBERNETES_SERVICE_HOST and
      # KUBERNETES_SERVICE_PORT to talk to the API server. These environment
      # variables are added by kubelet automatically.
      @type kubernetes_metadata
      # Interval in seconds to dump cache stats locally in the Fluentd log.
      stats_interval 300
      # TTL in seconds of each cached element.
      cache_ttl 30
    </filter>

    <filter reform.**>
      # We have to use record_modifier because only this plugin supports complex
      # logic to modify record the way we need.
      @type record_modifier
      enable_ruby true
      <record>
        # Extract "kubernetes"->"labels" and set them as
        # "logging.googleapis.com/labels". Prefix these labels with
        # "k8s-pod-labels" to distinguish with other labels and avoid
        # label name collision with other types of labels.
        _dummy_ ${if record.is_a?(Hash) && record.has_key?('kubernetes') && record['kubernetes'].has_key?('labels') && record['kubernetes']['labels'].is_a?(Hash); then; record["logging.googleapis.com/labels"] = record['kubernetes']['labels'].map{ |k, v| ["k8s-pod-label/#{k}", v]}.to_h; end; nil}
      </record>
      # Delete this dummy field and the rest of "kubernetes" and "docker".
      remove_keys _dummy_,kubernetes,docker
    </filter>

    <match reform.**>
      @type record_reformer
      enable_ruby true
      <record>
        # Extract local_resource_id from tag for 'k8s_container' monitored
        # resource. The format is:
        # 'k8s_container.<namespace_name>.<pod_name>.<container_name>'.
        "logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"}
        # Rename the field 'log' to a more generic field 'message'. This way the
        # fluent-plugin-google-cloud knows to flatten the field as textPayload
        # instead of jsonPayload after extracting 'time', 'severity' and
        # 'stream' from the record.
        message ${record['log']}
        # SOLVVY SPECIFIC CONFIG: If record contains `solvvyLogLevel` map this to stackdriver log levels. 
        # If 'severity' is not set, assume stderr is ERROR and stdout is INFO.
        severity ${case record["solvvyLogLevel"]; when "TRACE"; "DEBUG"; when "WARN"; "WARNING"; else; record["solvvyLogLevel"]; end || record['severity'] || if record['stream'] == 'stderr' then 'ERROR' else 'INFO' end}
      </record>
      tag ${if record['stream'] == 'stderr' then 'raw.stderr' else 'raw.stdout' end}
      remove_keys stream,log
    </match>

    # Detect exceptions in the log output and forward them as one log entry.
    <match {raw.stderr,raw.stdout}>
      @type detect_exceptions

      remove_tag_prefix raw
      message message
      stream "logging.googleapis.com/local_resource_id"
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>
  system.input.conf: |-
    # Example:
    # Dec 21 23:17:22 gke-foo-1-1-4b5cbd14-node-4eoj startupscript: Finished running startup script /var/run/google.startup.script
    <source>
      @type tail
      format syslog
      path /var/log/startupscript.log
      pos_file /var/log/k8s-gcp-startupscript.log.pos
      tag startupscript
    </source>

    # Examples:
    # time="2016-02-04T06:51:03.053580605Z" level=info msg="GET /containers/json"
    # time="2016-02-04T07:53:57.505612354Z" level=error msg="HTTP Error" err="No such image: -f" statusCode=404
    # TODO(random-liu): Remove this after cri container runtime rolls out.
    <source>
      @type tail
      format /^time="(?<time>[^)]*)" level=(?<severity>[^ ]*) msg="(?<message>[^"]*)"( err="(?<error>[^"]*)")?( statusCode=($<status_code>\d+))?/
      path /var/log/docker.log
      pos_file /var/log/k8s-gcp-docker.log.pos
      tag docker
    </source>

    # Example:
    # 2016/02/04 06:52:38 filePurge: successfully removed file /var/etcd/data/member/wal/00000000000006d0-00000000010a23d1.wal
    <source>
      @type tail
      # Not parsing this, because it doesn't have anything particularly useful to
      # parse out of it (like severities).
      format none
      path /var/log/etcd.log
      pos_file /var/log/k8s-gcp-etcd.log.pos
      tag etcd
    </source>

    # Multi-line parsing is required for all the kube logs because very large log
    # statements, such as those that include entire object bodies, get split into
    # multiple lines by glog.

    # Example:
    # I0204 07:32:30.020537    3368 server.go:1048] POST /stats/container/: (13.972191ms) 200 [[Go-http-client/1.1] 10.244.1.3:40537]
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/kubelet.log
      pos_file /var/log/k8s-gcp-kubelet.log.pos
      tag kubelet
    </source>

    # Example:
    # I1118 21:26:53.975789       6 proxier.go:1096] Port "nodePort for kube-system/default-http-backend:http" (:31429/tcp) was open before and is still needed
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/kube-proxy.log
      pos_file /var/log/k8s-gcp-kube-proxy.log.pos
      tag kube-proxy
    </source>

    # Example:
    # I0204 07:00:19.604280       5 handlers.go:131] GET /api/v1/nodes: (1.624207ms) 200 [[kube-controller-manager/v1.1.3 (linux/amd64) kubernetes/6a81b50] 127.0.0.1:38266]
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/kube-apiserver.log
      pos_file /var/log/k8s-gcp-kube-apiserver.log.pos
      tag kube-apiserver
    </source>

    # Example:
    # I0204 06:55:31.872680       5 servicecontroller.go:277] LB already exists and doesn't need update for service kube-system/kube-ui
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/kube-controller-manager.log
      pos_file /var/log/k8s-gcp-kube-controller-manager.log.pos
      tag kube-controller-manager
    </source>

    # Example:
    # W0204 06:49:18.239674       7 reflector.go:245] pkg/scheduler/factory/factory.go:193: watch of *api.Service ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [2578313/2577886]) [2579312]
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/kube-scheduler.log
      pos_file /var/log/k8s-gcp-kube-scheduler.log.pos
      tag kube-scheduler
    </source>

    # Example:
    # I1104 10:36:20.242766       5 rescheduler.go:73] Running Rescheduler
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/rescheduler.log
      pos_file /var/log/k8s-gcp-rescheduler.log.pos
      tag rescheduler
    </source>

    # Example:
    # I0603 15:31:05.793605       6 cluster_manager.go:230] Reading config from path /etc/gce.conf
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/glbc.log
      pos_file /var/log/k8s-gcp-glbc.log.pos
      tag glbc
    </source>

    # Example:
    # I0603 15:31:05.793605       6 cluster_manager.go:230] Reading config from path /etc/gce.conf
    <source>
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/cluster-autoscaler.log
      pos_file /var/log/k8s-gcp-cluster-autoscaler.log.pos
      tag cluster-autoscaler
    </source>

    # Logs from systemd-journal for interesting services.
    # TODO(random-liu): Keep this for compatibility, remove this after
    # cri container runtime rolls out.
    <source>
      @type systemd
      filters [{ "_SYSTEMD_UNIT": "docker.service" }]
      pos_file /var/log/k8s-gcp-journald-docker.pos
      read_from_head true
      tag docker
    </source>

    <source>
      @type systemd
      filters [{ "_SYSTEMD_UNIT": "{{ container_runtime }}.service" }]
      pos_file /var/log/k8s-gcp-journald-container-runtime.pos
      read_from_head true
      tag container-runtime
    </source>

    <source>
      @type systemd
      filters [{ "_SYSTEMD_UNIT": "kubelet.service" }]
      pos_file /var/log/k8s-gcp-journald-kubelet.pos
      read_from_head true
      tag kubelet
    </source>

    <source>
      @type systemd
      filters [{ "_SYSTEMD_UNIT": "node-problem-detector.service" }]
      pos_file /var/log/k8s-gcp-journald-node-problem-detector.pos
      read_from_head true
      tag node-problem-detector
    </source>

    # BEGIN_NODE_JOURNAL
    # Whether to include node-journal or not is determined when starting the
    # cluster. It is not changed when the cluster is already running.
    <source>
      @type systemd
      pos_file /var/log/k8s-gcp-journald.pos
      read_from_head true
      tag node-journal
    </source>

    <filter node-journal>
      @type grep
      <exclude>
        key _SYSTEMD_UNIT
        pattern ^(docker|{{ container_runtime }}|kubelet|node-problem-detector)\.service$
      </exclude>
    </filter>
    # END_NODE_JOURNAL
  monitoring.conf: |-
    # This source is used to acquire approximate process start timestamp,
    # which purpose is explained before the corresponding output plugin.
    <source>
      @type exec
      command /bin/sh -c 'date +%s'
      tag process_start
      time_format %Y-%m-%d %H:%M:%S
      keys process_start_timestamp
    </source>

    # This filter is used to convert process start timestamp to integer
    # value for correct ingestion in the prometheus output plugin.
    <filter process_start>
      @type record_transformer
      enable_ruby true
      auto_typecast true
      <record>
        process_start_timestamp ${record["process_start_timestamp"].to_i}
      </record>
    </filter>
  output.conf: |-
    # This match is placed before the all-matching output to provide metric
    # exporter with a process start timestamp for correct exporting of
    # cumulative metrics to Stackdriver.
    <match process_start>
      @type prometheus

      <metric>
        type gauge
        name process_start_time_seconds
        desc Timestamp of the process start in seconds
        key process_start_timestamp
      </metric>
    </match>

    # This filter allows to count the number of log entries read by fluentd
    # before they are processed by the output plugin. This in turn allows to
    # monitor the number of log entries that were read but never sent, e.g.
    # because of liveness probe removing buffer.
    <filter **>
      @type prometheus
      <metric>
        type counter
        name logging_entry_count
        desc Total number of log entries generated by either application containers or system components
      </metric>
    </filter>

    # This section is exclusive for k8s_container logs. Those come with
    # 'stderr'/'stdout' tags.
    # TODO(instrumentation): Reconsider this workaround later.
    # Trim the entries which exceed slightly less than 100KB, to avoid
    # dropping them. It is a necessity, because Stackdriver only supports
    # entries that are up to 100KB in size.
    <filter {stderr,stdout}>
      @type record_transformer
      enable_ruby true
      <record>
        message ${record['message'].length > 100000 ? "[Trimmed]#{record['message'][0..100000]}..." : record['message']}
      </record>
    </filter>

    # Do not collect fluentd's own logs to avoid infinite loops.
    <match fluent.**>
      @type null
    </match>

    # Add a unique insertId to each log entry that doesn't already have it.
    # This helps guarantee the order and prevent log duplication.
    <filter **>
      @type add_insert_ids
    </filter>

    # This filter parses the 'source' field created for glog lines into a single
    # top-level field, for proper processing by the output plugin.
    # For example, if a record includes:
    #     {"source":"handlers.go:131"},
    # then the following entry will be added to the record:
    #     {"logging.googleapis.com/sourceLocation":
    #          {"file":"handlers.go", "line":"131"}
    #     }
    <filter **>
      @type record_transformer
      enable_ruby true
      <record>
        "logging.googleapis.com/sourceLocation" ${if record.is_a?(Hash) && record.has_key?('source'); source_parts = record['source'].split(':', 2); {'file' => source_parts[0], 'line' => source_parts[1]} if source_parts.length == 2; else; nil; end}
      </record>
    </filter>

    # This section is exclusive for k8s_container logs. These logs come with
    # 'stderr'/'stdout' tags.
    # We use a separate output stanza for 'k8s_node' logs with a smaller buffer
    # because node logs are less important than user's container logs.
    <match {stderr,stdout}>
      @type google_cloud

      # Try to detect JSON formatted log entries.
      detect_json true
      # Collect metrics in Prometheus registry about plugin activity.
      enable_monitoring true
      monitoring_type prometheus
      # Allow log entries from multiple containers to be sent in the same request.
      split_logs_by_tag false
      # Set the buffer type to file to improve the reliability and reduce the memory consumption
      buffer_type file
      buffer_path /var/log/k8s-fluentd-buffers/kubernetes.containers.buffer
      # Set queue_full action to block because we want to pause gracefully
      # in case of the off-the-limits load instead of throwing an exception
      buffer_queue_full_action block
      # Set the chunk limit conservatively to avoid exceeding the recommended
      # chunk size of 5MB per write request.
      buffer_chunk_limit 512k
      # Cap the combined memory usage of this buffer and the one below to
      # 1MiB/chunk * (6 + 2) chunks = 8 MiB
      buffer_queue_limit 6
      # Never wait more than 5 seconds before flushing logs in the non-error case.
      flush_interval 5s
      # Never wait longer than 30 seconds between retries.
      max_retry_wait 30
      # Disable the limit on the number of retries (retry forever).
      disable_retry_limit
      # Use multiple threads for processing.
      num_threads 2
      use_grpc true
      k8s_cluster_name "#{ENV["CLUSTER_NAME"]}"
      k8s_cluster_location "#{ENV["CLUSTER_LOCATION"]}"
      adjust_invalid_timestamps false
    </match>

    # Attach local_resource_id for 'k8s_node' monitored resource.
    <filter **>
      @type record_transformer
      enable_ruby true
      <record>
        "logging.googleapis.com/local_resource_id" ${"k8s_node.#{ENV['NODE_NAME']}"}
      </record>
    </filter>

    # This section is exclusive for 'k8s_node' logs. These logs come with tags
    # that are neither 'stderr' or 'stdout'.
    # We use a separate output stanza for 'k8s_container' logs with a larger
    <match **>
      @type google_cloud

      detect_json true
      enable_monitoring true
      monitoring_type prometheus
      # Allow entries from multiple system logs to be sent in the same request.
      split_logs_by_tag false
      detect_subservice false
      buffer_type file
      buffer_path /var/log/k8s-fluentd-buffers/kubernetes.system.buffer
      buffer_queue_full_action block
      buffer_chunk_limit 512k
      buffer_queue_limit 2
      flush_interval 5s
      max_retry_wait 30
      disable_retry_limit
      num_threads 2
      use_grpc true
      k8s_cluster_name "#{ENV["CLUSTER_NAME"]}"
      k8s_cluster_location "#{ENV["CLUSTER_LOCATION"]}"
      adjust_invalid_timestamps false
    </match>
kind: ConfigMap
metadata:
  name: logging-agent-config
  namespace: stackdriver-agents

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: stackdriver-metadata-agent
    cluster-level: "true"
  name: stackdriver-metadata-agent-cluster-level
  namespace: stackdriver-agents
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stackdriver-metadata-agent
      cluster-level: "true"
  template:
    metadata:
      labels:
        app: stackdriver-metadata-agent
        cluster-level: "true"
    spec:
      containers:
      - env:
        - name: CLUSTER_NAME
          valueFrom:
            configMapKeyRef:
              name: cluster-config
              key: cluster_name
        - name: CLUSTER_LOCATION
          valueFrom:
            configMapKeyRef:
              name: cluster-config
              key: cluster_location
        - name: GOOGLE_APPLICATION_CREDENTIALS
          valueFrom:
            configMapKeyRef:
              name: google-cloud-config
              key: credentials_path
        - name: PROMETHEUS_PORT
          value: "8888"
        args:
        - -logtostderr
        - -v=1
        image: gcr.io/stackdriver-agents/metadata-agent-go:1.0.1
        imagePullPolicy: IfNotPresent
        name: metadata-agent
        resources:
          requests:
            cpu: 40m
            memory: 50Mi
        ports:
        - name: metadata-agent
          containerPort: 8888
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/google-cloud/
          name: google-cloud-config
        - mountPath: /etc/ssl/certs
          name: ssl-certs
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: metadata-agent
      serviceAccountName: metadata-agent
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      terminationGracePeriodSeconds: 5
      volumes:
      - configMap:
          defaultMode: 420
          name: google-cloud-config
        name: google-cloud-config
      - hostPath:
          path: /etc/ssl/certs
          type: Directory
        name: ssl-certs
  strategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

---

Could it be that the new agent is using some beta API that one needs to get whitelisted for first? (Just wondering what the v1beta3 refers to in the PR that seemingly introduced this bug.)

@bmoyles0117
Contributor

Can you please update -v=1 to -v=2? That way we can see the payloads that are being sent and ensure that the project ID, cluster name, and cluster location are being sent properly. Once you've done this, please send an example line that includes the payload being sent.
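For example, you could edit the cluster-level deployment and change the flag, then tail the new pod's logs (the deployment name and label come from the config you posted above):

kubectl -n stackdriver-agents edit deployment stackdriver-metadata-agent-cluster-level
# change "- -v=1" to "- -v=2" under args, save, and wait for the pod to be recreated
kubectl -n stackdriver-agents logs -l app=stackdriver-metadata-agent --tail=50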

@geekflyer
Author

Hi, after increasing the log level, this is the error I'm getting:

I0320 12:18:38.054715       1 resource_metadata.go:125] Sending metadata payload - {"name":"//container.googleapis.com/projects/redacted-production/locations//clusters//k8s/namespaces/predict-v3-prod/extensions/replicasets/resolve-org-265-8674d8f658","type":"io.k8s.extensions.ReplicaSet","location":"//cloud.google.com/locations/locations/","state":"EXISTS","eventTime":"2019-03-20T12:18:38Z","views":{"v1beta1":{"schemaName":"//container.googleapis.com/resourceTypes/io.k8s.extensions.ReplicaSet/versions/v1beta1","stringContent":"{\"metadata\":{\"name\":\"resolve-org-265-8674d8f658\",\"namespace\":\"predict-v3-prod\",\"selfLink\":\"/apis/extensions/v1beta1/namespaces/predict-v3-prod/replicasets/resolve-org-265-8674d8f658\",\"uid\":\"bf28aecf-05b8-11e9-823c-42010a8a007b\",\"resourceVersion\":\"43657041\",\"generation\":2,\"creationTimestamp\":\"2018-12-22T07:11:05Z\",\"labels\":{\"app\":\"resolve\",\"orgId\":\"265\",\"pod-template-hash\":\"4230849214\",\"version\":\"2018-12-21-224558\"},\"annotations\":{\"deployment.kubernetes.io/desired-replicas\":\"1\",\"deployment.kubernetes.io/max-replicas\":\"2\",\"deployment.kubernetes.io/revision\":\"2\"},\"ownerReferences\":[{\"apiVersion\":\"extensions/v1beta1\",\"kind\":\"Deployment\",\"name\":\"resolve-org-265\",\"uid\":\"44ffb718-042d-11e9-823c-42010a8a007b\",\"controller\":true,\"blockOwnerDeletion\":true}]},\"spec\":{\"replicas\":0,\"selector\":{\"matchLabels\":{\"app\":\"resolve\",\"orgId\":\"265\",\"pod-template-hash\":\"4230849214\"}},\"template\":{\"metadata\":{\"creationTimestamp\":null,\"labels\":{\"app\":\"resolve\",\"orgId\":\"265\",\"pod-template-hash\":\"4230849214\",\"version\":\"2018-12-21-224558\"}},\"spec\":{\"containers\":[{\"name\":\"resolve\",\"image\":\"gcr.io/redacted-production/resolve-predict-server:07953530\",\"args\":[\"--port=80\",\"--remote_model_dir=gs://redacted-pipeline\",\"--initial_model_version=2018-12-21-224558\",\"--local_model_dir=/data\"],\"ports\":[{\"name\":\"http\",\"containerPort\":80,\"protocol\":\"TCP\"}],\"env\":[{\"name\":\"USE_STACKDRIVER_TRACING\",\"value\":\"True\"},{\"name\":\"ORG_ID\",\"value\":\"265\"},{\"name\":\"REMOTE_MODEL_DIR\",\"value\":\"gs://redacted-pipeline\"},{\"name\":\"INITIAL_MODEL_VERSION\",\"value\":\"2018-12-21-224558\"},{\"name\":\"PORT\",\"value\":\"80\"},{\"name\":\"NUM_CPUS\",\"value\":\"500m\"}],\"resources\":{\"requests\":{\"cpu\":\"500m\",\"memory\":\"2000Mi\"}},\"livenessProbe\":{\"httpGet\":{\"path\":\"/health\",\"port\":\"http\",\"scheme\":\"HTTP\"},\"initialDelaySeconds\":5,\"timeoutSeconds\":1,\"periodSeconds\":60,\"successThreshold\":1,\"failureThreshold\":3},\"readinessProbe\":{\"httpGet\":{\"path\":\"/health\",\"port\":\"http\",\"scheme\":\"HTTP\"},\"initialDelaySeconds\":5,\"timeoutSeconds\":1,\"periodSeconds\":20,\"successThreshold\":1,\"failureThreshold\":3},\"lifecycle\":{\"postStart\":{\"exec\":{\"command\":[\"/bin/sh\",\"-c\",\"sleep 10 \\u0026\\u0026 curl -X POST http://localhost:80/generic_scikit/resolve -H 'cache-control: no-cache' -H 'content-type:application/json' -d '{\\\"limit\\\":2,\\\"org_id\\\": 265, \\\"query\\\": \\\"how do I reset my password?\\\"}' || true\"]}}},\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"}],\"restartPolicy\":\"Always\",\"terminationGracePeriodSeconds\":30,\"dnsPolicy\":\"ClusterFirst\",\"securityContext\":{},\"schedulerName\":\"default-scheduler\"}}},\"status\":{\"replicas\":0,\"observedGeneration\":2}}"}}}
I0320 12:18:38.060321       1 resource_metadata.go:151] Received response - 400 Bad Request - {
  "error": {
    "code": 400,
    "message": "Request contains an invalid argument.",
    "status": "INVALID_ARGUMENT"
  }
}
W0320 12:18:38.060359       1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request

@bmoyles0117
Contributor

bmoyles0117 commented Mar 20, 2019

This is an oversight in the documentation on our part: the metadata agent no longer uses the GCE metadata server to automatically discover the cluster's name and location.

I confirmed this by looking at the "name" property in the payload you sent.

//container.googleapis.com/projects/redacted-production/locations//clusters//k8s/namespaces/predict-v3-prod/extensions/replicasets/resolve-org-265-8674d8f658
                                                                  ^ empty cluster location
                                                                            ^ empty cluster name

Please set the values for CLUSTER_NAME and CLUSTER_LOCATION to match what you see inside of the Google Cloud Console for this cluster.

CLUSTER_NAME=[CLUSTER_NAME]
CLUSTER_LOCATION=[CLUSTER_LOCATION]

kubectl create configmap -o yaml --dry-run \
  --namespace=stackdriver-agents \
  cluster-config \
  --from-literal="cluster_name=${CLUSTER_NAME}" \
  --from-literal="cluster_location=${CLUSTER_LOCATION}" \
  | kubectl apply -f -

After you run this command, you will need to delete the metadata agent pods, which will spawn new pods so that the configmap gets loaded into each pod. You can do this by running:

kubectl delete pods -n stackdriver-agents -l app=stackdriver-metadata-agent
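To double-check that the new values were picked up, something like this should show the cluster_name and cluster_location you just set:

kubectl -n stackdriver-agents get configmap cluster-config -o yaml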

@geekflyer
Author

geekflyer commented Mar 20, 2019

What does the location have to be set to? The cluster's zone or region?

@bmoyles0117
Contributor

It could be either, for example us-east1-d or us-east1, as long as it's a valid GCP zone or region.
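If in doubt, the LOCATION column in the output of the following (for example) matches what the Cloud Console shows for the cluster:

gcloud container clusters list --project [PROJECT_ID]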

@bmoyles0117
Contributor

Can you confirm that this issue has been fixed for you?

@geekflyer
Author

Hi, I just checked again. The error message actually disappeared after setting those values via the configmap. However, none of the metadata (namespaces, pods, etc.) shows up anymore in the Stackdriver Kubernetes Monitoring UI.

This is the only log output from the metadata agent (no further lines appear, even after hours):

➜  ~ k logs stackdriver-metadata-agent-cluster-level-78599b584-kg8d2 
I0323 00:51:23.161614       1 log_spam.go:42] Command line arguments:
I0323 00:51:23.161680       1 log_spam.go:44]  argv[0]: '/app/cloud/monitoring/agents/k8s_metadata/k8s_metadata'
I0323 00:51:23.161686       1 log_spam.go:44]  argv[1]: '-logtostderr'
I0323 00:51:23.161690       1 log_spam.go:44]  argv[2]: '-v=1'
I0323 00:51:23.161695       1 log_spam.go:46] Process id 1
I0323 00:51:23.161723       1 log_spam.go:50] Current working directory /app
I0323 00:51:23.161778       1 log_spam.go:52] Built on Feb 26 22:27:49 (1551220069)
 at [email protected]:/google/src/cloud/dbtucker/clean/google3
 as //cloud/monitoring/agents/k8s_metadata:k8s_metadata
 with gc go1.12rc1 for linux/amd64
 from changelist 235793440 with version map 235793440 default { // } in a unknown client based on //depot/google3 in CitC workspace dbtucker:clean:3503:citc at snapshot 2
Build tool: Blaze, release blaze-2019.02.21-2 (mainline @234891966)
Build target: //cloud/monitoring/agents/k8s_metadata:k8s_metadata
W0323 00:51:23.165823       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0323 00:51:23.166217       1 main.go:74] Initiating watch for { v1 nodes} resources
I0323 00:51:23.166320       1 main.go:74] Initiating watch for { v1 pods} resources
I0323 00:51:23.166376       1 main.go:74] Initiating watch for {batch v1beta1 cronjobs} resources
I0323 00:51:23.166416       1 main.go:74] Initiating watch for {apps v1 daemonsets} resources
I0323 00:51:23.166485       1 main.go:74] Initiating watch for {extensions v1beta1 daemonsets} resources
I0323 00:51:23.166533       1 main.go:74] Initiating watch for {apps v1 deployments} resources
I0323 00:51:23.166611       1 main.go:74] Initiating watch for {extensions v1beta1 deployments} resources
I0323 00:51:23.166665       1 main.go:74] Initiating watch for { v1 endpoints} resources
I0323 00:51:23.166736       1 main.go:74] Initiating watch for {extensions v1beta1 ingresses} resources
I0323 00:51:23.166781       1 main.go:74] Initiating watch for {batch v1 jobs} resources
I0323 00:51:23.166818       1 main.go:74] Initiating watch for { v1 namespaces} resources
I0323 00:51:23.166862       1 main.go:74] Initiating watch for {apps v1 replicasets} resources
I0323 00:51:23.166949       1 main.go:74] Initiating watch for {extensions v1beta1 replicasets} resources
I0323 00:51:23.166983       1 main.go:74] Initiating watch for { v1 replicationcontrollers} resources
I0323 00:51:23.167020       1 main.go:74] Initiating watch for { v1 services} resources
I0323 00:51:23.167051       1 main.go:74] Initiating watch for {apps v1 statefulsets} resources
I0323 00:51:23.167060       1 main.go:82] All resources are being watched, agent has started successfully

@bmoyles0117
Contributor

That amount of logging is expected. Does the project ID that was in the logs before match the project in which you're attempting to pull up the Kubernetes Monitoring UI? The project ID is extracted from the credentials.json file that is mounted into the container.
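For example, you can inspect which credentials path the agent is configured with, and (assuming the file exists at that path and cat is available in the image) which project_id that key contains; the exact file name below is an assumption based on the /etc/google-cloud/ mount in your config:

kubectl -n stackdriver-agents get configmap google-cloud-config -o yaml
kubectl -n stackdriver-agents exec [METADATA_AGENT_POD] -- cat /etc/google-cloud/credentials.json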

@geekflyer
Author

geekflyer commented Mar 26, 2019

Hey, not sure if that's what you were asking for, but here are a few things:

  1. There is no project ID mentioned anywhere in the agent logs after deploying the new agent.
  2. Logs in Google Stackdriver Logging are still showing up under the correct project and the correct cluster name, even with the new agent config.
  3. I'm not mounting a credentials.json into the container (is this another gap in the docs?). I can see that the YAML template contains a GOOGLE_APPLICATION_CREDENTIALS reference, which seemingly refers to some service account credentials, but I thought this setup was optional, especially since logging still works and the cluster nodes already have implicit OAuth scopes that should be sufficient to send their data to Stackdriver Monitoring:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          valueFrom:
            configMapKeyRef:
              name: google-cloud-config
              key: credentials_path

In case it actually is required to mount service account credentials into the pod, I'd also like to raise the concern that the pod spec refers to a ConfigMap instead of a Secret, which would be the better and safer choice here.
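For illustration only, a Secret-based setup might look roughly like this (the secret name and file path are just examples, not from the official configs):

kubectl -n stackdriver-agents create secret generic google-cloud-credentials \
  --from-file=credentials.json=/path/to/service-account-key.json
# The Deployment would then mount this Secret as a volume (secretName: google-cloud-credentials)
# instead of the google-cloud-config ConfigMap, and point GOOGLE_APPLICATION_CREDENTIALS
# at the mounted file.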

@tadeuszwojcik

Hi,
I've updated GKE to 1.12.6-gke.7 and I'm getting the same error over and over in the logs:

Failed to publish resource metadata: Unexpected response code from confluence: 403 Forbidden

In my case I'm using the default Stackdriver configs provided:

Stackdriver Logging	Enabled v2(beta)
Stackdriver Monitoring	Enabled v2(beta)

Is this a known issue?
Thanks

@igorpeshansky
Member

igorpeshansky commented Mar 27, 2019

@tadeuszwojcik Have you enabled the Stackdriver API in your project?

@tadeuszwojcik

@igorpeshansky thanks, I thought I had, but I must have missed enabling it for the new project. It's working great with the API enabled, thanks!

@lawliet89

lawliet89 commented Apr 1, 2019

I am running on GKE 1.12.6-gke.7 and since Friday ~ 19:00 UTC+8, I've been having the error Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request.

The Stackdriver API is enabled and logs are flowing; they are just missing the metadata field, since the stackdriver-metadata-agent-cluster-level pod is not working.

Based on API metrics, the Service account for GKE is causing a 100% HTTP 400 on the google.cloud.stackdriver.v1beta3.ResourceService.PublishResourceMetadata API call.

I am not sure where to find more logs to help with debugging. Please let me know.

@bmoyles0117
Contributor

Are you installing the metadata agent via GKE, or installing it manually yourself using these configs? If the latter, you'll need to configure the cluster name and cluster location as shown in #25 (comment).

@lawliet89

lawliet89 commented Apr 1, 2019

GKE. It was installed when the cluster was created, I assume. It used to work before, as older logs had the metadata.

@bmoyles0117
Contributor

Have you enabled the Stackdriver API, as mentioned in #25 (comment) ?

@bmoyles0117
Contributor

If you have enabled the Stackdriver API, have you signed up for Stackdriver in the project you've installed the cluster into? You could trigger this by clicking "Monitoring" in the menu inside of Google Cloud Console.

@lawliet89

Yes, I have. I can see a 100% error rate on the google.cloud.stackdriver.v1beta3.ResourceService.PublishResourceMetadata API call.

@bmoyles0117
Contributor

Can you please run this script, and paste the output?

#!/bin/bash

# This script inspects a GKE cluster's Stackdriver integration to help users
# understand how the agents were installed, and which resource model the agents
# are configured to send data to. The goal of this diagnostic script is to expand
# on details of Stackdriver workloads so that we can provide better support for
# our users.
#
# Example Usage:
# PROJECT=[PROJECT] CLUSTER_NAME=[CLUSTER_NAME] CLUSTER_LOCATION=[CLUSTER_LOCATION] bash diagnose-cluster.sh

set -e

if [[ -z "${PROJECT}" ]] || [[ -z "${CLUSTER_NAME}" ]] || [[ -z "${CLUSTER_LOCATION}" ]]; then
  echo "Example usage: PROJECT=[PROJECT] CLUSTER_NAME=[CLUSTER_NAME] CLUSTER_LOCATION=[CLUSTER_LOCATION] bash diagnose-cluster.sh"
  exit 1
fi

PROJECT="${PROJECT}"
CLUSTER_NAME="${CLUSTER_NAME}"
CLUSTER_LOCATION="${CLUSTER_LOCATION}"
LEGACY_RESOURCE_MODEL="gke"
NEW_RESOURCE_MODEL="k8s"

gcloud --project "${PROJECT}" container clusters get-credentials "${CLUSTER_NAME}" --zone "${CLUSTER_LOCATION}"

MONITORING_SERVICE="$(gcloud container clusters describe "${CLUSTER_NAME}" --zone "${CLUSTER_LOCATION}" --project "${PROJECT}" | grep monitoringService | cut -d ' ' -f 2)"
echo "============================================="
if [[ "${MONITORING_SERVICE}" == "monitoring.googleapis.com" ]]; then
  echo "GKE-Native Experience: Legacy Monitoring Experience (gke_container)"
else
  echo "GKE-Native Experience: BETA Monitoring Experience (k8s_cluster, k8s_node, k8s_pod, k8s_container)"
fi

echo "============================================="
echo "Kubectl Version"
kubectl version
echo "============================================="

TOTAL_NODES="$(kubectl get nodes --no-headers | wc -l)"
echo "Total Nodes: ${TOTAL_NODES}"
kubectl get nodes | awk '{$1=""; print $0}'

echo "============================================="

function get_workload_name {
  namespace="$1"
  workload_type="$2"
  workload_name="$3"

  kubectl -n "${namespace}" get "${workload_type}" | grep "${workload_name}" | awk '{print $1}'
}

function get_workload_image {
  namespace="$1"
  workload_type="$2"
  workload_name="$3"
  container_image="$4"

  if [[ -n "${workload_name}" ]]; then
    kubectl -n "${namespace}" get "${workload_type}" "${workload_name}" -o yaml | grep "image: " | grep "${container_image}" | awk '{print $2}'
  fi
}

function count_pods {
  namespace="$1"
  workload_name="$2"
  exclude="${3:-zzzzzzzzz}"

  if [[ -n "${workload_name}" ]]; then
    echo "$(kubectl -n "${namespace}" get pods --no-headers | grep "${workload_name}" | grep -v "${exclude}" | wc -l)"
  fi
}

function get_logging_agent_resource_model {
  namespace="$1"
  workload_name="$2"

  if [[ -n "${workload_name}" ]]; then
    config_name="$(kubectl -n "${namespace}" describe ds "${workload_name}" | grep -A 2 "ConfigMap" | grep "fluentd-gcp\|logging-agent-config" | grep "Name" | awk '{print $ ...
    logging_agent_resource_model="${LEGACY_RESOURCE_MODEL}"
    if [[ -n "$(kubectl -n "${namespace}" describe configmap "${config_name}" | grep "local_resource_id")" ]]; then
      logging_agent_resource_model="${NEW_RESOURCE_MODEL}"
    fi

    echo "${logging_agent_resource_model}"
  fi
}

function get_heapster_resource_model {
  namespace="$1"
  workload_name="$2"

  if [[ -n "${workload_name}" ]]; then
    heapster_resource_model="${LEGACY_RESOURCE_MODEL}"
    if [[ -n "$(kubectl describe deployment -n "${namespace}" "${workload_name}" | grep "use_new_resources=true")" ]]; then
      heapster_resource_model="${NEW_RESOURCE_MODEL}"
    fi

    echo "${heapster_resource_model}"
  fi
}

function dump_agent_info {
  workload_description="$1"
  workload_name="$2"
  workload_image="$3"
  total_pods="$4"
  resource_model="$5"

  echo "============================================="
  echo "Workload Description: ${workload_description}"
  if [[ -n "${workload_name}" ]]; then
    echo "       Workload Name: ${workload_name}"
    echo "      Workload Image: ${workload_image}"
    echo "          Total Pods: ${total_pods}"
    echo "      Resource Model: ${resource_model}"
  else
    echo "Not installed."
  fi
}

function diagnose_installation {
  NAMESPACE="$1"
  INSTALLATION_PATH="GKE-Native Installation"
  if [[ "${NAMESPACE}" == "stackdriver-agents" ]]; then
    INSTALLATION_PATH="Custom Installation"
  fi

  echo "============================================="
  echo "==="
  echo "=== Inspecting ${INSTALLATION_PATH}"
  echo "==="

  # Diagnose Logging Agent
  if [[ "${NAMESPACE}" == "kube-system" ]]; then
    LOGGING_AGENT_NAME="$(get_workload_name "${NAMESPACE}" "daemonset" "fluentd-gcp")"
    LOGGING_AGENT_IMAGE="$(get_workload_image "${NAMESPACE}" "daemonset" "${LOGGING_AGENT_NAME}" "fluentd\|logging-agent")"
  else
    # Stackdriver Configs
    LOGGING_AGENT_NAME="$(get_workload_name "${NAMESPACE}" "daemonset" "logging-agent")"
    LOGGING_AGENT_IMAGE="$(get_workload_image "${NAMESPACE}" "daemonset" "${LOGGING_AGENT_NAME}" "logging-agent")"
  fi
  LOGGING_AGENT_TOTAL_PODS="$(count_pods "${NAMESPACE}" "${LOGGING_AGENT_NAME}")"
  LOGGING_AGENT_RESOURCE_MODEL="$(get_logging_agent_resource_model "${NAMESPACE}" "${LOGGING_AGENT_NAME}")"

  dump_agent_info "Logging Agent" "${LOGGING_AGENT_NAME}" "${LOGGING_AGENT_IMAGE}" "${LOGGING_AGENT_TOTAL_PODS}" "${LOGGING_AGENT_RESOURCE_MODEL}"

  # Diagnose Metadata Agent (Node Level)
  METADATA_AGENT_NODE_LEVEL_NAME="$(get_workload_name "${NAMESPACE}" "daemonset" "metadata-agent")"
  METADATA_AGENT_NODE_LEVEL_IMAGE="$(get_workload_image "${NAMESPACE}" "daemonset" "${METADATA_AGENT_NODE_LEVEL_NAME}" "metadata-agent")"
  METADATA_AGENT_NODE_LEVEL_TOTAL_PODS="$(count_pods "${NAMESPACE}" "${METADATA_AGENT_NODE_LEVEL_NAME}" "cluster-level")"

  dump_agent_info "Metadata Agent (Node Level)" "${METADATA_AGENT_NODE_LEVEL_NAME}" "${METADATA_AGENT_NODE_LEVEL_IMAGE}" "${METADATA_AGENT_NODE_LEVEL_TOTAL_PODS}" "${NEW_RE ...
  # Diagnose Metadata Agent (Cluster Level)
  METADATA_AGENT_CLUSTER_LEVEL_NAME="$(get_workload_name "${NAMESPACE}" "deployment" "metadata-agent-cluster-level")"
  METADATA_AGENT_CLUSTER_LEVEL_IMAGE="$(get_workload_image "${NAMESPACE}" "deployment" "${METADATA_AGENT_CLUSTER_LEVEL_NAME}" "metadata-agent")"
  METADATA_AGENT_CLUSTER_LEVEL_TOTAL_PODS="$(count_pods "${NAMESPACE}" "${METADATA_AGENT_CLUSTER_LEVEL_NAME}")"

  dump_agent_info "Metadata Agent (Cluster Level)" "${METADATA_AGENT_CLUSTER_LEVEL_NAME}" "${METADATA_AGENT_CLUSTER_LEVEL_IMAGE}" "${METADATA_AGENT_CLUSTER_LEVEL_TOTAL_PODS ...
  # Heapster
  HEAPSTER_NAME="$(get_workload_name "${NAMESPACE}" "deployment" "heapster")"
  HEAPSTER_IMAGE="$(get_workload_image "${NAMESPACE}" "deployment" "${HEAPSTER_NAME}" "heapster")"
  HEAPSTER_TOTAL_PODS="$(count_pods "${NAMESPACE}" "${HEAPSTER_NAME}")"
  HEAPSTER_RESOURCE_MODEL="$(get_heapster_resource_model "${NAMESPACE}" "${HEAPSTER_NAME}")"

  dump_agent_info "Heapster" "${HEAPSTER_NAME}" "${HEAPSTER_IMAGE}" "${HEAPSTER_TOTAL_PODS}" "${HEAPSTER_RESOURCE_MODEL}"

  echo "============================================="
  echo "==="
  echo "=== Done Inspecting ${INSTALLATION_PATH}"
  echo "==="
  echo "============================================="
}

diagnose_installation "kube-system"
diagnose_installation "stackdriver-agents"

@lawliet89

lawliet89 commented Apr 2, 2019

Line 81 of your script is truncated.
(also 154 and 160)

Fetching cluster endpoint and auth data.
kubeconfig entry generated for cluster-name.
=============================================
GKE-Native Experience: BETA Monitoring Experience (k8s_cluster, k8s_node, k8s_pod, k8s_container)
=============================================
Kubectl Version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-gke.7", GitCommit:"aeaa96020ec0614a8773799058c3b8d58c19b9ff", GitTreeState:"clean", BuildDate:"2019-03-13T11:22:57Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
=============================================
Total Nodes: 11
 STATUS ROLES AGE VERSION
 Ready <none> 24d v1.12.6-gke.7
 Ready <none> 25d v1.12.6-gke.7
 Ready <none> 25d v1.12.6-gke.7
 Ready <none> 25d v1.12.6-gke.7
 Ready <none> 4d17h v1.12.6-gke.7
 Ready <none> 11d v1.12.6-gke.7
 Ready <none> 18h v1.12.6-gke.7
 Ready <none> 4d17h v1.12.6-gke.7
 Ready <none> 6m26s v1.12.6-gke.7
 Ready <none> 6m25s v1.12.6-gke.7
 Ready <none> 6m24s v1.12.6-gke.7
=============================================
./metadata.sh: line 80: unexpected EOF while looking for matching `''

@lawliet89

lawliet89 commented Apr 2, 2019

I think I have managed to fix the truncated parts and this is the output:

Fetching cluster endpoint and auth data.
kubeconfig entry generated for <cluster-name>.
=============================================
GKE-Native Experience: BETA Monitoring Experience (k8s_cluster, k8s_node, k8s_pod, k8s_container)
=============================================
Kubectl Version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-gke.7", GitCommit:"aeaa96020ec0614a8773799058c3b8d58c19b9ff", GitTreeState:"clean", BuildDate:"2019-03-13T11:22:57Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
=============================================
Total Nodes: 10
 STATUS ROLES AGE VERSION
 Ready <none> 24d v1.12.6-gke.7
 Ready <none> 25d v1.12.6-gke.7
 Ready <none> 25d v1.12.6-gke.7
 Ready <none> 25d v1.12.6-gke.7
 Ready <none> 4d17h v1.12.6-gke.7
 Ready <none> 11d v1.12.6-gke.7
 Ready <none> 19h v1.12.6-gke.7
 Ready <none> 4d18h v1.12.6-gke.7
 Ready <none> 17m v1.12.6-gke.7
 Ready <none> 17m v1.12.6-gke.7
=============================================
=============================================
===
=== Inspecting GKE-Native Installation
===
Error from server (NotFound): configmaps "Name:" not found
=============================================
Workload Description: Logging Agent
       Workload Name: fluentd-gcp-v3.1.1
      Workload Image: gcr.io/stackdriver-agents/stackdriver-logging-agent:1.6.4
          Total Pods: 10
      Resource Model: gke
=============================================
Workload Description: Metadata Agent (Node Level)
Not installed.
=============================================
Workload Description: Metadata Agent (Cluster Level)
       Workload Name: stackdriver-metadata-agent-cluster-level
      Workload Image: gcr.io/stackdriver-agents/metadata-agent-go:1.0.0
          Total Pods: 1
      Resource Model: 
=============================================
Workload Description: Heapster
       Workload Name: heapster-v1.6.1
      Workload Image: gcr.io/stackdriver-agents/heapster-amd64:v1.6.1
          Total Pods: 1
      Resource Model: k8s
=============================================
===
=== Done Inspecting GKE-Native Installation
===
=============================================
=============================================
===
=== Inspecting Custom Installation
===
No resources found.
=============================================
Workload Description: Logging Agent
Not installed.
No resources found.
=============================================
Workload Description: Metadata Agent (Node Level)
Not installed.
No resources found.
=============================================
Workload Description: Metadata Agent (Cluster Level)
Not installed.
No resources found.
=============================================
Workload Description: Heapster
Not installed.
=============================================
===
=== Done Inspecting Custom Installation
===
=============================================

Looking at the logs of the stackdriver-metadata-agent-cluster-level pods, the same warning just repeats:

W0402 02:31:18.052931       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:18.101105       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:18.164427       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:18.226599       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:18.360010       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:20.109427       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:20.171217       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:20.232936       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:20.314705       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:20.370073       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:21.864452       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:22.120965       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:22.182260       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:22.204739       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:22.267195       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:22.380561       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:22.642979       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:22.721457       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:23.056181       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:23.901266       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:24.120229       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:24.180051       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:24.238140       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:24.391336       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:24.737276       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:24.923219       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:25.350325       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:26.129378       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:26.187427       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:26.242822       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:26.400924       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:26.401077       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:26.474784       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:27.113676       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:28.063960       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:28.139305       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:28.193968       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:28.252132       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0402 02:31:28.412887       1 kubernetes.go:99] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request

@bmoyles0117
Contributor

To clear up an assumption: metadata is no longer written to the logs via the metadata labels; the metadata properties are now embedded directly into the log entry. So you should still be able to query by pod labels using values that exist in the log entry.
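
For illustration only, a hedged sketch of querying container logs by a pod label embedded in the entry itself using `gcloud logging read`; the cluster name, namespace, and `app` label key are placeholders, and the `labels."k8s-pod/..."` key layout is an assumption about how the new resource model surfaces pod labels:

```
# Hypothetical example: filter k8s_container log entries by a pod label
# embedded in the entry (placeholders: my-cluster, default, my-app).
gcloud logging read '
  resource.type="k8s_container"
  AND resource.labels.cluster_name="my-cluster"
  AND resource.labels.namespace_name="default"
  AND labels."k8s-pod/app"="my-app"
' --limit=10 --format=json
```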

Did you ensure that a Stackdriver Account exists by clicking "Monitoring" inside of your Google Cloud project?

@lawliet89

lawliet89 commented Apr 2, 2019

> To clear up an assumption: metadata is no longer written to the logs via the metadata labels; the metadata properties are now embedded directly into the log entry. So you should still be able to query by pod labels using values that exist in the log entry.

Correct. It's just that when you click on the container logs link from the Console, it queries the metadata labels, which do not exist, so nothing shows up.

> Did you ensure that a Stackdriver Account exists by clicking "Monitoring" inside of your Google Cloud project?

Yes. I have created the necessary workspace too.

@bmoyles0117
Contributor

At this point your best avenue is to submit a ticket to Cloud Support so that we can triage your clusters directly; everything seems to be configured properly, yet you're still getting errors. My only remaining guess is that you have a custom network associated with your cluster instead of the default network provided by GKE clusters.
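
If it helps to rule that out, a minimal sketch for checking which network and subnetwork a cluster is attached to; CLUSTER_NAME and ZONE are placeholders:

```
# Show the VPC network and subnetwork the cluster uses
# (placeholders: CLUSTER_NAME, ZONE).
gcloud container clusters describe CLUSTER_NAME \
  --zone=ZONE \
  --format="value(network,subnetwork)"
```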

@bmoyles0117
Contributor

@lawliet89 what is the location for your cluster? Is it a GCP region (such as us-east1) or a GCP zone (such as us-east1-d)?
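
As a quick check, a minimal sketch that lists clusters with their locations; a location like us-east1 is a region (regional cluster), while us-east1-d is a zone (zonal cluster):

```
# List clusters and their locations; a region (e.g. us-east1) means a
# regional cluster, a zone (e.g. us-east1-d) means a zonal cluster.
gcloud container clusters list --format="table(name,location)"
```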

@lawliet89

Alright. It happened to two GKE clusters at the same time. It's bizarre.

It's a regional cluster. (ap-southeast1)

@bmoyles0117
Contributor

Alright, the bug is confirmed: this is impacting regional clusters due to a bug in how we're sending their identity to the backend. We have a fix in place, but it may take a few weeks to actually roll out. The side effect of this is log spam from the metadata agent, as you've seen, but the Kubernetes Monitoring UI should still be visible since this data is collected from two separate sources.

@MrBlaise

MrBlaise commented May 17, 2019

@bmoyles0117

Hi, I have enabled the Stackdriver API and am using a custom service account with the proper privileges for my Kubernetes cluster, yet I still get `Failed to publish resource metadata: Unexpected response code from Stackdriver API: 403 Forbidden`.
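
For what it's worth, a minimal sketch for double-checking that the Stackdriver API is actually enabled on the project; PROJECT_ID is a placeholder:

```
# List enabled services and look for stackdriver.googleapis.com
# (placeholder: PROJECT_ID).
gcloud services list --enabled --project=PROJECT_ID | grep stackdriver
```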

I have fixed and run your script; it had the following output:

Fetching cluster endpoint and auth data.
kubeconfig entry generated for platform.
=============================================
GKE-Native Experience: BETA Monitoring Experience (k8s_cluster, k8s_node, k8s_pod, k8s_container)
=============================================
Kubectl Version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-19T22:12:47Z", GoVersion:"go1.12.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.7-gke.10", GitCommit:"8d9b8641e72cf7c96efa61421e87f96387242ba1", GitTreeState:"clean", BuildDate:"2019-04-12T22:59:24Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
=============================================
Total Nodes:        3
 STATUS ROLES AGE VERSION
 Ready <none> 94d v1.12.7-gke.10
 Ready <none> 17d v1.12.7-gke.10
 Ready <none> 17d v1.12.7-gke.10
=============================================
=============================================
===
=== Inspecting GKE-Native Installation
===
=============================================
Workload Description: Logging Agent
       Workload Name: fluentd-gcp-v3.1.1
      Workload Image: gcr.io/stackdriver-agents/stackdriver-logging-agent:1.6.6
          Total Pods:        3
      Resource Model: k8s
=============================================
Workload Description: Metadata Agent (Node Level)
Not installed.
=============================================
Workload Description: Metadata Agent (Cluster Level)
       Workload Name: stackdriver-metadata-agent-cluster-level
      Workload Image: gcr.io/stackdriver-agents/metadata-agent-go:1.0.2
          Total Pods:        1
      Resource Model:
=============================================
Workload Description: Heapster
       Workload Name: heapster-v1.6.1
      Workload Image: gcr.io/stackdriver-agents/heapster-amd64:v1.6.1
          Total Pods:        1
      Resource Model: k8s
=============================================
===
=== Done Inspecting GKE-Native Installation
===
=============================================
=============================================
===
=== Inspecting Custom Installation
===
No resources found.
=============================================
Workload Description: Logging Agent
Not installed.
No resources found.
=============================================
Workload Description: Metadata Agent (Node Level)
Not installed.
No resources found.
=============================================
Workload Description: Metadata Agent (Cluster Level)
Not installed.
No resources found.
=============================================
Workload Description: Heapster
Not installed.
=============================================
===
=== Done Inspecting Custom Installation
===
=============================================

What could be the problem?

EDIT:

I am seeing a 100% error rate on google.cloud.stackdriver.v1beta3.ResourceService.PublishResourceMetadata.

I have a stackdriver workspace connected to my project.

I have added the following roles to my kubernetes service account:

    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
    "roles/cloudtrace.agent"

My Kubernetes cluster is zonal and private (with master enabled), on a custom VPC network.
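
For illustration, a hedged sketch for listing the project-level roles currently bound to that service account; PROJECT_ID and SA_EMAIL are placeholders:

```
# List project-level roles bound to the node service account
# (placeholders: PROJECT_ID, SA_EMAIL).
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:SA_EMAIL" \
  --format="table(bindings.role)"
```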

@igorpeshansky
Member

Your custom service account also needs to have the "Stackdriver Resource Metadata Writer" ("roles/stackdriver.resourceMetadata.writer") role. Yes, we know it's not properly documented, and we are taking steps to rectify this. Sorry for the issues.
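
If useful, a minimal sketch of granting that role to the node service account; PROJECT_ID and SA_EMAIL are placeholders:

```
# Grant the Stackdriver Resource Metadata Writer role
# (placeholders: PROJECT_ID, SA_EMAIL).
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/stackdriver.resourceMetadata.writer"
```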

@MrBlaise

@igorpeshansky Thanks a lot! I have been debugging this for ages!

Miradorn added a commit to Miradorn/terraform-google-gke that referenced this issue Jun 27, 2019
There is an undocumented role needed (see Stackdriver/kubernetes-configs#25 (comment)) to use the new Stackdriver for Kubernetes; otherwise this happens:

```
W0627 11:44:41.308933       1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:41.408116       1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:41.635170       1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:42.107927       1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W0627 11:44:42.308177       1 kubernetes.go:113] Failed to publish resource metadata: rpc error: code = PermissionDenied desc = The caller does not have permission
W06
```