datadog: Exporter logs meaningless errors when run in (default) os-less container images, has excessively noisy runtime logs, does not wrap gohai package logger in otel logger #29741

Closed
ringerc opened this issue Dec 11, 2023 · 5 comments · Fixed by #31703
Labels
bug Something isn't working exporter/datadog Datadog components priority:p3 Lowest

Comments

@ringerc

ringerc commented Dec 11, 2023

Component(s)

exporter/datadog

What happened?

Description

The datadog exporter produces a variety of useless, confusing and meaningless errors when run with the default os-less container images for the OpenTelemetry collector.

These are emitted by the exporter itself, or by the "gohai" packages within the datadog agent that the exporter calls. They are emitted at info level (default) despite effectively being meaningless noise. One of them ignores the configured log level and log formatter too.

They should be caught and suppressed by the exporter, as they are expected, they are not actionable, and they are effectively meaningless.

These errors (with irrelevant fields trimmed for brevity) include:

{
  "level": "info",
  "caller": "gohai/gohai.go:35",
  "msg": "Failed to retrieve filesystem metadata",
  "kind": "exporter",
  "data_type": "metrics",
  "name": "datadog/datadog",
  "error": "df failed to collect filesystem data: %!s(<nil>)"
}
{
  "level": "info",
  "caller": "gohai/gohai.go:54",
  "msg": "Failed to retrieve platform metadata",
  "kind": "exporter",
  "data_type": "metrics",
  "name": "datadog/datadog",
  "error": "exec: \"uname\": executable file not found in $PATH"
}

There is also this error, emitted by the gohai libraries, which ignores the OpenTelemetry logger's configured log level and format. It is written as a non-JSON line even when JSON logging is on, and it appears at info level and above even though it is marked "debug":

1700800615862591810 [Debug] Error fetching info for pid 1: open /etc/passwd: no such file or directory

This message has no business being a warning - it should be debug level:

{
  "level": "warn",
  "caller": "[email protected]/zaplogger.go:49",
  "msg": "Trace writer initialized (climit=100 qsize=1)",
  "kind": "exporter",
  "data_type": "metrics",
  "name": "datadog/datadog"
}

These messages are less clearly spurious, but could possibly also be downgraded to debug since they're just runtime noise:

{
  "level": "info",
  "caller": "[email protected]/zaplogger.go:38",
  "msg": "Starting Agent with processor trace buffer of size 0",
  "kind": "exporter",
  "data_type": "metrics",
  "name": "datadog/datadog"
}
{
  "level": "info",
  "caller": "[email protected]/zaplogger.go:38",
  "msg": "Receiver configured with 2 decoders and a timeout of 0ms",
  "kind": "exporter",
  "data_type": "metrics",
  "name": "datadog/datadog"
}

Steps to Reproduce

Run the datadog exporter demo configuration in a Docker or k8s container. In other words, run the tutorial for the datadog opentelemetry exporter, using the image otel/opentelemetry-collector-contrib:0.90.1.

Expected Result

At info level, I expect minimal, meaningful log output.

None of the messages mentioned above should be logged.

Any genuine errors or informational messages from the gohai package should be wrapped via a log adapter so that they respect the otel collector's logging configuration instead of writing a different log format and ignoring the log level.

Actual Result

The excessively noisy logs, ignored log levels, and incorrectly formatted logs reported above.

Collector version

0.90.1

Environment information

Environment

Run in any k8s environment or as a Docker container, using the default configs, with the container image otel/opentelemetry-collector-contrib:0.90.1. Read-only bind-mount the host /proc to /host/proc in your container definition.

OpenTelemetry Collector configuration

# My config is huge so I'll provide the relevant excerpts and a dummy receiver
receivers:
  hostmetrics:
    root_path: /host
exporters:
  datadog/datadog:
    api:
      key: ${env:DD_API_KEY}
      site: ${env:DD_SITE}
    metrics:
      resource_attributes_as_tags: true
      instrumentation_scope_metadata_as_tags: false
    host_metadata:
      enabled: true
      hostname_source: first_resource
service:
  telemetry:
    # that gohai [Debug] log line ignores this config
    logs:
      encoding: "json"
      level: "info"
  pipelines:
    metrics/datadog:
      # real receivers trimmed because they are irrelevant
      receivers: ["hostmetrics"]
      # real processors trimmed because they are irrelevant
      processors: []
      exporters: ["datadog/datadog"]

Log output

{"level":"info","ts":1702327010.2295797,"caller":"provider/provider.go:59","msg":"Resolved source","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog","provider":"ec2","source":{"Kind":"host","Identifier":"i-0ad96d927c1802a8e"}}
{"level":"info","ts":1702327010.2296593,"caller":"[email protected]/zaplogger.go:38","msg":"Starting Agent with processor trace buffer of size 0","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog"}
{"level":"info","ts":1702327010.2298276,"caller":"[email protected]/zaplogger.go:38","msg":"Receiver configured with 2 decoders and a timeout of 0ms","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog"}
{"level":"warn","ts":1702327010.2301986,"caller":"[email protected]/zaplogger.go:49","msg":"Trace writer initialized (climit=100 qsize=1)","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog"}
{"level":"info","ts":1702327010.2307827,"caller":"clientutil/api.go:40","msg":"Validating API key.","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog"}
{"level":"info","ts":1702327010.6152675,"caller":"clientutil/api.go:44","msg":"API key validation successful.","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog"}
{"level":"info","ts":1702327020.2504048,"caller":"gohai/gohai.go:35","msg":"Failed to retrieve filesystem metadata","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog","error":"df failed to collect filesystem data: %!s(<nil>)"}
{"level":"info","ts":1702327020.2510517,"caller":"gohai/gohai.go:54","msg":"Failed to retrieve platform metadata","service":"otel-node-agent","kind":"exporter","data_type":"metrics","name":"datadog/datadog","error":"exec: \"uname\": executable file not found in $PATH"}

Additional context

See also

Some of the errors are from the datadog exporter:

because the exporter incorrectly assumes it's in a "rich" full-os container with a shell, external commands etc.

The [Debug] Error one comes from https://github.com/DataDog/datadog-agent/blob/45c774dba115b395c1b09a94fcd428f49d6d440a/pkg/gohai/processes/gops/process_info.go#L60

@ringerc ringerc added bug Something isn't working needs triage New item requiring triage labels Dec 11, 2023
@github-actions github-actions bot added the exporter/datadog Datadog components label Dec 11, 2023
@ringerc ringerc changed the title datadog: Exporter logs meaningless errors when run in (default) os-less container images, has excessively noisy runtime logs datadog: Exporter logs meaningless errors when run in (default) os-less container images, has excessively noisy runtime logs, does not wrap gohai package logger in otel logger Dec 11, 2023
@brettplarson

This issue seems to be about fixing the log messages. More broadly, should this gohai package be included at all? Or should this feature be removed? Just wondering whether there is a fix beyond correcting the logging. Sorry if this is being addressed in another ticket.

@mx-psi
Member

mx-psi commented Jan 10, 2024

> Or should this feature be removed?

Together with other vendors we are working on improving OpenTelemetry semantic conventions related to system and infra monitoring and making sure the resource detection processor and host metrics receiver can fetch this information. This will eventually allow us to remove the gohai detection, once we are closer to feature parity with gohai for bare-metal runs. While I agree gohai is not very useful in containerized runs with a from-scratch container, such as the official containers, it is useful to other users, so we can't remove gohai just yet :)

For the specifics on the work we are doing you can see for example #30306, #29588, #24542, #24450, #22940 and more generally the work of the System Semantic Conventions Working Group (see board). Datadog-specific docs on how to best leverage these attributes are not yet available, but improving this is also on our roadmap.

@ringerc
Author

ringerc commented Jan 31, 2024

Good to know. The Error fetching info for pid 1 is particularly frustrating though, given that it is not properly wrapped in the otel collector logger.

It appears that gohai uses log "github.com/cihub/seelog" and the datadog exporter collector components that import it do not add a suitable logging adapter, or propagate the collector log level to the "seelog" logger. I'm unfamiliar with seelog and not sufficiently familiar with the otel collector internal logs to make a PR for it in a timely manner.

The other listed errors should IMO be demoted to "debug" level.

@mx-psi
Member

mx-psi commented Mar 12, 2024

This will be fixed by #31703, which reduces these logs to debug level. We can keep track of the non-wrapped pid log in DataDog/datadog-agent/issues/21487 and #31193.

I also wanted to mention on this issue that the Datadog-specific docs I mentioned here #29741 (comment) are now available: https://docs.datadoghq.com/opentelemetry/schema_semantics/host_metadata/

mx-psi added a commit that referenced this issue Mar 12, 2024
**Description:** Demote gohai logs to debug level

**Link to tracking Issue:** Fixes #29741