Metricbeat generates large and repetitive permission error logs #41890

Closed
cmacknz opened this issue Dec 4, 2024 · 12 comments · Fixed by elastic/elastic-agent-system-metrics#195
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@cmacknz
Member

cmacknz commented Dec 4, 2024

Here is an example:

{"log.level":"error","@timestamp":"2024-12-03T20:27:50.168Z","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch","file.name":"module/wrapper.go","file.line":266},"message":"Error fetching data for metricset system.process: Not enough privileges to fetch information: Not enough privileges to fetch information: non fatal error fetching PID some info for 1, metrics are valid, but partial: non-fatal error fetching PID metrics for 1, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: read /hostfs/proc/1/io: permission denied\nnon fatal error fetching PID some info for 2, metrics are valid, but partial: non-fatal error fetching PID metrics for 2, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: read /hostfs/proc/2/io: permission denied\nnon fatal error fetching PID some info for 3, metrics are valid, but partial: non-fatal error fetching PID metrics for 3, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: read /hostfs/proc/3/io: permission denied\nnon fatal error fetching PID some info for 4, metrics are valid, but partial: non-fatal error fetching PID......

Here is an additional screenshot indicating that these errors can become ludicrously large in some circumstances:

Image

It looks like we might be reporting the same error for every single PID on the system. We need to report the general error a single time instead.

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Dec 4, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@fearful-symmetry
Contributor

Here, in wrapper.go:

			reporter.V2().Error(err)
			if errors.As(err, &mb.PartialMetricsError{}) {
				// mark module as running if metrics are partially available and display the error message
				msw.module.UpdateStatus(status.Running, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
			} else {
				// mark it as degraded for any other issue encountered
				msw.module.UpdateStatus(status.Degraded, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
			}
			logp.Err("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.Name(), err)

I'm guessing that should be logp.Debug, not logp.Err.

@VihasMakwana
Contributor

I agree.

If we have partial metrics, log it at debug level.
Otherwise, log it at error level.

@VihasMakwana
Contributor

Even if we're logging at a debug level, the log is huge. I'll work on narrowing it down. Maybe something like:

Failed to fetch all metrics for processes: [1,2,3,...]

@cmacknz
Member Author

cmacknz commented Dec 5, 2024

It is always possible for that error to be every PID on the machine. So maybe something like Failed to fetch all metrics for PID X and N other PIDs if there was more than one.
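
As a rough illustration of that message format (the function name `summarizeFailedPIDs` is hypothetical, and this is not the wording that was eventually merged), the summary could be built like this:

```go
package main

import "fmt"

// summarizeFailedPIDs (hypothetical helper) collapses a list of PIDs whose
// metrics could not be fully fetched into one short message, so the log
// line stays bounded no matter how many processes are affected.
func summarizeFailedPIDs(pids []int) string {
	switch len(pids) {
	case 0:
		return ""
	case 1:
		return fmt.Sprintf("Failed to fetch all metrics for PID %d", pids[0])
	case 2:
		return fmt.Sprintf("Failed to fetch all metrics for PID %d and 1 other PID", pids[0])
	default:
		return fmt.Sprintf("Failed to fetch all metrics for PID %d and %d other PIDs", pids[0], len(pids)-1)
	}
}

func main() {
	fmt.Println(summarizeFailedPIDs([]int{1}))
	fmt.Println(summarizeFailedPIDs([]int{1, 2, 3, 4}))
}
```

The message length is constant regardless of how many PIDs fail, which addresses the case where every PID on the machine hits the same permission error.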

VihasMakwana added a commit to elastic/elastic-agent-system-metrics that referenced this issue Dec 12, 2024
This is to reduce the log pollution in default case. If the user is
interested to view the full log, they can enable "debug" logging.

Closes elastic/beats#41890
@VihasMakwana
Contributor

The logs look like this now:

{"log.level":"error","@timestamp":"2024-12-12T13:59:11.076+0530","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).handleFetchError","file.name":"module/wrapper.go","file.line":324},"message":"Error fetching data for metricset system.process_summary: non fatal error; reporting partial metrics: error fetching PID metrics for 316 processes, most likely a \"permission denied\" error. Enable debug logging to determine the exact cause.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-12-12T13:59:16.308+0530","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).handleFetchError","file.name":"module/wrapper.go","file.line":324},"message":"Error fetching data for metricset system.process: non fatal error; reporting partial metrics: error fetching PID metrics for 316 processes, most likely a \"permission denied\" error. Enable debug logging to determine the exact cause.","service.name":"metricbeat","ecs.version":"1.6.0"}

@yavor-ivanov-covantis

yavor-ivanov-covantis commented Dec 16, 2024

Elastic Agent versions 8.17 and 8.16 are both affected.

@cmacknz
Member Author

cmacknz commented Dec 16, 2024

@VihasMakwana was the elastic-agent-system-metrics package updated in main+8.17+8.16 with this? If not, we should re-open the issue until that is done. Merging the PR into the system metrics package doesn't fix this by itself.

@VihasMakwana
Contributor

@cmacknz I was waiting for CI on the backport PR #42024 to go green. I'll reopen this until we verify it's fixed.

@VihasMakwana VihasMakwana reopened this Dec 16, 2024
@VihasMakwana
Contributor

VihasMakwana commented Dec 18, 2024

Backports are now merged.

@VihasMakwana
Contributor

Closing this as elastic-agent-system-metrics is now updated on main+8.16+8.17.
