Limit repetitive error logging in ingesters #5894

Open
jhalterman opened this issue Aug 31, 2023 · 3 comments
@jhalterman
Member

Mimir ingesters can occasionally become bogged down logging high volumes of repeated errors, such as "out of bounds" errors. This could happen following a temporary outage of some ingesters. Ideally, these repeated errors should be sampled, and perhaps replaced with a metric.

Let's update Mimir to use sampled logging for errors that might put unnecessary pressure on ingesters.
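
As a rough sketch of the idea (hypothetical names, not existing Mimir code): count every occurrence of a repetitive error in a metric, and only write one in N occurrences to the log. Something like:

package samplingsketch

import (
	"sync/atomic"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/prometheus/client_golang/prometheus"
)

// errorSampler is a hypothetical 1-in-N sampler: every occurrence of the
// error is counted in a metric, but only every N-th occurrence is logged.
type errorSampler struct {
	n     int64
	seen  atomic.Int64
	total prometheus.Counter // e.g. a per-reason counter of how often the error occurred
}

// shouldLog records the occurrence and reports whether it should be logged.
func (s *errorSampler) shouldLog() bool {
	s.total.Inc()
	return (s.seen.Add(1)-1)%s.n == 0
}

// logSampled gates a repetitive error behind the sampler.
func logSampled(logger log.Logger, s *errorSampler, err error) {
	if s.shouldLog() {
		level.Warn(logger).Log("msg", "push failed", "err", err, "sampled", s.n)
	}
}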

@pr00se
Contributor

pr00se commented Sep 7, 2023

related: #1900

@seizethedave
Contributor

seizethedave commented Mar 21, 2024

Around the time this issue was created, Mimir also gained a log sampling facility (#5584). A number of error types are now sampled:

type ingesterErrSamplers struct {
	sampleTimestampTooOld             *log.Sampler
	sampleTimestampTooOldOOOEnabled   *log.Sampler
	sampleTimestampTooFarInFuture     *log.Sampler
	sampleOutOfOrder                  *log.Sampler
	sampleDuplicateTimestamp          *log.Sampler
	maxSeriesPerMetricLimitExceeded   *log.Sampler
	maxMetadataPerMetricLimitExceeded *log.Sampler
	maxSeriesPerUserLimitExceeded     *log.Sampler
	maxMetadataPerUserLimitExceeded   *log.Sampler
}

(Although at the moment sampling only seems to be honored in the ingester gRPC middleware, i.e. for errors returned from gRPC calls, but not for errors sent directly to a logger (#7690).)

Related, and added around the same time, was overall rate limiting for emitted logs (#5764). That change comes with a metric (logger_rate_limit_discarded_log_lines_total), and looking at recent weeks it is drastically reducing log volume, at times eliminating ~1 million log lines per second.
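
For intuition, a stripped-down version of that kind of rate-limited logger could look like the sketch below (illustrative only; the names are mine and Mimir's actual implementation in #5764 differs):

package ratelimitsketch

import (
	"github.com/go-kit/log"
	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/time/rate"
)

// rateLimitedLogger wraps a go-kit logger, drops lines beyond a configured
// rate, and counts what it drops so the loss is observable via a metric.
type rateLimitedLogger struct {
	next      log.Logger
	limiter   *rate.Limiter
	discarded prometheus.Counter // e.g. something like logger_rate_limit_discarded_log_lines_total
}

func newRateLimitedLogger(next log.Logger, logsPerSecond float64, burst int, discarded prometheus.Counter) log.Logger {
	return &rateLimitedLogger{
		next:      next,
		limiter:   rate.NewLimiter(rate.Limit(logsPerSecond), burst),
		discarded: discarded,
	}
}

func (l *rateLimitedLogger) Log(keyvals ...interface{}) error {
	if !l.limiter.Allow() {
		l.discarded.Inc()
		return nil
	}
	return l.next.Log(keyvals...)
}

A token-bucket limiter like this keeps the worst-case log volume bounded no matter how many distinct errors are firing, and the counter makes the dropped volume observable.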


I could use some help finding types of errors that we still think are causing floods. I'll ping you internally to see if we can locate more.

Ideally, these repeated errors should be sampled, and perhaps replaced with a metric.

Supplementing sampled errors with an accurate counter metric would no doubt be valuable. I will give that some thought as I learn more here.

@seizethedave
Contributor

seizethedave commented Mar 25, 2024

I have surveyed the scene. With #5584 and #5764 in place, I have found no logs that are still egregiously abusing our systems; both of those changes are protecting the system from major logging floods.

There are a couple of things that we could do as part of this issue:

  • Implement rate limits on explicit logger invocations. (This wasn't done as part of #5584, Sampled logging: log only 1 in N of specific errors.)
  • Move the per-error sampled logs of #5584 to a time-based rate limiter rather than frequency-based sampling. This would improve big-tenant/small-tenant colocation problems, where 1-in-10 frequency sampling of 99 tenant A logs + 1 tenant B log will usually cause logs about tenant B's workload to vanish. However, a similar problem remains because our limiters are ingester-scoped, not per-tenant; see the sketch below.
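
To make the second bullet concrete, here is a hypothetical sketch of a time-based, per-tenant limiter (names and budget are made up, not an existing Mimir component). The key difference from 1-in-N frequency sampling is that each tenant gets its own budget, so tenant B's occasional line isn't crowded out by tenant A's volume:

package tenantsketch

import (
	"sync"

	"golang.org/x/time/rate"
)

// perTenantLimiter gives each tenant its own small log budget (e.g. 1 line/sec),
// so a chatty tenant cannot crowd other tenants' errors out of the logs the way
// ingester-wide 1-in-N frequency sampling can.
type perTenantLimiter struct {
	mtx      sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit
	burst    int
}

func newPerTenantLimiter(perSec rate.Limit, burst int) *perTenantLimiter {
	return &perTenantLimiter{limiters: map[string]*rate.Limiter{}, perSec: perSec, burst: burst}
}

// allow reports whether a log line for the given tenant fits in that tenant's
// budget right now. A real implementation would also need to evict idle tenants.
func (l *perTenantLimiter) allow(tenantID string) bool {
	l.mtx.Lock()
	lim, ok := l.limiters[tenantID]
	if !ok {
		lim = rate.NewLimiter(l.perSec, l.burst)
		l.limiters[tenantID] = lim
	}
	l.mtx.Unlock()
	return lim.Allow()
}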

I think one viable option is to close this ticket and focus on what I suspect is the ultimate solution: #1900. (Both of the options noted above could be handled flexibly by #1900.)

@pracucci mentioned this issue Apr 3, 2024