Limit repetitive error logging in ingesters #5894

Open
jhalterman opened this issue Aug 31, 2023 · 3 comments
@jhalterman
Member

Mimir ingesters can occasionally become bogged down logging high volumes of repeated errors, such as "out of bounds" errors. This could happen following a temporary outage of some ingesters. Ideally, these repeated errors should be sampled, and perhaps replaced with a metric.

Let's update Mimir to use sampled logging for errors that might put unnecessary pressure on ingesters.
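
As a rough sketch of the idea (hypothetical names, not existing Mimir code): count every occurrence of a repetitive error in a metric, and only write one in N occurrences to the log. Something like:

package samplingsketch

import (
	"sync/atomic"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/prometheus/client_golang/prometheus"
)

// errorSampler is a hypothetical 1-in-N sampler: every occurrence of the
// error is counted in a metric, but only every N-th occurrence is logged.
type errorSampler struct {
	n     int64
	seen  atomic.Int64
	total prometheus.Counter // e.g. a per-reason counter of how often the error occurred
}

// shouldLog records the occurrence and reports whether it should be logged.
func (s *errorSampler) shouldLog() bool {
	s.total.Inc()
	return (s.seen.Add(1)-1)%s.n == 0
}

// logSampled gates a repetitive error behind the sampler.
func logSampled(logger log.Logger, s *errorSampler, err error) {
	if s.shouldLog() {
		level.Warn(logger).Log("msg", "push failed", "err", err, "sampled", s.n)
	}
}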

@pr00se
Contributor

pr00se commented Sep 7, 2023

related: #1900

@seizethedave
Contributor

seizethedave commented Mar 21, 2024

Around the time this issue was created, Mimir also gained a log sampling facility (#5584). A number of error types are now sampled:

type ingesterErrSamplers struct {
	sampleTimestampTooOld             *log.Sampler
	sampleTimestampTooOldOOOEnabled   *log.Sampler
	sampleTimestampTooFarInFuture     *log.Sampler
	sampleOutOfOrder                  *log.Sampler
	sampleDuplicateTimestamp          *log.Sampler
	maxSeriesPerMetricLimitExceeded   *log.Sampler
	maxMetadataPerMetricLimitExceeded *log.Sampler
	maxSeriesPerUserLimitExceeded     *log.Sampler
	maxMetadataPerUserLimitExceeded   *log.Sampler
}

(Although at the moment sampling only seems to be honored in the ingester gRPC middleware, i.e. for errors returned from gRPC calls, but not for errors sent directly to a logger (#7690).)

Related, and added around the same time, was overall rate limiting for emitted logs (#5764). That change comes with a metric (logger_rate_limit_discarded_log_lines_total), and looking at recent weeks it is drastically reducing log volume, at times eliminating ~1 million log lines per second.
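
For intuition, a stripped-down version of that kind of rate-limited logger could look like the sketch below (illustrative only; the names are mine and Mimir's actual implementation in #5764 differs):

package ratelimitsketch

import (
	"github.com/go-kit/log"
	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/time/rate"
)

// rateLimitedLogger wraps a go-kit logger, drops lines beyond a configured
// rate, and counts what it drops so the loss is observable via a metric.
type rateLimitedLogger struct {
	next      log.Logger
	limiter   *rate.Limiter
	discarded prometheus.Counter // e.g. something like logger_rate_limit_discarded_log_lines_total
}

func newRateLimitedLogger(next log.Logger, logsPerSecond float64, burst int, discarded prometheus.Counter) log.Logger {
	return &rateLimitedLogger{
		next:      next,
		limiter:   rate.NewLimiter(rate.Limit(logsPerSecond), burst),
		discarded: discarded,
	}
}

func (l *rateLimitedLogger) Log(keyvals ...interface{}) error {
	if !l.limiter.Allow() {
		l.discarded.Inc()
		return nil
	}
	return l.next.Log(keyvals...)
}

A token-bucket limiter like this keeps the worst-case log volume bounded no matter how many distinct errors are firing, and the counter makes the dropped volume observable.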


I could use some help finding types of errors that we still think are causing floods. I'll ping you internally to see if we can locate more.

Ideally, these repeated errors should be sampled, and perhaps replaced with a metric.

Supplementing sampled errors with an accurate counter metric would no doubt be valuable. I will give that some thought as I learn more here.

@seizethedave
Contributor

seizethedave commented Mar 25, 2024

I have surveyed the scene. With #5584 and #5764 in place, I have found no logs that are still egregiously abusing our systems; both of those changes are protecting the system from major logging floods.

There are a couple of things that we could do as part of this issue:

  • Implement rate limits on explicit logger invocations. (This wasn't done as part of #5584, Sampled logging: log only 1 in N of specific errors.)
  • Move the per-error sampled logs of #5584 to a time-based rate limiter rather than frequency-based sampling. This would improve big-tenant/small-tenant colocation problems, where 1-in-10 frequency sampling of 99 tenant A logs + 1 tenant B log will usually cause logs about tenant B's workload to vanish. However, a similar problem remains because our limiters are ingester-scoped, not per-tenant; see the sketch below.
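
To make the second bullet concrete, here is a hypothetical sketch of a time-based, per-tenant limiter (names and budget are made up, not an existing Mimir component). The key difference from 1-in-N frequency sampling is that each tenant gets its own budget, so tenant B's occasional line isn't crowded out by tenant A's volume:

package tenantsketch

import (
	"sync"

	"golang.org/x/time/rate"
)

// perTenantLimiter gives each tenant its own small log budget (e.g. 1 line/sec),
// so a chatty tenant cannot crowd other tenants' errors out of the logs the way
// ingester-wide 1-in-N frequency sampling can.
type perTenantLimiter struct {
	mtx      sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit
	burst    int
}

func newPerTenantLimiter(perSec rate.Limit, burst int) *perTenantLimiter {
	return &perTenantLimiter{limiters: map[string]*rate.Limiter{}, perSec: perSec, burst: burst}
}

// allow reports whether a log line for the given tenant fits in that tenant's
// budget right now. A real implementation would also need to evict idle tenants.
func (l *perTenantLimiter) allow(tenantID string) bool {
	l.mtx.Lock()
	lim, ok := l.limiters[tenantID]
	if !ok {
		lim = rate.NewLimiter(l.perSec, l.burst)
		l.limiters[tenantID] = lim
	}
	l.mtx.Unlock()
	return lim.Allow()
}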

I think one viable option is to close this ticket and focus on what I suspect is the ultimate solution: #1900. (Both of the options noted above could be handled flexibly by #1900.)

@pracucci mentioned this issue Apr 3, 2024