ingest consumer: handle Push errors #6940
Conversation
This adds error handling for ingester errors. Client errors are only logged at warning level (like they are today). Server errors trigger a backoff at the consumer; the backoff is unlimited and retries the same batch of records until it is successfully ingested.
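For readers skimming the thread, here is a minimal sketch of that retry behaviour, assuming dskit's backoff package with MaxRetries set to 0 (unlimited). The record type, consumer interface, and function name below are illustrative placeholders, not the actual Mimir implementation:

package ingestsketch

import (
	"context"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/grafana/dskit/backoff"
)

// record is a placeholder for whatever the reader hands to its consumer.
type record struct {
	tenantID string
	content  []byte
}

// recordConsumer is expected to swallow client errors itself, so any error it
// returns is treated as a server error and retried.
type recordConsumer interface {
	consume(ctx context.Context, records []record) error
}

// consumeWithRetries keeps retrying the same batch until it is ingested or the
// context is cancelled.
func consumeWithRetries(ctx context.Context, c recordConsumer, records []record, logger log.Logger) error {
	boff := backoff.New(ctx, backoff.Config{
		MinBackoff: 250 * time.Millisecond,
		MaxBackoff: 2 * time.Second,
		MaxRetries: 0, // 0 means the backoff is unlimited
	})
	for boff.Ongoing() {
		err := c.consume(ctx, records)
		if err == nil {
			return nil
		}
		level.Error(logger).Log(
			"msg", "encountered error while ingesting data from Kafka; will retry",
			"err", err,
			"num_retries", boff.NumRetries(),
		)
		boff.Wait()
	}
	// Only reachable once the context is cancelled.
	return boff.Err()
}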
Force-pushed from 68ec2ed to 0fbfaeb.
pkg/storage/ingest/reader.go
Outdated
for boff.Ongoing() {
	err := r.consumer.consume(ctx, records)
	if err != nil {
		level.Error(r.logger).Log(
We need a metric here, to be able to alert on it (we need to compute the failure rate).
Generally speaking, this for loop could be an infinite loop in case of a persistent error, so I would like to better understand how we're going to alert on it.
I was thinking that we alert on end-to-end latency. If the consume errors are infrequent enough, we can probably get away with them without user-visible effects. Since we block consumption, the failure rate will either be 0% or 100%, which doesn't sound very insightful.
// TODO move distributor's isClientError to a separate package and use that here to swallow only client errors and abort on others
continue
if !isClientIngesterError(err) {
	return fmt.Errorf("consuming record at index %d for tenant %s: %w", recordIdx, wr.tenantID, err)
What's the actual value of knowing the index? It looks like something unpredictable, so I can't understand what we'll use it for.
My idea was to give enough information to be able to deduce the exact Kafka record offset from the errors here and the error in the retry:
mimir/pkg/storage/ingest/reader.go
Lines 185 to 189 in a5d37b9

"msg", "encountered error while ingesting data from Kafka; will retry",
"err", err,
"record_min_offset", minOffset,
"record_max_offset", maxOffset,
"num_retries", boff.NumRetries(),
Thanks for addressing my feedback. I just have one last comment about the metric.
pkg/storage/ingest/pusher.go
Outdated
@@ -45,6 +49,14 @@ func newPusherConsumer(p Pusher, reg prometheus.Registerer, l log.Logger) *pushe
	MaxAge:     time.Minute,
	AgeBuckets: 10,
}),
clientErrRequests: promauto.With(reg).NewCounter(prometheus.CounterOpts{
	Name: "cortex_ingest_storage_reader_client_error_requests_total",
There's a bit of a naming mismatch with the rest of the code. In the other metrics we're calling it "records" and not "requests", which I think makes sense because it allows us to distinguish them from actual requests issued (e.g. requests to Kafka).
WDYT if we have:
- cortex_ingest_storage_reader_records_failed_total: generalised to "failed", and then we differentiate whether it's a client or server error through a label, so we also keep the count of server errors?
- cortex_ingest_storage_reader_records_total
(A rough sketch of this shape follows below.)
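Purely as an illustration of that shape, here is a sketch using a counter vector with a "cause" label; the label name and the helper are assumptions, not necessarily what the PR ends up using:

package ingestsketch

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

type readerMetrics struct {
	processedRecords prometheus.Counter
	failedRecords    *prometheus.CounterVec
}

func newReaderMetrics(reg prometheus.Registerer) readerMetrics {
	return readerMetrics{
		processedRecords: promauto.With(reg).NewCounter(prometheus.CounterOpts{
			Name: "cortex_ingest_storage_reader_records_total",
			Help: "Number of records consumed from Kafka by the reader.",
		}),
		failedRecords: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Name: "cortex_ingest_storage_reader_records_failed_total",
			Help: "Number of records that failed to be ingested, partitioned by error cause.",
		}, []string{"cause"}),
	}
}

Usage would then be something like failedRecords.WithLabelValues("client").Inc() for client errors and failedRecords.WithLabelValues("server").Inc() for server errors, so the failure rate per cause can be computed from the two counters.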
done in bf4b27d
…x_ingest_storage_reader_records_total Signed-off-by: Dimitar Dimitrov <[email protected]>
Thanks!
This adds error handling for ingester errors. Client errors are only logged at warning level (like they are with regular gRPC ingestion) and ignored otherwise. Server errors trigger a backoff at the consumer; the backoff is unlimited and retries the same batch of records until it is successfully ingested.