
Instrument sink batching #9719

Open
spencergilbert opened this issue Oct 20, 2021 · 10 comments
Labels
domain: observability Anything related to monitoring/observing Vector domain: sinks Anything related to Vector's sinks type: feature A value-adding code addition that introduces new functionality.

Comments

@spencergilbert
Contributor

Following on from our buffer instrumentation, we should also look to instrument batches. Because batches are flushed under a number of conditions, additional insight is helpful for operators to optimize their pipelines.

Batch sizing, the number of batches in flight, etc.

@spencergilbert spencergilbert added the type: feature A value-adding code addition that introduces new functionality. label Oct 20, 2021
@hhromic
Contributor

hhromic commented Oct 20, 2021

This would be a very welcome feature for our team, especially coupled with the recent instrumentation for buffers.
We make heavy use of the HTTP sink, and our Vector pipelines are often scaled to multiple replicas. For tuning, it is therefore very important to see how big the buffers/batches used by the HTTP sink are, especially in real time.

For batches, these metrics should probably be considered at a minimum:

  • batch_events: number of events in the batch.
  • batch_byte_size: number of bytes in the batch.
  • batch_age: amount of time the batch has been under construction.

Due to concurrency, with each HTTP request having its own batch to send, maybe these should be histograms instead of gauges, but I'm not sure which you think is better. I don't think batches have a notion of dropped events (that belongs to buffers), so I guess there's no need to instrument that metric.

In addition, it would be very useful to also export the configured max_events, max_bytes_size and timeout_secs values (similar to the buffers instrumentation). This would also allow calculating percentages, i.e. how full the batches are on average.
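To illustrate that percentage calculation, here is a PromQL sketch. The metric names `vector_batch_events` and `vector_batch_max_events` are hypothetical; neither currently exists in Vector:

```promql
# Average batch fullness, as a percentage of the configured limit,
# per sink component (hypothetical metric names)
100 * avg by (component_id) (vector_batch_events)
    / avg by (component_id) (vector_batch_max_events)
```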

@jszwedko jszwedko added domain: observability Anything related to monitoring/observing Vector domain: sinks Anything related to the Vector's sinks labels Oct 20, 2021
@tobz
Contributor

tobz commented Dec 8, 2021

Porting over some of the details from a duplicate ticket, these are the metrics I would want to see come out of any work to add metrics to the batching process:

  • (gauge) total number of pending batches
  • (gauge) total number of events in pending batches
  • (gauge) total size of pending batches
  • (histogram) batch TTL (how long a batch lives before being flushed, either due to max limits or timeout)
  • (counter) total batches created
  • (counter) total batches flushed, by status (did it hit max bytes? max events? timeout?)

(Some of these overlap with @hhromic's comment, obviously.)
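As a rough illustration of how the flush-reason counter and batch-TTL histogram could fit together, here is a minimal Python sketch. This is not Vector's actual implementation; all names and limits are illustrative, and a real implementation would emit these values through a metrics facade rather than storing them on the struct:

```python
import time

class Batcher:
    """Illustrative batcher that tracks the metrics proposed above."""

    def __init__(self, max_events, max_bytes, timeout_secs):
        self.max_events = max_events
        self.max_bytes = max_bytes
        self.timeout_secs = timeout_secs
        self.events = []
        self.bytes = 0
        self.created_at = time.monotonic()
        # counter: total batches flushed, by status
        self.flushed_by_reason = {"max_events": 0, "max_bytes": 0, "timeout": 0}
        # counter: total batches created
        self.batches_created = 1

    def push(self, event: bytes):
        """Add an event; return (batch, reason, age) if a limit was hit, else None."""
        self.events.append(event)
        self.bytes += len(event)
        if len(self.events) >= self.max_events:
            return self._flush("max_events")
        if self.bytes >= self.max_bytes:
            return self._flush("max_bytes")
        return None

    def tick(self):
        """Flush on timeout; would be called periodically from a timer."""
        if self.events and time.monotonic() - self.created_at >= self.timeout_secs:
            return self._flush("timeout")
        return None

    def _flush(self, reason):
        self.flushed_by_reason[reason] += 1       # counter increment, by status
        age = time.monotonic() - self.created_at  # histogram sample: batch TTL
        batch, self.events, self.bytes = self.events, [], 0
        self.created_at = time.monotonic()
        self.batches_created += 1
        return batch, reason, age
```

The gauges (pending batches, pending events, pending size) would simply be read from `events`/`bytes` across all live batchers at scrape time.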

@zsherman zsherman added this to the Vector 0.19.0 milestone Dec 13, 2021
@jszwedko jszwedko removed this from the Vector 0.19.0 milestone Jan 12, 2022
@csjiang

csjiang commented Nov 21, 2022

Hi! Any updates here? This type of metric would be super useful to us.

@bruceg
Member

bruceg commented Nov 21, 2022

No, we have not yet prioritized this work.

@jszwedko
Member

Suggestions from a user here: #20284

@fpytloun
Contributor

I think the docs for buffering and batching should also be extended, as it is very difficult to understand the relation between the buffer and batch limits.

If a user configures the max batch size to be larger than the buffer size, what will happen? If there is a direct relation, then Vector should throw a warning when the buffer size is lower than the batch limit and recommend a buffer size at least equal to the batch size, or a multiple of it (e.g. batch size * 2 to allow some read-ahead from sources).

Also, how does this work when ARC (Adaptive Request Concurrency) or concurrency > 1 is being used? (Then it should presumably be batch size * expected max concurrency.)

And one last thing: if a user has one source (e.g. Kafka) and multiple sinks (ES, S3) with different buffer and batch sizes, will the sink with the smaller buffer throttle the other one?

@jszwedko
Member

I think the docs for buffering and batching should also be extended, as it is very difficult to understand the relation between the buffer and batch limits.

Agreed, the docs could be expanded. Putting some responses here in the meantime.

If a user configures the max batch size to be larger than the buffer size, what will happen? If there is a direct relation, then Vector should throw a warning when the buffer size is lower than the batch limit and recommend a buffer size at least equal to the batch size, or a multiple of it (e.g. batch size * 2 to allow some read-ahead from sources).

Sink buffers are decoupled from batching. That is, the buffer feeds events into the sink as it receives them and as the sink fetches them. The sink then batches those events in memory.

There is this diagram that might help: https://vector.dev/docs/reference/configuration/sinks/vector/#buffers-and-batches
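For illustration, here is how the two sets of knobs are configured independently on, e.g., an `http` sink; the values below are arbitrary:

```toml
[sinks.my_http]
type = "http"
inputs = ["my_source"]
uri = "https://example.com/ingest"
encoding.codec = "json"

# Buffer: holds events before the sink picks them up
buffer.type = "memory"
buffer.max_events = 10000

# Batch: in-memory grouping of events into requests, after the buffer
batch.max_events = 1000
batch.max_bytes = 1048576
batch.timeout_secs = 5
```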

Also, how does this work when ARC (Adaptive Request Concurrency) or concurrency > 1 is being used? (Then it should presumably be batch size * expected max concurrency.)

Again, the buffers are decoupled from the in-memory batching, so the buffer size doesn't need to be related to the batch size. You can expect one batch to be created per unit of concurrency, though.

And one last thing: if a user has one source (e.g. Kafka) and multiple sinks (ES, S3) with different buffer and batch sizes, will the sink with the smaller buffer throttle the other one?

If the small buffer is full, yes, it will apply back-pressure before the larger buffer does. Again, though, batching is done in memory and is decoupled from buffering.

@fpytloun
Contributor

@jszwedko thank you, that matches what I vaguely remember from a Discord discussion some time ago. Still, I think it might be beneficial for the buffer to be larger than the batch size to allow some read-ahead, depending on the source, correct?
Is there a metric or log message to know when Vector applied back-pressure due to a slow sink? I don't remember seeing any log message for such an event 🤔

@jszwedko
Member

@jszwedko thank you, that matches what I vaguely remember from a Discord discussion some time ago. Still, I think it might be beneficial for the buffer to be larger than the batch size to allow some read-ahead, depending on the source, correct? Is there a metric or log message to know when Vector applied back-pressure due to a slow sink? I don't remember seeing any log message for such an event 🤔

The utilization metric is the best one that currently exists for identifying back pressure.

Thinking about it a bit more, for in-memory buffers, I think I could see it being beneficial to have the buffer be at least as big as the batch size multiplied by the concurrency so that the next set of requests could be buffered in memory while the current set is in flight.

For disk buffers, I think having it be 2x would be beneficial since data isn't "deleted" from disk buffers until the sink delivers it.
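That rule of thumb can be sketched as follows. This is an illustrative heuristic drawn from this thread, not an official recommendation:

```python
def recommended_buffer_events(batch_max_events: int, concurrency: int,
                              buffer_kind: str = "memory") -> int:
    """Rule-of-thumb minimum buffer size (in events).

    memory: batch size * concurrency, so the next set of requests can be
            buffered in memory while the current set is in flight.
    disk:   2x that, since data is not removed from a disk buffer until
            the sink has delivered it.
    """
    base = batch_max_events * concurrency
    return base * 2 if buffer_kind == "disk" else base

# e.g. batch.max_events = 1000 with an expected max concurrency of 8
print(recommended_buffer_events(1000, 8))           # memory buffer
print(recommended_buffer_events(1000, 8, "disk"))   # disk buffer
```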

@timfehr

timfehr commented Nov 6, 2024

Is there any news on the current status of this feature? These metrics would also be really beneficial for us.
