
Instrument sink batching #9719

Open
spencergilbert opened this issue Oct 20, 2021 · 10 comments
Labels
domain: observability Anything related to monitoring/observing Vector domain: sinks Anything related to Vector's sinks type: feature A value-adding code addition that introduces new functionality.

Comments

@spencergilbert
Contributor

Following on from our buffer instrumentation, we should also look to instrument batches. Because batches are flushed under a number of conditions, additional insight is helpful for operators to optimize their pipelines.

Batch sizing, the number of batches in flight, etc.

@spencergilbert spencergilbert added the type: feature A value-adding code addition that introduces new functionality. label Oct 20, 2021
@hhromic
Contributor

hhromic commented Oct 20, 2021

This would be a very welcome feature for our team, especially coupled with the recent instrumentation for buffers.
We make heavy use of the HTTP sink, and our Vector pipelines are often scaled to multiple replicas. For tuning, it is therefore very important to see how big the buffers/batches used by the HTTP sink are, especially in real time.

For batches, these metrics should probably be considered at a minimum:

  • batch_events: number of events in the batch.
  • batch_byte_size: number of bytes in the batch.
  • batch_age: amount of time the batch has been under construction.

Due to concurrency, with each HTTP request having its own batch to send, maybe these should be histograms instead of gauges, but I'm not sure which you think is better. I don't think batches have a notion of dropped events (that belongs to buffers), so I guess there's no need to instrument that metric.

In addition, it would be very useful to also export the configured max_events, max_bytes_size and timeout_secs values (similar to the buffers instrumentation). This would also allow calculating percentages, i.e. how full the batches are on average.
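To illustrate that percentage calculation, here is a PromQL sketch. The metric names `vector_batch_events` and `vector_batch_max_events` are hypothetical; neither currently exists in Vector:

```promql
# Average batch fullness, as a percentage of the configured limit,
# per sink component (hypothetical metric names)
100 * avg by (component_id) (vector_batch_events)
    / avg by (component_id) (vector_batch_max_events)
```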

@jszwedko jszwedko added domain: observability Anything related to monitoring/observing Vector domain: sinks Anything related to the Vector's sinks labels Oct 20, 2021
@tobz
Contributor

tobz commented Dec 8, 2021

Porting over some of the details from a duplicate ticket, these are the metrics I would want to see come out of any work to add metrics to the batching process:

  • (gauge) total number of pending batches
  • (gauge) total number of events in pending batches
  • (gauge) total size of pending batches
  • (histogram) batch TTL (how long a batch lives before being flushed, either due to max limits or timeout)
  • (counter) total batches created
  • (counter) total batches flushed, by status (did it hit max bytes? max events? timeout?)

(Some of these overlap with @hhromic's comment, obviously.)
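As a rough illustration of how the flush-reason counter and batch-TTL histogram could fit together, here is a minimal Python sketch. This is not Vector's actual implementation; all names and limits are illustrative, and a real implementation would emit these values through a metrics facade rather than storing them on the struct:

```python
import time

class Batcher:
    """Illustrative batcher that tracks the metrics proposed above."""

    def __init__(self, max_events, max_bytes, timeout_secs):
        self.max_events = max_events
        self.max_bytes = max_bytes
        self.timeout_secs = timeout_secs
        self.events = []
        self.bytes = 0
        self.created_at = time.monotonic()
        # counter: total batches flushed, by status
        self.flushed_by_reason = {"max_events": 0, "max_bytes": 0, "timeout": 0}
        # counter: total batches created
        self.batches_created = 1

    def push(self, event: bytes):
        """Add an event; return (batch, reason, age) if a limit was hit, else None."""
        self.events.append(event)
        self.bytes += len(event)
        if len(self.events) >= self.max_events:
            return self._flush("max_events")
        if self.bytes >= self.max_bytes:
            return self._flush("max_bytes")
        return None

    def tick(self):
        """Flush on timeout; would be called periodically from a timer."""
        if self.events and time.monotonic() - self.created_at >= self.timeout_secs:
            return self._flush("timeout")
        return None

    def _flush(self, reason):
        self.flushed_by_reason[reason] += 1       # counter increment, by status
        age = time.monotonic() - self.created_at  # histogram sample: batch TTL
        batch, self.events, self.bytes = self.events, [], 0
        self.created_at = time.monotonic()
        self.batches_created += 1
        return batch, reason, age
```

The gauges (pending batches, pending events, pending size) would simply be read from `events`/`bytes` across all live batchers at scrape time.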

@zsherman zsherman added this to the Vector 0.19.0 milestone Dec 13, 2021
@jszwedko jszwedko removed this from the Vector 0.19.0 milestone Jan 12, 2022
@csjiang

csjiang commented Nov 21, 2022

Hi! Any updates here? This type of metric would be super useful to us.

@bruceg
Member

bruceg commented Nov 21, 2022

No, we have not yet prioritized this work.

@jszwedko
Member

Suggestions from a user here: #20284

@fpytloun
Contributor

I think the docs for buffering and batching should also be extended, as it is very difficult to understand the relation between the buffer and batch limits.

If a user configures the max batch size to be larger than the buffer size, what will happen? If there is a direct relation, then Vector should throw a warning when the buffer size is lower than the batch limit and recommend a buffer size at least equal to the batch size, or a multiple of it (e.g. batch size * 2 to allow some read-ahead from sources).

Also, how does this work when ARC (Adaptive Request Concurrency) or concurrency > 1 is being used? (Then it should presumably be batch size * expected max concurrency.)

And one last thing: if a user has one source (e.g. Kafka) and multiple sinks (ES, S3) with different buffer and batch sizes, will the sink with the smaller buffer throttle the other one?

@jszwedko
Member

I think the docs for buffering and batching should also be extended, as it is very difficult to understand the relation between the buffer and batch limits.

Agreed, the docs could be expanded. Putting some responses here in the meantime.

If a user configures the max batch size to be larger than the buffer size, what will happen? If there is a direct relation, then Vector should throw a warning when the buffer size is lower than the batch limit and recommend a buffer size at least equal to the batch size, or a multiple of it (e.g. batch size * 2 to allow some read-ahead from sources).

Sink buffers are decoupled from batching. That is, the buffer feeds events into the sink as it receives them and as the sink fetches them. The sink then batches those events in memory.

There is this diagram that might help: https://vector.dev/docs/reference/configuration/sinks/vector/#buffers-and-batches
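For illustration, here is how the two sets of knobs are configured independently on, e.g., an `http` sink; the values below are arbitrary:

```toml
[sinks.my_http]
type = "http"
inputs = ["my_source"]
uri = "https://example.com/ingest"
encoding.codec = "json"

# Buffer: holds events before the sink picks them up
buffer.type = "memory"
buffer.max_events = 10000

# Batch: in-memory grouping of events into requests, after the buffer
batch.max_events = 1000
batch.max_bytes = 1048576
batch.timeout_secs = 5
```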

Also, how does this work when ARC (Adaptive Request Concurrency) or concurrency > 1 is being used? (Then it should presumably be batch size * expected max concurrency.)

Again, the buffers are decoupled from the in-memory batching, so the buffer size doesn't need to be related to the batch size. You can expect one batch to be created per unit of concurrency, though.

And one last thing: if a user has one source (e.g. Kafka) and multiple sinks (ES, S3) with different buffer and batch sizes, will the sink with the smaller buffer throttle the other one?

If the small buffer is full, yes, it will apply back-pressure before the larger buffer does. Again, though, batching is done in memory and is decoupled from buffering.

@fpytloun
Contributor

@jszwedko thank you, that matches what I vaguely remember from a Discord discussion some time ago. Still, I think it might be beneficial for the buffer to be larger than the batch size to allow some read-ahead, depending on the source, correct?
Is there a metric or log message to know when Vector applied back-pressure due to a slow sink? I don't remember seeing any log message for such an event 🤔

@jszwedko
Member

@jszwedko thank you, that matches what I vaguely remember from a Discord discussion some time ago. Still, I think it might be beneficial for the buffer to be larger than the batch size to allow some read-ahead, depending on the source, correct? Is there a metric or log message to know when Vector applied back-pressure due to a slow sink? I don't remember seeing any log message for such an event 🤔

The utilization metric is the best one that currently exists for identifying back pressure.

Thinking about it a bit more, for in-memory buffers, I think I could see it being beneficial to have the buffer be at least as big as the batch size multiplied by the concurrency so that the next set of requests could be buffered in memory while the current set is in flight.

For disk buffers, I think having it be 2x would be beneficial since data isn't "deleted" from disk buffers until the sink delivers it.
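That rule of thumb can be sketched as follows. This is an illustrative heuristic drawn from this thread, not an official recommendation:

```python
def recommended_buffer_events(batch_max_events: int, concurrency: int,
                              buffer_kind: str = "memory") -> int:
    """Rule-of-thumb minimum buffer size (in events).

    memory: batch size * concurrency, so the next set of requests can be
            buffered in memory while the current set is in flight.
    disk:   2x that, since data is not removed from a disk buffer until
            the sink has delivered it.
    """
    base = batch_max_events * concurrency
    return base * 2 if buffer_kind == "disk" else base

# e.g. batch.max_events = 1000 with an expected max concurrency of 8
print(recommended_buffer_events(1000, 8))           # memory buffer
print(recommended_buffer_events(1000, 8, "disk"))   # disk buffer
```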

@timfehr

timfehr commented Nov 6, 2024

Is there any news on the current status of this feature? These metrics would also be really beneficial for us.
