fix: sort and send all metric samples in a batch (prometheusremotewrite) #11683
Conversation
Awesome, thanks! I will go ahead and close my PR :)
Hoping you might be able to shed some history on this.
See #11682 for a bit more context. It does seem that several issues have been filed when multiple batches show up and metrics are then dropped. However, the README does have a big warning to this effect as well. Thoughts on a way forward?
I'm not sure I can add much. The remote write protocol is different than the ordinary pull protocol and I'm not too familiar with it. I wasn't involved in the original PR adding this serializer.

I'm not sure why the serializer drops metrics when there is more than one with the same metric key. It does look like some users expect this behavior though, because there was a bugfix on the timestamp comparison: #10547. If dropping duplicates is a feature that some people expect and use, we can't just remove it. We need to track down why it was included in the first place before we decide what to do.

@aarnaud contributed the plugin a couple years ago. If he is still available, maybe he can help us understand? If dropping duplicates is useful for some use cases but not for others (like @jhychan's) we may need to add a setting so people can configure what they need.
Hi, maybe there is a better way to implement this.
Looking back at the original PR and the related issue, I think it dropped metrics because the code was a copy of the original prometheus serialiser code, which ultimately is a different design. Perhaps @aarnaud can confirm. But historical context aside, if you consider the purposes of the two serialisers objectively:

The original serialiser code is built for the prometheus pull-based architecture, where a prometheus server regularly scrapes metrics that an agent exposes over an HTTP listener. The prometheus exporter (telegraf in this case) just needs to keep the metric samples up to date. There is no concept of scraping historical metrics - if prometheus fails to scrape metrics at any given time, there's no way for it to ask the agent to go back and present those metrics. The exporter agent should only ever present the most up-to-date metrics at the time of scraping. This is why the prometheus serialiser drops all older samples and presents only the latest one.

In contrast, the remote write protocol is designed to be push-based, like Telegraf is with InfluxDB. The agent collects metrics over time and writes them to the remote write receiver, each sample having an associated timestamp. The samples could be sent in batches or one at a time, immediately after collection or 60 seconds later. Naturally, these metric samples should be buffered and retry logic added in case of remote write failure. There's no longer a valid reason to drop older samples - all of them should be sent. So I think this aspect was possibly just overlooked at the time of the original PR.

In most cases, no one using this serialiser in batch mode is likely to have observed this as a problem as long as their remote writes are reliable and their output flush interval is equal to or faster than the metric collection interval. However, if you collect more frequently than you flush, you are effectively not getting the level of time series resolution you expect from telegraf. Additionally, if you lose one or two samples within a 5 minute window, you won't even notice that you are losing samples in Cortex/Mimir, as prometheus metrics have a 5 minute staleness marker by default. The only time you will really notice this is if there is an extended outage - say 5 minutes or more. When things are restored, telegraf should be writing the entire backlog of metrics to the receiver, but the bug here is that it's only writing the latest sample in each batch.

Last thing to note - if you disable batch sends, then you don't lose any metrics. Each metric in the backlog is sent over to the receiver in its own protobuf payload. But this is pretty inefficient over the wire, as the collection of string labels in a time series is often much larger than the sample values.

Regarding the histogram changes - I think it's also somewhat related. Prometheus scraping does some level of metric validation at scrape time, so it requires all histogram metrics to be valid and complete. I don't think the remote write protocol enforces such requirements, as there's no guarantee it will receive all the metrics related to a histogram in a timely fashion. Which is why I suspect the histogram validation code probably doesn't apply to remote write.
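To make the push model concrete, here is a minimal sketch of the shape of a remote write payload, using the prompb types from github.com/prometheus/prometheus/prompb (metric name, host and values are made up for illustration): a single series, identified by its label set, can carry any number of timestamped samples, so there is no protocol-level reason to keep only the newest one.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// One series, identified by its label set, carrying three timestamped
	// samples from successive collection intervals.
	req := prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "cpu_usage_idle"},
				{Name: "host", Value: "example-host"},
			},
			Samples: []prompb.Sample{
				{Value: 97.2, Timestamp: 1660000000000}, // milliseconds since epoch
				{Value: 96.8, Timestamp: 1660000015000},
				{Value: 97.0, Timestamp: 1660000030000},
			},
		}},
	}
	fmt.Printf("series: %d, samples: %d\n",
		len(req.Timeseries), len(req.Timeseries[0].Samples))
}
```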
Some references:
- Prometheus text-based exposition format
- Prometheus remote write compliance
- Prometheus remote write protocol specification

There is a suite of remote write validation compliance tests written by the team. Perhaps worth bringing over some of their tests.
Making sure that metric samples are pushed in chronological order is one thing - but dropping older samples in a batch is a different thing entirely. My PR collates all the samples and sorts them chronologically before sending, so within a batch, there should not be any out-of-order failures. However, there is always an issue with older metrics falling into a later batch, which no amount of client-side changes can avoid.
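A minimal sketch of that chronological sort, assuming samples have already been grouped into prompb.TimeSeries objects (the exact code in the PR may differ):

```go
package main

import (
	"fmt"
	"sort"

	"github.com/prometheus/prometheus/prompb"
)

// sortSamples orders a series' samples by timestamp so that, within a
// single batch, the receiver never sees an out-of-order write.
func sortSamples(samples []prompb.Sample) {
	sort.Slice(samples, func(i, j int) bool {
		return samples[i].Timestamp < samples[j].Timestamp
	})
}

func main() {
	samples := []prompb.Sample{
		{Value: 2, Timestamp: 1660000030000},
		{Value: 1, Timestamp: 1660000015000},
		{Value: 3, Timestamp: 1660000045000},
	}
	sortSamples(samples)
	fmt.Println(samples) // now ordered by Timestamp
}
```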
You're totally right.
If I remember correctly, yes.
After reading through the remote write spec (google doc link above), I've removed the extra logic that tries to fill in any absent _sum, _count and +Inf _bucket metrics for Histograms. The atomicity of a Histogram is not part of the remote write spec, and the logic doesn't quite translate to batched metrics anyway. @reimda @powersj

One question I have to ask about the existing plugin code and the issue generally: let's say we configure telegraf to collect metrics every 15 seconds, resulting in 50 metrics per collection, and we configure this plugin to flush every 30 seconds. This means we should have 100 metrics to send to the output every 30s. Now, if we configure this plugin to process metrics individually, the output will receive 100 metrics. But if we configure this plugin to process the metrics in batches, the output will receive 50 metrics.

From my point of view, I don't think this is the right behaviour, but I don't think I have enough exposure to other output systems/protocols, so I'm keen to understand if there are valid scenarios for a serialiser/output plugin to produce different outputs if metrics are processed individually versus in batch. FYI the prometheus serialiser has the same behaviour.
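For reference, the serializer interface being discussed looks roughly like this (paraphrased from telegraf's plugins/serializers package; treat the exact signatures as approximate):

```go
// Package serializers is a paraphrased sketch of telegraf's serializer
// interface; exact signatures may differ between telegraf versions.
package serializers

import "github.com/influxdata/telegraf"

type Serializer interface {
	// Serialize is called once per metric when the output handles metrics
	// individually, so every collected sample reaches the receiver.
	Serialize(metric telegraf.Metric) ([]byte, error)

	// SerializeBatch is called once per flush with the whole batch. The
	// current prometheusremotewrite implementation keeps only the newest
	// sample per metric key here, which is why batch mode emits 50 samples
	// instead of 100 in the 15s-collect / 30s-flush example above.
	SerializeBatch(metrics []telegraf.Metric) ([]byte, error)
}
```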
This is definitely the preferred option when making a breaking change like this. I haven't dug back into this to answer the other questions. I was looking at one of the prometheus PRs and saw the comment about batching.
Thanks @aarnaud and @jhychan for providing context.
Looking into the serializer interface's evolution, it originally didn't have a batch mode. Batch mode was added to allow output formats that are not line oriented, not to give the serializer a chance to drop metrics. Specifically, SerializeBatch was added (#4107) to allow the json serializer to send more than one metric at a time (see #3856). The json serializer has always handled all metrics in the batch, and I think it's always been implied that batch mode should handle all metrics in the batch.

It looks like @aarnaud's use case forced the choice to drop metrics. Dropping duplicates was a good way to make it work with cortexmetrics.io's requirements, but it doesn't quite follow the implied requirements of telegraf's serializer interface. Looking back, it probably would have been cleaner to use a separate processor instance to drop duplicates in the batch. Really the serializer should only be doing one simple job: serializing.

Since we have backward compatibility to worry about, I think the best thing to do is add a setting. I would imagine its toml would look like this:

# How the serializer should handle batched metrics
# valid options:
#   "most_recent"  Pass only the most recent metric
#   "sorted"       Pass all metrics but sort the batch
#   "unmodified"   Pass all metrics in the order they were received
#prometheusremotewrite_batching = "most_recent"

We can change the names if anyone has better ideas. Using a string setting here instead of a bool lets us add other options in the future if we need to. It should default to "most_recent" so config files that don't have the setting at all will act like they did in previous versions of telegraf. If in the future we decide that the default should be "sorted" instead of "most_recent" we can make that change at a major release (2.x, 3.x, etc.) where compatibility isn't guaranteed. We could also deprecate "most_recent" if we are sure another option like "sorted" is better in every way.

Maybe it would be better not to have the serializer sort metrics at all, but to use a processor to sort them before they arrive at the serializer. Again, a serializer really should only serialize. In that case there wouldn't be a "sorted" option.
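A rough sketch of how a setting like that might branch inside the serializer. The type names and option handling below are hypothetical, taken from the proposal above rather than from any merged code:

```go
package prometheusremotewrite

import (
	"fmt"

	"github.com/influxdata/telegraf"
)

// MetricBatching mirrors the proposed "prometheusremotewrite_batching"
// options above. Everything in this file is a hypothetical sketch, not
// merged telegraf code.
type MetricBatching string

const (
	MostRecent MetricBatching = "most_recent" // keep only the newest sample per metric key (current behaviour)
	Sorted     MetricBatching = "sorted"      // keep every sample, each series sorted by timestamp
	Unmodified MetricBatching = "unmodified"  // keep every sample in arrival order
)

// Serializer is a trimmed-down stand-in for the plugin's serializer type.
type Serializer struct {
	Batching MetricBatching
}

// SerializeBatch shows where the setting would branch; the real work of
// building and marshalling the prompb.WriteRequest is elided.
func (s *Serializer) SerializeBatch(metrics []telegraf.Metric) ([]byte, error) {
	switch s.Batching {
	case MostRecent, "":
		// default: deduplicate, keeping the sample with the latest timestamp
	case Sorted:
		// keep all samples and sort each series chronologically
	case Unmodified:
		// keep all samples exactly as they arrived
	default:
		return nil, fmt.Errorf("invalid prometheusremotewrite_batching option %q", s.Batching)
	}
	_ = metrics // placeholder: the real implementation builds a WriteRequest here
	return nil, nil
}
```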
@nvinzens you may want to re-open your PR to get your fix in. I think this bug fix will likely take a while longer.
Required for all PRs
resolves #11682
includes same fixes for #9365 (#11674)
This PR will ensure all metric samples are added to their time series samples object and sent on to the output plugin. Each prometheus time series object should contain all samples in the batch, sorted by the timestamp of each sample.
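A minimal sketch of the collate step described here, assuming the prompb types; the helper names are illustrative rather than taken from the PR:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/prompb"
)

// labelsKey flattens a label set into a map key so that samples belonging
// to the same series end up in the same bucket. The helper name is
// illustrative; the real serializer may build its key differently.
func labelsKey(labels []prompb.Label) string {
	key := ""
	for _, l := range labels {
		key += l.Name + "=" + l.Value + ";"
	}
	return key
}

func main() {
	// Two samples for one series plus one sample for another series.
	cpu := []prompb.Label{{Name: "__name__", Value: "cpu_usage_idle"}}
	mem := []prompb.Label{{Name: "__name__", Value: "mem_used_percent"}}
	incoming := []struct {
		labels []prompb.Label
		sample prompb.Sample
	}{
		{cpu, prompb.Sample{Value: 96.8, Timestamp: 1660000015000}},
		{cpu, prompb.Sample{Value: 97.2, Timestamp: 1660000000000}},
		{mem, prompb.Sample{Value: 41.5, Timestamp: 1660000015000}},
	}

	// Collate: append every sample to its series instead of overwriting
	// the previous one, so nothing in the batch is dropped.
	series := map[string]*prompb.TimeSeries{}
	for _, in := range incoming {
		k := labelsKey(in.labels)
		if _, ok := series[k]; !ok {
			series[k] = &prompb.TimeSeries{Labels: in.labels}
		}
		series[k].Samples = append(series[k].Samples, in.sample)
	}

	// Each series' Samples would then be sorted by Timestamp (see the sort
	// sketch earlier in the thread) before being marshalled into a
	// prompb.WriteRequest.
	for _, ts := range series {
		fmt.Println(ts.Labels[0].Value, "->", len(ts.Samples), "samples")
	}
}
```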
One area of concern is the Histogram type. The original serialiser code has logic to add zero values for missing fields that are part of a histogram metric (sum, count, inf bucket). I have had to remove this logic (temporarily commented out while in draft), as it produces extra sum/count/inf samples for every measurement.
I think the original code assumes that there will only be one period of histogram samples, so it only takes the first sample value. This assumption can't be made when batching metrics: there's no guarantee that a batch will only contain a single period of samples. So I'm thinking all that logic may either need to be removed, or rewritten properly to determine which set of metrics belongs to a given histogram sample period. I'm not too sure on the downstream impact of dropping this and assuming that the histogram metric should always be complete/valid.
FYI @aarnaud, @nvinzens