
fix: sort and send all metric samples in a batch (prometheusremotewrite) #11683

Closed

Conversation

jhychan
Contributor

@jhychan jhychan commented Aug 15, 2022

Required for all PRs

resolves #11682
includes the same fixes for #9365 (#11674)

This PR ensures all metric samples are added to their corresponding time series object and sent on to the output plugin. Each Prometheus time series should contain all samples in the batch, sorted by sample timestamp.
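To illustrate the intended behaviour, here is a minimal Go sketch rather than the actual PR diff; the types are simplified stand-ins for the prompb structures, and the serialised label set is assumed to be the grouping key:

package main

import (
    "fmt"
    "sort"
)

// Simplified stand-ins for the prompb types.
type sample struct {
    Value     float64
    Timestamp int64 // milliseconds since epoch
}

type timeSeries struct {
    Labels  string // serialised label set, used as the grouping key
    Samples []sample
}

// buildTimeSeries keeps every sample in the batch, grouped by label set and
// sorted chronologically, instead of keeping only the newest sample per series.
func buildTimeSeries(batch map[string][]sample) []timeSeries {
    out := make([]timeSeries, 0, len(batch))
    for labels, samples := range batch {
        sort.Slice(samples, func(i, j int) bool {
            return samples[i].Timestamp < samples[j].Timestamp
        })
        out = append(out, timeSeries{Labels: labels, Samples: samples})
    }
    return out
}

func main() {
    batch := map[string][]sample{
        `{__name__="cpu_usage_user",host="a"}`: {
            {Value: 0.9, Timestamp: 1660550415000},
            {Value: 0.5, Timestamp: 1660550400000},
        },
    }
    for _, ts := range buildTimeSeries(batch) {
        fmt.Println(ts.Labels, ts.Samples) // both samples are kept, oldest first
    }
}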

One area of concern is the Histogram type. The original serialiser code has logic to add zero values for missing fields that are part of a histogram metric (sum, count, +Inf bucket). I have had to remove this logic (temporarily commented out while in draft), as it otherwise produces extra sum/count/+Inf samples for every histogram measurement.

I think the original code assumes that there will only be one period of histogram samples, so it only takes the first sample value. This assumption can't be made when batching metrics; there's no guarantee that a batch will only contain a single period of samples. So I'm thinking all that logic may either need to be removed, or rewritten to properly determine which metrics belong to a given histogram sample period. I'm not too sure of the downstream impact of dropping this and assuming that the histogram metric will always be complete/valid.

FYI @aarnaud, @nvinzens

@telegraf-tiger bot added the fix (pr to fix corresponding bug) label on Aug 15, 2022
@nvinzens
Contributor

Awesome, thanks! I will go ahead and close my PR :)

@powersj
Contributor

powersj commented Aug 18, 2022

@reimda,

Hoping you might be able to shed some light on the history of the prometheusremotewrite serializer:

The serialiser builds a map of time series where each metric key maps to a timeseries object containing only a single sample. If a metric with a newer sample is found in the batch, it replaces the older sample. All older samples are silently lost.
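In rough terms, the current behaviour looks like the following (a simplified sketch, not the actual serializer code):

// Simplified sketch: one sample per metric key, and a newer sample
// silently replaces an older one.
type entry struct {
    value     float64
    timestamp int64
}

func addSample(series map[string]entry, key string, s entry) {
    if existing, ok := series[key]; !ok || s.timestamp > existing.timestamp {
        series[key] = s // any older sample with the same key is dropped
    }
}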

See #11682 for a bit more context. It does seem that several issues have been filed when multiple samples show up in a batch and metrics then go missing. However, the README does have a big warning to this effect as well.

Thoughts on a way forward?

@reimda
Contributor

reimda commented Aug 18, 2022

I'm not sure I can add much. The remote write protocol is different than the ordinary pull protocol and I'm not too familiar with it. I wasn't involved in the original PR adding this serializer. I'm not sure why the serializer drops metrics when there is more than one with the same metric key.

It does look like some users expect this behavior though, because there was a bugfix on the timestamp comparison: #10547

If dropping duplicates is a feature that some people expect and use, we can't just remove it. We need to track down why it was included in the first place before we decide what to do. @aarnaud contributed the plugin a couple years ago. If he is still available, maybe he can help us understand?

If dropping duplicates is useful for some use cases but not for others (like @jhychan's) we may need to add a setting so people can configure what they need.

@aarnaud
Contributor

aarnaud commented Aug 19, 2022

Hi,
prometheusremotewrite was written to allow pushing metrics to cortexmetrics.io (which scales very well).
But it also requires metrics to be pushed in order, so if we try to push a metric older than the previous one, the whole batch is refused.
That's why it keeps the most recent value.

Maybe there is a better way to implement prometheusremotewrite, but for now it works well.

[[outputs.http]]
    url = "https://cortex.exemple.com/api/prom/push"
    data_format = "prometheusremotewrite"
    tls_ca = "ca.crt"
    tls_cert = "cert.pem"
    tls_key = "key.pem"
    non_retryable_statuscodes = [400]
    [outputs.http.headers]
      Content-Type = "application/x-protobuf"
      Content-Encoding = "snappy"
      X-Prometheus-Remote-Write-Version = "0.1.0"

@jhychan
Contributor Author

jhychan commented Aug 19, 2022

I'm not sure I can add much. The remote write protocol is different than the ordinary pull protocol and I'm not too familiar with it. I wasn't involved in the original PR adding this serializer. I'm not sure why the serializer drops metrics when there is more than one with the same metric key.

Looking back at the original PR and the related issue, I think it dropped metrics because the code was a copy of the original prometheus serialiser code, which ultimately is a different design. Perhaps @aarnaud can confirm.

But historical context aside - if you consider the purposes of the two serialisers objectively:

The original serialiser code is built for the prometheus pull-based architecture, where a prometheus server regularly scrapes metrics that an agent exposes over an HTTP listener. The prometheus exporter (telegraf in this case) just needs to keep the metric samples up to date. There is no concept of scraping historical metrics - if prometheus fails to scrape metrics at any given time, there's no way for it to ask the agent to go back and present those metrics. The exporter agent should only ever present the most up-to-date metrics at the time of scraping. This is why the prometheus serialiser drops all older samples and presents only the latest one.

In contrast, the remote write protocol is designed to be push-based, like Telegraf is with InfluxDB. So the agent collects metrics over time and writes them to the remote write receiver, with each sample having an associated timestamp. The samples could be sent in batches or one at a time, immediately after they were collected or 60 seconds later. Naturally, these metric samples should be buffered and retry logic added in case of remote write failure. There's no longer a valid reason to drop older samples - all of them should be sent.

So I think this aspect was possibly just overlooked at the time of the original PR. In most cases, no one using this serialiser in batch mode is likely to have observed this as a problem, as long as their remote writes are reliable and their output flush interval is equal to or faster than the metric collection interval. However, if you collect more frequently than you flush, you are effectively not getting the level of time series resolution you expect from telegraf. Additionally, if you lose one or two samples within a 5 minute window, you won't even notice that you are losing samples in Cortex/Mimir, as prometheus metrics have a 5 minute staleness marker by default. The only time you will really notice this is if there is an extended outage - say 5 minutes or more. When things are restored, telegraf should be writing the entire backlog of metrics to the receiver, but the bug here is that it's only writing the latest sample in each batch.

Last thing to note - if you disable batch sends, then you don't lose any metrics. Each metric in a backlog is sent over to the receiver in its own protobuf payload. But this is pretty inefficient over the wire, as the collection of string labels in a time series is often much larger than the sample values.

Regarding the histogram changes - I think it's also somewhat related. Prometheus scraping does some level of metric validation at scrape time, so it requires all histogram metrics to be valid and complete. I don't think the remote write protocol enforces such requirements, as there's no guarantee it will receive all the metrics related to a histogram in a timely fashion, which is why I suspect the histogram validation code probably doesn't apply to remote write.

Prometheus text-based exposition format:
Client

Prometheus remote write compliance:
https://prometheus.io/blog/2021/05/04/prometheus-conformance-remote-write-compliance/

Prometheus remote write protocol specification:
https://docs.google.com/document/d/1LPhVRSFkGNSuU1fBd81ulhsCPR4hkSZyyBj1SZ8fWOM

There is a suite of remote write validation compliance tests written by the team:
https://github.com/prometheus/compliance

Perhaps worth bringing over some of their tests.

@jhychan
Contributor Author

jhychan commented Aug 19, 2022

prometheusremotewrite was written to allow pushing metrics to cortexmetrics.io (which scales very well).
But it also requires metrics to be pushed in order, so if we try to push a metric older than the previous one, the whole batch is refused.
That's why it keeps the most recent value.

Making sure that metric samples are pushed in chronological order is one thing - but dropping older samples in a batch is a different thing entirely. My PR collates all the samples and sorts them chronologically before sending, so within a batch there should not be any out-of-order failures.

However there is always an issue with older metrics falling into a later batch, which no amount of client-side changes can avoid.

@aarnaud
Contributor

aarnaud commented Aug 19, 2022

You're totally right

Looking back at the original PR and the related issue, I think it dropped metrics because the code was a copy of the original prometheus serialiser code, which ultimately is a different design. Perhaps @aarnaud can confirm.

If I remember correctly, yes


@jhychan
Contributor Author

jhychan commented Aug 23, 2022

After reading through the remote write spec (google doc link above), I've removed extra logic that tries to fill in any absent _sum, _count and +Inf _bucket metrics for Histograms. The atomicity of a Histogram is not part of the remote write spec, and the logic doesn't quite translate to batched metrics anyway.

@reimda @powersj
My PR is quite a big change from the existing code, so I understand that we should tread carefully here, even though I do think the plugin is not correctly handling batches of metrics. Let me know if you think this should go into a new plugin, whether we should make the "fixed" behaviour a completely separate configuration option, or something in between.

One question I have to ask about the existing plugin code and the issue generally:
This plugin produces different outputs when metrics are processed individually versus in batches. From your point of view, is this an issue? Or do you see this as perfectly valid behaviour?

Let's say we configure telegraf to collect metrics every 15 seconds, resulting in 50 metrics per collection interval. We configure this plugin to flush every 30 seconds. This means we should have 100 metrics to send to the output every 30s. Now, if we configure this plugin to process metrics individually, the output will receive 100 metrics. But if we configure this plugin to process the metrics in batches, the output will receive 50 metrics.

From my point of view, I don't think this is the right behaviour, but I don't think I have enough exposure to other output systems/protocols, so I'm keen to understand if there are valid scenarios for a serialiser/output plugin to produce different outputs if metrics are processed individually versus in batch. FYI the prometheus serialiser has the same behaviour.

@powersj
Contributor

powersj commented Aug 23, 2022

make the "fixed" behaviour a completely separate configuration option

This is definitely the preferred option when making a breaking change like this.

I haven't dug back into this to answer the other questions. I was looking at one of the prometheus PRs and saw the comment about batching.

@reimda
Contributor

reimda commented Aug 24, 2022

Thanks @aarnaud and @jhychan for providing context.

This plugin produces different outputs when metrics are processed individually versus in batches. From your point of view, is this an issue? Or do you see this as perfectly valid behaviour?

Looking into the serializer interface's evolution, it originally didn't have a batch mode. Batch mode was added to allow output formats that are not line oriented, not to give the serializer a chance to drop metrics. Specifically, SerializeBatch was added (#4107) to allow the json serializer to send more than one metric at a time (see #3856). The json serializer has always handled all metrics in the batch. I think it's always been implied that batch mode should handle all metrics in the batch.

It looks like @aarnaud's use case forced the choice to drop metrics. Dropping duplicates was a good way to make it work with cortexmetrics.io's requirements, but it doesn't quite follow the implied requirements of telegraf's serializer interface.

Looking back, it probably would have been cleaner to use a separate processor instance to drop duplicates in the batch. Really the serializer should only be doing one simple job: serializing.

Since we have backward compatibility to worry about, I think the best thing to do is add a setting. I would imagine its toml would look like this:

# How the serializer should handle batched metrics
# valid options:
#  "most_recent" Pass only the most recent metric
#  "sorted" Pass all metrics but sort the batch
#  "unmodified" Pass all metrics in the order they were received
#prometheusremotewrite_batching = "most_recent"

We can change the names if anyone has better ideas. Using a string setting here instead of bool lets us add other settings in the future if we need to. It should default to "most_recent" so config files that don't have the setting at all will act like they did in previous versions of telegraf.
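For illustration, the serializer could branch on such a setting roughly like this (a sketch only; the mode names follow the proposed toml above, and the applyBatching function is hypothetical, not actual plugin code):

package serializer

import (
    "fmt"
    "sort"
)

type sample struct {
    value     float64
    timestamp int64
}

// applyBatching shows how the three proposed modes could be applied to the
// samples of a single time series.
func applyBatching(mode string, samples []sample) ([]sample, error) {
    if len(samples) == 0 {
        return samples, nil
    }
    switch mode {
    case "most_recent":
        // current behaviour: keep only the newest sample
        newest := samples[0]
        for _, s := range samples[1:] {
            if s.timestamp > newest.timestamp {
                newest = s
            }
        }
        return []sample{newest}, nil
    case "sorted":
        // keep everything, ordered chronologically
        sort.Slice(samples, func(i, j int) bool { return samples[i].timestamp < samples[j].timestamp })
        return samples, nil
    case "unmodified":
        // keep everything in the order it was received
        return samples, nil
    default:
        return nil, fmt.Errorf("unknown prometheusremotewrite_batching value %q", mode)
    }
}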

If in the future we decide that the default should be "sorted" instead of "most_recent" we can make that change at a major release (2.x, 3.x, etc) where compatibility isn't guaranteed. We could also deprecate "most_recent" if we are sure another option like "sorted" is better in every way.

Maybe it would be better not to have the serializer sort metrics at all, but to use a processor to sort them before they arrive at the serializer. Again, a serializer really should only serialize. In that case there wouldn't be a "sorted" option.

@jhychan
Contributor Author

jhychan commented Aug 25, 2022

Awesome, thanks! I will go ahead and close my PR :)

@nvinzens you may want to re-open your PR to get your fix in. I think this bug fix will likely take a while longer.

@nvinzens
Contributor

nvinzens commented Sep 2, 2022

I created an alternative PR for #11674 with a more robust/intrusive solution in #11755.

@powersj
Contributor

powersj commented Oct 20, 2022

I'm going to close this PR for now since #11755 got merged. If you do further work or make changes based on @reimda's response, feel free to re-open. Thanks!
