Prometheus remote write with Thanos out of order metrics stops metrics processing #9365
Comments
We've run into the same problem. In our specific case, AWS CloudWatch metrics were being ingested out of order due to the frequency of ingestion. We were able to fix this by collecting more regularly, but we are not convinced that will be a valid solution for all future use cases.

The default FIFO ordering causes Cortex to reject samples for the same basic reason Thanos rejects them. In our case, the metrics buffer eventually fills, causing the offending out-of-order metrics to be dropped. At that point, delivery of metrics resumes, but a certain number of data points are lost as collateral damage.

Ideally, Telegraf should order delivery by metric timestamp instead of FIFO, and it should be able to detect and drop out-of-order metrics itself. Perhaps detecting and dropping out-of-order and duplicate samples could be implemented as a processor, e.g. something similar to dedup, except that it tracks the latest sample per series and drops both duplicate and older samples?

As far as I know, remote_write operates on batches of metrics rather than individual metrics, and I'm not sure there is a mechanism to determine which metrics are being delivered out of order so they can be dropped individually. In our experience, it's pretty common for the entire batch of metrics to be dropped.
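For reference, the existing dedup processor mentioned in the comment above only suppresses metrics whose field values have not changed within a window; below is a minimal sketch of enabling it, with an illustrative interval value. The processor proposed here would go further and also drop any sample whose timestamp is not newer than the latest one seen for its series.

[[processors.dedup]]
  ## Maximum time to suppress output of an unchanged metric
  dedup_interval = "600s"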
We see this same issue with our "edge" compute sites that report into Thanos Receive.
These out-of-order metrics will completely break setups where Telegraf is remote writing to Thanos once the next Thanos release ships, because thanos-io/thanos#5508 has been merged. To figure out which label is responsible, we compiled our own Thanos with some additional debug info:
The output above shows which special Prometheus label is responsible. We will try to figure out where this happens, but any help is appreciated.
FYI, there is now an option for the HTTP output, non_retryable_statuscodes, which can be used as a workaround.
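A minimal sketch of that workaround, assuming a Thanos Receive endpoint (the URL below is a placeholder): marking 409 as non-retryable makes Telegraf drop a rejected batch instead of retrying it indefinitely.

[[outputs.http]]
  url = "https://thanos-receive.example.com/api/v1/receive"
  data_format = "prometheusremotewrite"
  ## Do not retry batches rejected with these status codes
  non_retryable_statuscodes = [409]
  [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"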
@MyaLongmire does this work for you?
Thanos
OS
Telegraf config
[global_tags]
hostname = "myhostname"
host_ip = "__ip__"
host_network = "__ip__"
os = "debian"
os_major = "11"
telegraf_version = "1.24"
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = "1s"
logfile = "/dev/null"
omit_hostname = false
[[outputs.http]]
url = "https://thanos-dev-receive.example.com/api/v1/receive"
non_retryable_statuscodes = [409, 413]
use_batch_format = false  # enabled/disabled doesn't matter
data_format = "prometheusremotewrite"
[outputs.http.headers]
Content-Type = "application/x-protobuf"
Content-Encoding = "snappy"
X-Prometheus-Remote-Write-Version = "0.1.0"
# -----------------------------------------------
# INPUTS
# -----------------------------------------------
[[inputs.bcache]]
bcachePath = "/sys/fs/bcache"
[[inputs.bond]]
[[inputs.conntrack]]
dirs = ["/proc/sys/net/netfilter"]
[[inputs.cpu]]
[[inputs.diskio]]
devices = ["sd*", "vd*"]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
[[inputs.ipvs]]
[[inputs.processes]]
[[inputs.mdstat]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.nfsclient]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.ntpq]]
options = "-p"
[[inputs.kernel_vmstat]]
Telegraf start
Thanos receiver logs
{
"caller": "writer.go:163",
"component": "receive-writer",
"level": "warn",
"msg": "Error on series with out-of-order labels",
"numDropped": 825,
"tenant": "default-tenant",
"ts": "2022-09-30T13:19:56.282263846Z"
}
When pushing metrics to a Thanos Receive endpoint, Thanos returns an HTTP 409 Conflict response if the metrics are out of order. Telegraf then keeps retrying the same batch of metrics until action is taken to resolve the issue. Thanos, however, expects clients to understand that 409 is a conflict and not retry sending those metrics. As a result, new metrics processed by Telegraf fill the buffer and are not delivered to Thanos until the conflict is resolved. See thanos-io/thanos#1509 (comment), where 409 is returned if metrics are out of order and the expected behavior for the remote write client is to not retry.
Relevant telegraf.conf:
System info:
Telegraf: 1.18.2
OS: Debian-based Docker container
Docker
Steps to reproduce:
sampledata.txt
Expected behavior:
Ideally, I would love to see all 7 metrics received by Thanos; some subset would also be acceptable. However, stopping metric processing and retrying forever prevents any new metrics from reaching Thanos. It would be fine if the conflicting metrics, or even the whole batch, were dropped, but the metrics pipeline should not stall when a 409 is received.
Actual behavior:
When Thanos responds with an HTTP 409, Telegraf keeps retrying the same batch of metrics. This causes the buffer to fill, preventing any new metrics from being delivered.
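For context, when the buffer fills is governed by the agent buffer settings; the values below are a sketch copied from the config shared earlier in this thread. Once metric_buffer_limit is reached, the oldest buffered metrics are dropped to make room for new ones.

[agent]
  ## Metrics are sent to each output in batches of at most this many metrics
  metric_batch_size = 1000
  ## Maximum number of unsent metrics buffered per output; when the buffer
  ## is full, the oldest metrics are dropped to make room for new ones
  metric_buffer_limit = 10000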
Additional info: