High error rate and choking of receives during load test #5452
Thanks for the report. One thing worth checking is CPU usage: CPU might simply be saturated, causing slowdowns. If it's not CPU, then lock contention might be the issue. In both cases profiles would be amazing.
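For whoever picks this up, a minimal sketch of how those profiles could be captured from a Receive pod, assuming a Kubernetes deployment, the default HTTP port 10902, and that the standard Go pprof endpoints are exposed (the pod and namespace names below are hypothetical):

```sh
# Port-forward the Receive HTTP port (pod/namespace names are hypothetical)
kubectl -n thanos port-forward pod/thanos-receive-0 10902:10902

# 30s CPU profile: shows whether CPU is simply saturated
go tool pprof "http://localhost:10902/debug/pprof/profile?seconds=30"

# Mutex profile: shows lock contention (may be empty unless mutex
# profiling is enabled in the binary)
go tool pprof "http://localhost:10902/debug/pprof/mutex"

# Goroutine dump: useful to see where forward requests are blocked
curl -s "http://localhost:10902/debug/pprof/goroutine?debug=2" > goroutines.txt
```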
Also, do I see correctly that we have 6 x 10 million head chunks? Perhaps our replication is too naive and picks series randomly (as @moadz suggested at some point)?
Also, if we have any repro scripts, it would be amazing to link them here (:
60 million head chunks is about right here, as the script is producing 20M active series and we have a replication factor of 3 (20M series x 3 copies = 60M head chunks across the ring).
Ack, makes sense. I thought we had 10M pushed.
It is worth checking
@fpetkovski thanks for the pointer. It looks like in this case there were some resourcing issues, and some of the slowness at least is taken care of by #5566. I'm going to close this based on that comment and will reopen if I notice it again despite having additional resources.
Thanos, Prometheus and Golang version used:
Same behaviour tested with both v0.27.0-rc.0 and v0.25.2.
What happened:
We noticed issues when load testing Thanos Receive to handle twenty million active series at 2 DPM (two datapoints per minute per series, roughly 667k samples/s before replication).
Relevant config from receiver:
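For orientation, a hashring-based Receive deployment with replication factor 3 is typically driven by container args along these lines; the values below are illustrative, not the exact flag set used in this test:

```yaml
# Illustrative Thanos Receive args (hypothetical values; replication factor as in this test)
args:
  - receive
  - --tsdb.path=/var/thanos/receive
  - --label=receive_replica="$(POD_NAME)"   # per-replica external label (name is illustrative)
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --remote-write.address=0.0.0.0:19291
  - --receive.local-endpoint=$(POD_NAME).thanos-receive.thanos.svc.cluster.local:10901
  - --receive.hashrings-file=/etc/thanos/hashring.json
  - --receive.replication-factor=3
  - --objstore.config-file=/etc/thanos/objstore.yaml
```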
Note that the load test has had several successful runs when the replication factor was set to 1, but the issue appears to be reproducible at will with the configuration above.
We have 6 receive replicas running on r5.2xlarge instances.
They are scheduled with memory requests of 55GiB and limits of 64GiB.
We don't observe any OOMKilled events during the high error rate.
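In Kubernetes terms that corresponds to a resources stanza along these lines (a sketch, assuming a standard StatefulSet; CPU settings were not part of this report):

```yaml
# Per-replica resources for thanos-receive (6 replicas on r5.2xlarge nodes,
# which provide 8 vCPUs and 64GiB of memory each)
resources:
  requests:
    memory: 55Gi
  limits:
    memory: 64Gi
```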
We burned through our error budget within minutes, at which point I killed the load test.
We have Jaeger running in the cluster, and I have taken a screenshot of a sample of requests taking in excess of 2m (the forward timeout).
We can see from the span that there are issues with writing to the TSDB.
Memory usage appears well within our resource constraints:
We see the receiver logs spammed with the following:
This is followed by a spamming of:
This all looks similar to what is reported in #4831, but it does appear to be tied to the replication factor.
As I said, we can reproduce this at will, so let me know if there is anything else I can provide that would help with the investigation.