Thanos Store should always prefer higher resolution data when possible #1170
Comments
Interesting question, let's discuss it. I think the current implementation is what we need because:
No, because auto downsampling decides based on step. So you can query a 1h range from 14 days ago and still touch raw data (if you have it).
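For readers of this thread, a minimal sketch of what "decides based on step" could look like, assuming the querier derives its maximum source resolution from the query step; the divisor of 5 is an assumption made for illustration, not a confirmed Thanos constant:

```go
package main

import (
	"fmt"
	"time"
)

// maxSourceResolution is an illustrative take on auto-downsampling:
// the querier picks the coarsest resolution it is allowed to read based on
// the query step. The divisor of 5 is an assumption made for this sketch.
func maxSourceResolution(step time.Duration) time.Duration {
	return step / 5
}

func main() {
	// A 1h range queried with a small step stays on raw data,
	// even if the samples are 14 days old.
	fmt.Println(maxSourceResolution(15 * time.Second)) // 3s    -> only raw blocks qualify
	fmt.Println(maxSourceResolution(1 * time.Hour))    // 12m0s -> 5m blocks qualify too
}
```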
Let me stop you right here, this will never work with downsampled data because your rate interval is too low. With 5m resolution and 1h resolution you probably don't have enough samples to calculate a rate. You should use
Sorry, I don't get why there would be a gap. Due to the problem described above?
Sorry for deleting the last message, I had to recheck this with the latest version. Essentially, let's use a picture, because as we know one picture is worth a thousand words. For example, imagine we have:
and it would return a nice, continuous line, since we do actually have high resolution data in remote object storage. However, it does not, due to the behavior of the function described.
Sure, but you don't want to fall back to raw data ONLY because someone made a mistake and queried 1 year of data. That's why downsampling exists: to avoid querying raw data if it's not needed. Maybe we should think about some warnings that would pop up in Grafana? We could deduce this mistake (:
But I get the problem now @GiedriusS, thanks for the picture. I wonder what others are thinking about this problem.
I had an idea: maybe if Grafana gets Thanos integration and we have the ability to select min/max resolution, then this should get solved automatically without any change, because it would become evident to the person doing a query what's happening "under the hood".
I'm trying to think of a way where we don't need an explicit integration, as Grafana wasn't the biggest fan of this and it also makes the "upgrade your Prometheus to Thanos" argument weaker. Maybe we can make use of Prometheus warnings in the query result? I believe Grafana renders those. For example, if
A warning would be nice enough, but that also means we have an additional argument to drive warnings support in Grafana closely. (:
I think maybe there are two issues here. Now (2) feeds into (1), as preferring high resolution data means we see the issue in (1) a bit less. But (1) is going to be an issue anyway, because a user could reasonably make such a request. IMO:
Another point to consider: you can write a "bare" query like: |
^ That's the reason I can't go to production with Thanos yet. Not sure it's the job of Grafana to be aware of problems like this. Handling this at the data source may get around another issue: if I use Trickster to cache data, and query for
@raffraffraff, sorry for the delay (:
Please see this doc to read about downsampling use cases. The use case is NOT to zoom in on old data, but rather to query long time ranges. We were not clear enough about this from the beginning, sorry for that.
Yup, that's why we plan to contribute more to Cortex Cache for now to allow better caching. It splits per day, so it won't have that issue. Right now it works fine against Thanos but will not really use downsampled data, as the step is most likely too low (we hit the issue we discuss in this ticket). Ideally, related to this issue, I think we should actually consider assuming that you have raw data all the time. We may even consider dropping different retentions per resolution, but let's discuss that in another ticket.
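As a side note on "it splits per day": a minimal sketch of the idea, under the assumption that the cache splits a long-range query into day-aligned sub-queries (this is not the actual Cortex implementation):

```go
package main

import (
	"fmt"
	"time"
)

// splitByDay cuts [start, end) into day-aligned sub-ranges so that each day
// can be cached and reused independently. Illustration of the idea only.
func splitByDay(start, end time.Time) [][2]time.Time {
	var parts [][2]time.Time
	for cur := start; cur.Before(end); {
		next := cur.Truncate(24 * time.Hour).Add(24 * time.Hour) // next UTC midnight
		if next.After(end) {
			next = end
		}
		parts = append(parts, [2]time.Time{cur, next})
		cur = next
	}
	return parts
}

func main() {
	end := time.Now().UTC()
	start := end.Add(-3 * 24 * time.Hour)
	for _, p := range splitByDay(start, end) {
		fmt.Println(p[0], "->", p[1])
	}
}
```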
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am wondering what the current plan is. The current behavior, which chooses the lowest resolution and falls back on higher resolutions, basically renders using Grafana to query historical data infeasible. We only use Store to query 5m- and 1h-resolution data, and when both resolutions exist we would want to use the 5m resolution. Raw data is only served by Sidecar. After going over all the downsampling-related issues in #1705, I can't seem to find what we are trying to do to make Thanos prefer higher resolution data when different resolutions exist. Did I miss anything?
Hi @bwplotka
We use Grafana to query Thanos data, and our users' dashboards set a fixed step in their queries. So when they zoom out, the range increases but the step stays unchanged, and we can't serve the long-range query from downsampled data because the raw data in that range can't be returned by Thanos Store in a single query. So, what should be the best practice for both short-range (e.g. now-2d) and long-range (e.g. now-20d) queries? And can the step be adapted to the query range in Grafana? Thanks for your suggestions. :)
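On adapting the step to the range: dashboards typically derive the step from the time range and a target number of points, which is roughly what Grafana's interval variable does. A minimal sketch of that calculation; the function name, the clamp, and the rounding are illustrative assumptions:

```go
package main

import (
	"fmt"
	"time"
)

// adaptiveStep picks a query step so that a panel renders roughly
// maxDataPoints samples regardless of the selected time range, clamped to a
// minimum step. The name, clamp, and rounding are illustrative, not Grafana's.
func adaptiveStep(rangeDur time.Duration, maxDataPoints int, minStep time.Duration) time.Duration {
	step := rangeDur / time.Duration(maxDataPoints)
	if step < minStep {
		step = minStep
	}
	return step.Round(time.Second)
}

func main() {
	fmt.Println(adaptiveStep(2*24*time.Hour, 1000, 15*time.Second))  // ~2m53s for now-2d
	fmt.Println(adaptiveStep(20*24*time.Hour, 1000, 15*time.Second)) // ~28m48s for now-20d
}
```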
Currently, bucket.getFor() first tries to select the lowest-resolution (most downsampled) data. Only after that is done does it jump to higher-resolution data. I think this should be changed because it's counter-intuitive.

Rationale
Range vectors would work more predictably. In our case, we retain RAW data for 31 days, data downsampled to 5 minutes is retained for 91 days, and data downsampled to 1 hour is retained for 1.5 years. Note that retention policies are a completely separate thing from when we actually perform downsampling. Those two things are defined here: https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/downsample.go#L172 and https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/downsample.go#L193, i.e. 5m blocks are carved out once a block becomes longer than 40 hours, and 1h blocks are carved out once a block is longer than 10 days (240 hours).
Compaction happens at these block sizes: https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/compact.go#L32. So, this practically means that after 2 days from the current moment you will only get 5m downsampled data, and after 14 days, only 1-hour downsampled data.
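For illustration only (this is not the real bucket.getFor()), a minimal sketch of the selection rule being proposed: for any point in the requested range, serve the most detailed data that still exists, and fall back to coarser resolutions only where the finer data is gone. The retention figures below are taken loosely from the numbers above:

```go
package main

import "fmt"

// block is a simplified stand-in for a stored block: a time range plus the
// resolution of its samples in ms (0 = raw, 300000 = 5m, 3600000 = 1h).
type block struct {
	minT, maxT int64 // millis, [minT, maxT)
	resolution int64 // millis; a smaller value means more detail
}

// finestAvailable returns, for a single timestamp, the most detailed
// resolution among the blocks covering it, or -1 if nothing covers it.
func finestAvailable(blocks []block, t int64) int64 {
	best := int64(-1)
	for _, b := range blocks {
		if t < b.minT || t >= b.maxT {
			continue
		}
		if best == -1 || b.resolution < best {
			best = b.resolution
		}
	}
	return best
}

func main() {
	const hour = int64(3600000)
	day := 24 * hour
	now := 600 * day // an arbitrary "now" far enough from zero

	blocks := []block{
		{minT: now - 31*day, maxT: now, resolution: 0},                 // raw, retained 31d
		{minT: now - 91*day, maxT: now - 2*day, resolution: 300000},    // 5m, exists after ~2d
		{minT: now - 540*day, maxT: now - 14*day, resolution: 3600000}, // 1h, exists after ~14d
	}

	for _, daysAgo := range []int64{1, 60, 200} {
		fmt.Printf("%3dd ago -> resolution %dms\n", daysAgo, finestAvailable(blocks, now-daysAgo*day))
	}
	// 1d ago -> 0ms (raw), 60d ago -> 300000ms (5m), 200d ago -> 3600000ms (1h)
}
```

Under a rule like this, a now-20d query would be served entirely from raw data (it sits inside the 31-day raw retention), which is the continuous graph the rest of this issue asks for.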
Now, imagine writing a query like:
rate(http_request_total[5m])
(as suggested by Grafana's Explore UI) and you want to execute it on a time range from now till now-20d. It might very well happen that you will start seeing gaps after 14 days in your dashboard. In my opinion, a user would expect to see a nice, continuous graph with such a query even with a 20-day time range. It is a bit counter-intuitive to see gaps due to "missing data" because of how bucket.getFor() selects data. The only caveat I see is higher RAM usage in Thanos Store, but that could be helped with the other, ongoing work on being able to select the minimum resolution.
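To make the gaps concrete: rate() needs at least two samples inside its range vector, so a [5m] window cannot produce a value from 1h-resolution data. A back-of-the-envelope sketch (not Thanos code) of that arithmetic:

```go
package main

import (
	"fmt"
	"time"
)

// samplesInWindow estimates how many samples of a given resolution
// (or scrape interval, for raw data) fit into a range-vector window.
func samplesInWindow(window, resolution time.Duration) int {
	return int(window / resolution)
}

func main() {
	window := 5 * time.Minute
	for _, res := range []time.Duration{15 * time.Second, 5 * time.Minute, time.Hour} {
		n := samplesInWindow(window, res)
		ok := n >= 2 // rate() needs at least two samples inside the window
		fmt.Printf("resolution %-7s -> %2d sample(s) in [5m], rate() can work: %v\n", res, n, ok)
	}
	// Raw data gives ~20 samples, 5m data roughly one, 1h data none: exactly
	// the gap pattern described above once only downsampled blocks remain.
}
```

Widening the range vector along with the resolution avoids the empty windows, which is why a fixed [5m] breaks once only 1h data exists for part of the range.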
Thoughts?