Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New metric for measuring query duration proposal #4932

Merged
merged 8 commits into from
Dec 23, 2021

Conversation

moadz
Copy link
Contributor

@moadz moadz commented Dec 8, 2021

Signed-off-by: Moad Zardab [email protected]
Related to #4895

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Signed-off-by: mzardab <[email protected]>
@yeya24
Copy link
Contributor

yeya24 commented Dec 11, 2021

That's a really nice improvement on query latency observability, especially with the new query pushdown feature 👍

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great proposal, but as I mentioned offline, it would be awesome to take this latency number right when we return query PromQL response to the user.

This might have to be changed if we try to push query eval to Query Frontent, but no clear proposal for now so let's go for this!

@moadz moadz changed the title New metric for measuring series fanout duration proposal New metric for measuring proxy StoreAPI select duration proposal Dec 22, 2021
@moadz moadz changed the title New metric for measuring proxy StoreAPI select duration proposal New metric for measuring query duration proposal Dec 23, 2021
Comment on lines +258 to +265
* Amend the `seriesServer` to keep track of all `SeriesStats` for each series pushed to it
* Amend the static `qapi.queryableCreate` to take a `SeriesStatsReporter` func parameter that will exfiltrate the seriesStats from the Thanos Proxy StoreAPI
* Add new runtime flags that will allow us to specify a) Query time quantiles b) Series size quantiles c) Sample size quantiles for our partitioned histogram
* Start a query duration timer as soon as the handler is hit
* Create a new partitioned vector histogram called `thanos_query_duration_seconds` in the `queryRange` API handler
* Propagate all exfiltrated `SeriesStats` to aforementioned metric
* Record observations against the `thanos_query_duration_seconds` histogram after bucketing samples_le/series_le buckets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka amended to include the entire query path (with stats exfiltration)

* Do not want to create separate histograms for each individual store query, so they will need to be aggregated at the `Series` request level so that our observations include all
* Do not want to block series receive on a `seriesStats` merging mutex for each incoming response, so maintaining a central `seriesStats` reference and passing it into each of the proxied store requests is out of the question

### Why can't we capture the query shape & latency spanning the entire query path?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can ( :


**tl;dr:** Longer term to capture the entire query path by amending the Prometheus Querier API to return some stats alongside the query, and creating this generic metric inside the Prometheus PromQL engine. Short term, pass a func parameter to the Queryable constructor for the proxy StoreAPI querier that will exfiltrate the `SeriesStats`, circumventing PromQL engine.

### Measuring Thanos Query Latency with respect to query fanout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Measuring Thanos Query Latency with respect to query fanout
### Measuring Thanos Query Latency with respect to query fanout

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! LGTM.

Next time no need for so many details, but I think it's an amazing exercise for a start 👍🏽

@bwplotka bwplotka enabled auto-merge (squash) December 23, 2021 15:51
@bwplotka bwplotka disabled auto-merge December 23, 2021 16:27
@bwplotka
Copy link
Member

Looks perfect. Only two flakes on CI, so merging, unless there are objections from others 👍🏽

@bwplotka bwplotka merged commit cb44778 into thanos-io:main Dec 23, 2021
@clyang82
Copy link
Contributor

Documentation check was not passed so that the following PRs will be failed. @moadz Could you help to fix the Documentation check issue? Thanks.

@moadz
Copy link
Contributor Author

moadz commented Jan 4, 2022

@clyang82 apologies, just saw that this was merged. Thanks @hanjm for rectifying in b828d00

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants