[Query Insights] Capture query-level resource usage metrics #12399
Comments
Draft for the proposed solution: #12449
@ansjcy Thanks for the above solution and the proposed alternative. Do we have any numbers on how much lower the CPU and memory readings are expected to be with the proposed approach? With the proposed approach, it seems that if we stop the task and resource tracking before sending the result back to the coordinator node, we will end up increasing the overall search latency, which is undesirable. For the alternative solution, can we piggyback on any other background jobs that periodically share data with other nodes?
Another draft for the background job approach: #12473
Thanks for the comments!
Based on my measurements of the approach in #12449, the measured usage would be ~10% lower than the actual usage.
Yes, this is another concern I have. The latency impact might be negligible, though, since we are only reading one value and appending a new field, both of which are O(1) operations. But yes, we need to do more benchmarking to understand the performance impact.
Yes, this would be a benefit of this approach: we can add customized data in the future without worrying about adding overhead to search requests.
As I mentioned in the description, there are 2 ways to do this, we either piggyback the resource usages on each node in
A task is considered as
Agreed! I have a draft for this approach, exactly as you described :) #12473
I have a doubt about this, i.e. whether a simple sum is the right way to rank the top N expensive queries. Consider two queries on a 2-node cluster: 1st query: 2nd query: The 2nd query consumed less CPU overall but had more impact because it took down one of the data nodes. So if a user asks for the most expensive query, should we still return the 1st query? That doesn't seem right.
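To make the concern concrete, here is a minimal sketch with made-up per-node CPU numbers (they are not measurements from this issue, and all names are hypothetical). It ranks the same two queries once by the cluster-wide sum and once by the busiest single node, and the two rankings disagree:

```python
# Hypothetical per-node CPU time (ms) for two queries on a 2-node cluster.
# The numbers are invented purely for illustration.
usage_by_query = {
    "query-1": {"node-a": 600, "node-b": 600},   # load spread evenly
    "query-2": {"node-a": 50, "node-b": 1000},   # load concentrated on one node
}


def total_cpu(per_node):
    """Cluster-wide CPU: a simple sum over all nodes."""
    return sum(per_node.values())


def peak_node_cpu(per_node):
    """CPU consumed on the single most loaded node."""
    return max(per_node.values())


def top_query(usage, score):
    """Return the query id with the highest score."""
    return max(usage, key=lambda query_id: score(usage[query_id]))


by_sum = top_query(usage_by_query, total_cpu)
by_peak = top_query(usage_by_query, peak_node_cpu)
print("top by sum:", by_sum)
print("top by peak node:", by_peak)
```

With these assumed numbers the sum picks query-1, while the per-node peak picks query-2, which is the query that put the most pressure on a single data node.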
@ansjcy what do you think about an approach that separates tasks and queries? We already track task resource utilization, and there is a pretty clear link between a query (or, better to say, a search) and the tasks its execution spawns across the nodes. What if we captured the task information separately (as it is now) but made it available after the query finishes and the response is returned to the user (so the completion times are captured accurately)? Yes, that could be a separate call, very likely (or as with
Deciding "which query is more expensive" is out of scope for this issue. This issue focuses only on "how to get the query-level resource usage (CPU, memory usage) metrics". It could be a future improvement to the top n queries feature: we can come up with a "scoring" mechanism to evaluate which query is more expensive based on multiple metrics such as latency, resource usage, and the indices/shards involved.
@reta Thanks for the comment! Yes, in fact the implementation in the draft PR #12473 is similar to what you described. We capture and store the task-level resource usage in the query insights plugin after the tasks finish, and consolidate it on the cluster manager node. It would be nice if you could also take a look at the draft when you have time :)
@sgup432 I think per-query resource estimation would be a tremendously useful feature, but it is not easy to implement. If you have a viable proposal, please share it; otherwise we sadly piggyback on the resource tracking approaches all the time ...
Is your feature request related to a problem? Please describe
The resource tracking framework (#1179) tracks task-level resource usage, such as CPU and memory utilization. However, there is a gap in inferring query-level resource usage from the resource tracking framework. We need to come up with a solution for it, since it would be one of the most important metrics for query insights (#11429) features like top n queries (#11186) and cost estimation (#12390).
Describe the solution you'd like
The most challenging task here is propagating the task-level resource usage information to the coordinator node so that query-level resource usage can be calculated. The most straightforward solution is to piggyback the resource usage data on the SearchPhaseResult node response and use SearchRequestOperationsListener::onPhaseEnd to extract this information from the phase results and forward it to the query insights framework. However, this approach has a limitation: the resource usage data obtained this way may not be entirely accurate. The reason is explained below.
Here's the workflow of a search request and resource tracking: The coordinator node sends requests to data nodes, and the data nodes create tasks to search the shards. On a data node, SearchPhaseResult to the coordinator node;
If we want to piggyback the resource utilization data in SearchPhaseResult, we must retrieve this data before the task is considered "finished." Through some experiments and analysis, I found that reading the resource utilization data before the second step results in CPU and memory utilization up to ~10% lower than the final actual usage. If the results are not accurate, the data would be of no use except for roughly analyzing the overall usage trend and comparing the "relative" resource usage of two queries. We won't be able to use this data to make reliable query cost estimations.
Related component
Search:Query Insights
Describe alternatives you've considered
Another approach is to implement an asynchronous post-processor as part of the query insights data consumption pipeline. This post-processor would periodically gather data from data nodes and correlate it with queries to calculate the final resource usage accurately. While this method ensures the most accurate resource usage data, it comes with the overhead of introducing a periodic job running in the background to collect and share the data between nodes. We need to consider the trade-offs when deciding on the best approach for capturing query-level resource usage.
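The correlation step of such a post-processor could look roughly like the sketch below. It is not taken from the draft PRs; `TaskUsage`, `QueryUsage`, `correlate`, and every field name are hypothetical, and it assumes each finished task record collected from a data node already carries the id of the query that spawned it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskUsage:
    """Final usage of one finished task, as reported by a data node (hypothetical shape)."""
    query_id: str
    node_id: str
    cpu_nanos: int
    memory_bytes: int


@dataclass
class QueryUsage:
    """Running totals for one query across all of its tasks."""
    cpu_nanos: int = 0
    memory_bytes: int = 0
    task_count: int = 0


def correlate(finished_tasks):
    """Group finished task records by query id and sum their usage."""
    totals = defaultdict(QueryUsage)
    for task in finished_tasks:
        usage = totals[task.query_id]
        usage.cpu_nanos += task.cpu_nanos
        usage.memory_bytes += task.memory_bytes
        usage.task_count += 1
    return dict(totals)


# One collection cycle: records gathered from two data nodes (made-up values).
collected = [
    TaskUsage("q1", "node-a", cpu_nanos=300, memory_bytes=1024),
    TaskUsage("q1", "node-b", cpu_nanos=200, memory_bytes=2048),
    TaskUsage("q2", "node-a", cpu_nanos=50, memory_bytes=512),
]
per_query = correlate(collected)
print(per_query["q1"])
```

Because the records are read only after the tasks have finished, the totals reflect the final usage rather than a snapshot taken before the response is sent. The cost is the periodic collection job itself.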
Additional context