-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Understand/Improve the performance of date histograms #9310
Comments
I did some more experiments running histogram query on one node setup and tracking the took times:
Initially, I rewrote this as cardinality filter query and took time was significantly lesser for one month:
Even though the took for one month is much lesser (and overall query latency can be significantly reduced), I was not too keen on this approach as it had very limited applications (only cardinality aggregations without aggregation composition) |
While going through the lucene code, I noticed that the critical path is date histogram collector which is very iterative in nature. But, it is very efficient at computing results for many levels of composition with single iteration. Hence, I came up with another way of rewriting the query applicable to more histogram aggregation queries with following properties:
Key idea is to slice the date range itself into multiple queries using rewrite. The response transformation is quite simple - union over disjoint sets. Sharing some results below (doing the rewrite on client instead of server): Partitioning into 2 slices:
Partitioning into 4 slices:
Partitioning into 12 slices:
|
As we can observe from took times above, the most reduction comes from 1 to 2 slices (~50%), increasing the number of slices further does not see as much jump. The limitation from single node compute might also be coming into play If we combine the slicing approach and cardinality specific rewrite, the reduction is more than 80% in latency. Although, that approach does not give any flexibility on the number of slices and limited to specific cardinality query like above. Sample invocation:
|
Applying the same slicing concept on composite date histogram aggregation, we can see similar improvement (26s to 14s):
|
Thanks @jainankitk for the details. While this approach of split (into multiple smaller range queries) and merge (into one result) looks promising, I think the net CPU utilization will remain the same. Or in other words, in a busy system where are all search threads are (nearly) occupied, the performance benefit may not be seen. Also, is there any overhead caused by the number of splits, for example, consider a date aggregation query where the results are aggregated at a month level for an year and say the data is split across 20 shards in 20 different nodes. In that case, instead of 20 network call (between the nodes), this would make 20*12 network calls and there may be some overhead with respect of serializing/de-serializing the request/responses. Have you tried this with a setup containing more number of nodes? |
@jainankitk How do you think this would compare to using concurrent search across two slices? Unless I'm mistaken, the drop in latency comes from running the two slices in parallel in the Using concurrent search, we would mostly end up doing the same thing, especially if segments hold disjoint contiguous time ranges (which isn't guaranteed, but can be made more likely with #9241). |
The impact will be similar.
While the benefits are similar, this approach is more generic (for date histogram aggregations) as it is agnostic of number of segments. Also, the rewrite using |
@backslasht - Completely agree with you on this. Although I don't have any data on this, IMO date histogram aggregations are mostly generated by UI or dashboard components for visualizing the data. They are quite expensive as well (~7s) rendering them too slow or impractical for low latency applications. Hence, the request concurrency should not be high.
Assuming the throughput of such requests is not high, the overhead caused due to 20*12 network calls should not be huge. We can also evaluate the query splitting on data node (instead of coordinator) to avoid network calls overhead. |
@jainankitk how many number of shards you have in index ? Running your query with 2 slice is equivalent to doubling the shard count right ? |
I ran this experiment with single shard, single node.
Yes, but doubling the shard count dynamically is not possible. It already is what it is. Given any shard count, we can always rewrite to slice the query like above
While I don't understand the correlation with higher number of replicas, the rewriting is additive to concurrent segment search, given enough CPU is available |
Let say if we have 3 replica for a primary shard, we will have 4 shards total after all on 4 nodes. |
This is indeed a zero sum game -- as you use more concurrency within a single query, latencies get much better for queries when the cluster is not near red-line (100% CPU saturation), yet red-line QPS (capacity) is no better and maybe a bit worse due to costs of thread switching (Java 19's new light weight threads (Project Loom) may help here). But note that this tradeoff is nearly always very worthwhile. Modern CPUs/clusters have tremendous concurrency that is horribly underutilized during "peace time" (cluster not near 100% CPU saturation). Tapping into this concurrency to make single queries faster is an excellent tradeoff -- you massively reduce latency for the few in-flight queries. Most applications very rarely run their clusters at red-line. It's crazy to me how long it took ES/OS to simply make use of Lucene's intra-concurrency feature -- it's been available in Lucene for a looooong time, and CPUs finally became concurrent enough to make it a no-brainer. In our service (Amazon's customer facing product search) we've used this feature since the very start of the (lengthy!) migration to Lucene and it was vital to the success of that migration. If latency / capacity at red-line is really a problem, OpenSearch could add some "smart" logic to taper the concurrency when the cluster is near capacity, because at that point the added intra-query concurrency can only hurt things. But I would test empirically to see if such smart logic is really necessary -- the thread context switching cost is likely low enough (and decreasing with time with Project Loom) that it's not worth doing that. Really OS should focus on reducing query latencies during peace time (low cluster utilization). |
Note that this is true even without the added intra-query concurrency? With document based replication there is no transactional guarantee (consistent point-in-time view of the Search index). But segment based replication can offer this, if we choose to implement it -- a query can know, even across replicas, precisely which point-in-time view of the index it is searching. Replicas can also preserve (keep open, for a time) recent point-in-time views so that queries can be nearly 100% consistent. Given that document based replication has forever NOT had such guarantees, we should not start adding it just yet with segment replication. But segment replication makes it quite a simple feature (transactionaly consistency) to add later on, if it really matters to users ... |
I tried manually rewriting the query as filter aggregation, but did not notice any performance improvement:
|
Okay, I missed adding the range filter in my filter aggregation query. Due to that the flame graph had MatchAll query on which collector was being run:
|
From small POC based on this, the performance improvement for above specific query is from 6947 to 172 ms: With the rewrite and filter aggregation improvements:
Original:
|
Code changes for poc - jainankitk@c3b447c |
Thanks @jainankitk! The numbers are pretty impressive. |
How are you thinking of implementing this? Are you considering to build a generic query rewrite framework? |
I spent sometime on the code and seems that generic query framework across all query types might be tricky. Although, we will add generic logic which can account for different types of
I am planning to do this rewrite at shard level as it already has the parsing logic and we need not duplicate that on the coordinator layer. I am planning to do the rewrite in two steps:
|
Thanks @backslasht. I also verified that even on smaller duration date histograms like 1 day or 15-20 days, the improvement is still about 4-5x. |
After applying the optimizations to date histogram aggregations, saw below numbers for nyc_taxis aggregation benchmark:
|
Benchmark numbers without the optimization on same setup:
|
Code changes for above benchmark: https://github.com/jainankitk/OpenSearch/compare/db90a415ff2fd428b4f7b3f800a51dc229287cb4..bb2a3b8e8f0250f5e7bc5fa22ba2e3258e8fd8e0 |
Thanks @jainankitk this is exciting!! The benchmark numbers for date histogram operations here are very encouraging. While we are trying out the initial POC with the Date Histogram changes, does it makes sense to convert this issue into a META Issue and breakdown the improvements into specific tracks for improvements further? That way we can encourage some incremental/parallel help from the community while we dive into the optimization. |
Hi @jainankitk, will documentation for this feature be required for 2.12? If so, can you please create the doc issue, let me know who will be creating the initial doc PR, and add this dev issue to the unified tracker project? Thanks! |
I have been looking at understanding and improving the performance of date histograms in Opensearch. For this purpose, I have setup single node cluster, and ingested nyc_taxis dataset.
While running single histogram aggregation request in loop, I collected the following cpu flamegraph:
Trying to understand how can we reduce the scoreRange cost for this request
The text was updated successfully, but these errors were encountered: