Support fetching multiple documents #2252
Conversation
I went with a new approach that eliminates the need for any …
Don't we have a block cache? @PSeitz Why is this useful?
With the current API there are two issues. …
```rust
///
/// This method is more efficient than calling [`doc`](Self::doc) multiple times, as it batches
/// overlapping requests to segments and blocks.
pub fn docs<D: DocumentDeserialize>(
```
This function could be simplified if we rely on the `BlockCache`. Since the `doc_addresses` are already sorted, we can just call `doc` and always hit the cache. The downside would be that this only works if the `BlockCache` parameter `cache_num_blocks` is not zero.
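A minimal sketch of that simplification, assuming a non-zero `cache_num_blocks` and the `BTreeSet` parameter type discussed later in this review (the exact signature in the PR may differ):

```rust
use std::collections::BTreeSet;

/// Fetch several documents, relying entirely on the block cache: since a
/// `BTreeSet` iterates in sorted order, consecutive lookups within the
/// same block hit the cache after the first miss.
pub fn docs<D: DocumentDeserialize>(
    &self,
    doc_addresses: &BTreeSet<DocAddress>,
) -> crate::Result<Vec<D>> {
    doc_addresses.iter().map(|&addr| self.doc(addr)).collect()
}
```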
I don't love relying on this invariant, since the dependency is non-obvious, but I've left a comment. It certainly does simplify the code here, though.
In the future, we could also use a thread pool to parallelize access to blocks, which may require us to revive the previous code, but until then this is fine.
As written, I don't think this pulls its weight as an API addition, especially since it enforces a data structure (`BTreeSet`) that stays sorted under modification instead of, for example, just sorting a slice of `DocAddress`es.
Maybe a performance hint on `Searcher::doc`, to access multiple documents in address order if possible to improve cache hit rates, would serve the same purpose?
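The suggested alternative might look like this on the caller's side (a sketch; `top_docs` and the concrete document type are placeholders):

```rust
// Sort a plain slice of addresses, then fetch in address order so that
// documents sharing a block hit the StoreReader's cache.
let mut addresses: Vec<DocAddress> =
    top_docs.iter().map(|&(_score, addr)| addr).collect();
addresses.sort_unstable();
let docs: Vec<TantivyDocument> = addresses
    .into_iter()
    .map(|addr| searcher.doc(addr))
    .collect::<tantivy::Result<_>>()?;
```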
```rust
/// Calling [`get_many`](Self::get_many) is more efficient than calling [`get`](Self::get)
/// multiple times, as it will only read and decompress a block once for all documents within a
/// block.
pub fn get_many<D: DocumentDeserialize>(
```
This does not seem to be used any more?
```rust
/// This method is more efficient than calling [`doc_async`](Self::doc_async) multiple times, as
/// it batches overlapping requests to segments and blocks.
#[cfg(feature = "quickwit")]
pub async fn docs_async<D: DocumentDeserialize>(
```
I understand that this is not a particularly technical objection, but I do wonder whether it is helpful to add this as additional API surface to Tantivy itself. User code can already sort `DocAddress`es and dispatch multiple tasks and/or futures for each segment. And user code usually has the best understanding of which trade-off to take w.r.t. parallelism, performance, and locality.
For example, a service serving multiple users would probably aim for increased locality by handling a single user request as a single task accessing documents in address order to improve cache efficiency. Such a service would handle parallelism and concurrency by handling multiple users in multiple tasks.
Furthermore, I have a hard time seeing a situation where additional within-segment concurrency is better than serial, cache-friendly access patterns. Also, the additional data structures required to track this additional concurrency seem rife with trade-offs which user code is just in a better position to make, e.g. using `FuturesUnordered` for concurrency but not parallelism, or spawning multiple tasks onto a `JoinSet` to actually load multiple hardware threads.
Long story short: personally, I am not convinced this API is pulling its weight in terms of complexity and implied trade-offs.
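For illustration, the kind of dispatch user code can already write today might look like this (a sketch assuming the `quickwit` feature and a `futures` dependency; the `fetch_all` helper is made up here). Note that, as the next comment points out, concurrently fetching documents that share a block can duplicate work:

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use tantivy::{DocAddress, Searcher, TantivyDocument};

// Sort addresses for cache-friendly order, then poll one future per
// document concurrently; grouping per segment/task is equally possible.
async fn fetch_all(
    searcher: &Searcher,
    mut addresses: Vec<DocAddress>,
) -> tantivy::Result<Vec<TantivyDocument>> {
    addresses.sort_unstable();
    let mut pending: FuturesUnordered<_> = addresses
        .into_iter()
        .map(|addr| searcher.doc_async(addr))
        .collect();
    let mut docs = Vec::new();
    while let Some(doc) = pending.next().await {
        docs.push(doc?);
    }
    Ok(docs)
}
```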
Not reviewing the PR, but just commenting on why I think there is a need for something in this area:
In the sync world, the best access pattern is pretty evident. Make the cache store one docstore block, and sort your doc ids. If multiple docs need the same block, every doc but the first in each block comes for free (no IO, no decompression, just some quick deserialization). No duplicated work, maximum cache hits.
Now enter the async world. If you sort your docs and fetch multiple docs concurrently, you might end up fetching multiple docs from the same block, concurrently. They will both see a cache miss, issue an IO, decompress the result, put it in the cache, and get the doc of their interest. In quickwit-oss/quickwit#3226 it was found that this very much happens. In this case, user code has pretty much no idea what the best order is. Random order might cause more cache misses; sorted order does cause more cache misses. Issuing requests in some predetermined order that maximizes the chance that the cache entry is already available, while not making accesses so far apart that it gets evicted, was judged an "obscure trick" (which I don't disagree with).
I don't know what the best solution is. Maybe it's a function to fetch multiple docs at once (originally my suggestion), but that makes tantivy choose a trade-off which might not be the best for your application. Maybe it's a function that gives insight into what order to use (something which, given a list of doc ids, tells you which can be fetched concurrently, and which should be fetched once another doc has already warmed the cache), letting you choose your trade-off with the full picture. But one way or another, the status quo is that, short of obscure tricks, you are leaving some performance on the table for no good reason.
I would say that the main problem is that user code lacks the information required to choose the best access pattern. For example, a function which informs user code about the block for a given document (opaquely, like `fn cache_key(DocId) -> usize`) and a way to clone a `StoreReader` (to avoid reopening from disk, but get independent block caches) might suffice to know which documents end up in the same block and to control which I/O is issued against the same cache instance.
(I guess a function which takes something like `&mut [(DocId, usize)]` and sorts that slice with the cache keys filled in might be nice for maximum computational efficiency, but I suspect the cost of repeatedly calling `block_checkpoint` is still dwarfed by the cost of I/O and decompression. Alternatively, `&mut BTreeMap<DocId, usize>` would be more in line with what is proposed here.)
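Assuming the hypothetical `cache_key` helper from this comment existed (it is not an existing tantivy API), the caller-side grouping could be as simple as:

```rust
use std::collections::HashMap;
use tantivy::DocId;

// Group doc ids by the (opaque) block they live in; each group can then
// be fetched against its own cache without evicting the other groups.
fn group_by_block(
    cache_key: impl Fn(DocId) -> usize,
    doc_ids: &[DocId],
) -> HashMap<usize, Vec<DocId>> {
    let mut groups: HashMap<usize, Vec<DocId>> = HashMap::new();
    for &doc_id in doc_ids {
        groups.entry(cache_key(doc_id)).or_default().push(doc_id);
    }
    groups
}
```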
> Maybe it's a function to fetch multiple docs at once (originally my suggestion), but that makes tantivy choose a trade-off which might not be the best for your application.

While I agree that it may be beneficial to have the fine-grained control to choose the trade-off, from an API design perspective I believe the default behavior should be that users don't have to know about these implementation details. This should be an opt-in API. Otherwise, it'd be nice if the non-fine-tuned path had sane, reasonably performant defaults.

> If you sort your docs and fetch multiple docs concurrently, you might end up fetching multiple docs from the same block, concurrently.

This is very good to know. At least in this change, I try to group all documents within the same block into one future to minimize this. Of course, multiple separate calls will not benefit from this grouping, but I'd want to see some cases where that is a common enough pattern to warrant further optimization.
> This is very good to know. At least in this change, I try to group all documents within the same block into one future to minimize this.

I think the point here is exactly that those futures run concurrently against the same cache and hence will evict each other's blocks. After all, fetching multiple blocks concurrently is sort of the point of concurrency here. Hence, I think we should start with cheaply cloning `StoreReader` to give programs control over caches (plus the function to opaquely inform user code about cache keys), and then maybe use this here if we can reach a compromise w.r.t. concurrency/parallelism/etc. for a default API.
(Concurrency, in contrast to parallelism, likely does little in a normal Tantivy local-disk-based setup, as opposed to a directory implementation calling into e.g. S3. Parallelism, on the other hand, would also benefit CPU-bound decompression, but usually needs integration with the runtime to spawn tasks.)
I proposed some preliminary API additions in #2276. Using these, I think fetching multiple documents should consist of:

- sorting all documents into groups with the same cache key, i.e. from the same block
- deciding how many independent block caches should be used and how large those should be
- having futures pull groups of documents and handle them using their private cache instance
- spawning these futures onto independent tasks (parallelism, e.g. `JoinSet`) or polling them concurrently (e.g. `FuturesUnordered`)
- ideally, reusing those store readers (and spawned tasks) to fetch multiple groups of documents

Obviously, there are quite a few trade-offs implied here, the big one being concurrency versus parallelism. Some can possibly be side-stepped, e.g. always using a size-1 block cache for each group and spawning as many futures/tasks as there are groups, which would always avoid conflicts at the cost of more in-memory data structures. Similarly, we could forgo reuse at the same cost of setting up in-memory data structures.
Maybe we can side-step the concurrency/parallelism trade-off here by providing an API that just yields the futures and requires the caller to poll them to completion. So a signature like

```rust
async fn docs_async<D: DocumentDeserialize>(
    &self,
    doc_addresses: impl IntoIterator<Item = DocAddress>,
) -> impl Iterator<Item = impl Future<Output = crate::Result<Vec<D>>>>;
```

which internally sets up the per-segment, per-block groups (e.g. sorting them into `HashMap<SegmentOrdinal, HashMap<CacheKey, Vec<DocId>>>`) with independent size-1 block caches (using e.g. `self.store_readers[segment_ord as usize].fork_cache(1)`) and yields a future for each group, which the caller can then push into a `FuturesUnordered` or spawn onto a `JoinSet` as they see fit.
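Consuming that proposal might look like this on the caller's side (a hedged sketch against the proposed signature above, which is not part of tantivy's released API):

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Each yielded future fetches one block-aligned group of documents
// against its own size-1 cache; the caller picks the polling strategy.
let mut groups: FuturesUnordered<_> = searcher
    .docs_async(doc_addresses)
    .await
    .collect();
let mut docs: Vec<TantivyDocument> = Vec::new();
while let Some(group) = groups.next().await {
    docs.extend(group?);
}
```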
Sorry for somewhat hijacking this PR, but I pushed an implementation of the above to #2276. As discussed, it is still a specific trade-off, but it should avoid cache thrashing while still reusing the existing block cache via `StoreReader::get_async`, and it leaves the decision of whether to use concurrency or parallelism to the user code (which is very little syntactical overhead if it just ends up being `FuturesUnordered`, as the modified test case shows).
This attempts to address #2011 by implementing the following:

- `Searcher::docs()`
- `Searcher::docs_async()`
- `StoreReader::get_many()`
- `StoreReader::get_many_async()`
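Based on the diff excerpts above, usage might look roughly like this (a sketch; the parameter and return types are inferred from the review discussion, e.g. the `BTreeSet` mentioned there, and may differ from the final PR):

```rust
use std::collections::BTreeSet;
use tantivy::{DocAddress, TantivyDocument};

// Collect the addresses to fetch (kept sorted by the BTreeSet), then
// batch the lookups so each block is read and decompressed only once.
let addresses: BTreeSet<DocAddress> =
    top_docs.iter().map(|&(_score, addr)| addr).collect();
let docs: Vec<TantivyDocument> = searcher.docs(&addresses)?;
```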