Support fetching multiple documents #2252
Conversation
I went with a new approach that eliminates the need for any …
Don't we have a block cache? @PSeitz Why is this useful?
With the current API there are two issues. …
```rust
///
/// This method is more efficient than calling [`doc`](Self::doc) multiple times, as it batches
/// overlapping requests to segments and blocks.
pub fn docs<D: DocumentDeserialize>(
```
This function could be simplified if we rely on the `BlockCache`. Since the `doc_addresses` are already sorted, we can just call `doc` and always hit the cache. The downside would be that this only works if the `BlockCache` parameter `cache_num_blocks` is not zero.
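A minimal sketch of that simplification, assuming a non-zero `cache_num_blocks` and the `BTreeSet` parameter type discussed later in this review (the exact signature in the PR may differ):

```rust
use std::collections::BTreeSet;

/// Fetch several documents, relying entirely on the block cache: since a
/// `BTreeSet` iterates in sorted order, consecutive lookups within the
/// same block hit the cache after the first miss.
pub fn docs<D: DocumentDeserialize>(
    &self,
    doc_addresses: &BTreeSet<DocAddress>,
) -> crate::Result<Vec<D>> {
    doc_addresses.iter().map(|&addr| self.doc(addr)).collect()
}
```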
I don't love relying on this invariant, since the dependency is non-obvious, but I've left a comment. It certainly does simplify the code here, though.
In the future, we could also use a thread pool to parallelize access to blocks, which may require us to revive the previous code, but until then this is fine.
As written, I don't think this pulls its weight as an API addition, especially since it enforces a data structure (`BTreeSet`) that stays sorted under modification instead of, for example, just sorting a slice of `DocAddress`es.
Maybe a performance hint on `Searcher::doc`, to access multiple documents in address order if possible to improve cache hit rates, would serve the same purpose?
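The suggested alternative might look like this on the caller's side (a sketch; `top_docs` and the concrete document type are placeholders):

```rust
// Sort a plain slice of addresses, then fetch in address order so that
// documents sharing a block hit the StoreReader's cache.
let mut addresses: Vec<DocAddress> =
    top_docs.iter().map(|&(_score, addr)| addr).collect();
addresses.sort_unstable();
let docs: Vec<TantivyDocument> = addresses
    .into_iter()
    .map(|addr| searcher.doc(addr))
    .collect::<tantivy::Result<_>>()?;
```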
```rust
/// Calling [`get_many`](Self::get_many) is more efficient than calling [`get`](Self::get)
/// multiple times, as it will only read and decompress a block once for all documents within a
/// block.
pub fn get_many<D: DocumentDeserialize>(
```
This does not seem to be used any more?
```rust
/// This method is more efficient than calling [`doc_async`](Self::doc_async) multiple times, as
/// it batches overlapping requests to segments and blocks.
#[cfg(feature = "quickwit")]
pub async fn docs_async<D: DocumentDeserialize>(
```
I understand that this is not a particularly technical objection, but I do wonder whether it is helpful to add this as additional API surface to Tantivy itself. User code can already sort `DocAddress`es and dispatch multiple tasks and/or futures for each segment. And user code usually has the best understanding of which trade-off to take w.r.t. parallelism, performance, and locality.
For example, a service serving multiple users would probably aim for increased locality by handling a single user request as a single task accessing documents in address order to improve cache efficiency. Such a service would handle parallelism and concurrency by handling multiple users in multiple tasks.
Furthermore, I have a hard time seeing a situation where additional within-segment concurrency is better than serial, cache-friendly access patterns. Also, the additional data structures required to track this additional concurrency seem rife with trade-offs which user code is just in a better position to make, e.g. using `FuturesUnordered` for concurrency but not parallelism, or spawning multiple tasks onto a `JoinSet` to actually load multiple hardware threads.
Long story short: personally, I am not convinced this API is pulling its weight in terms of complexity and implied trade-offs.
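For illustration, the kind of dispatch user code can already write today might look like this (a sketch assuming the `quickwit` feature and a `futures` dependency; the `fetch_all` helper is made up here). Note that, as the next comment points out, concurrently fetching documents that share a block can duplicate work:

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use tantivy::{DocAddress, Searcher, TantivyDocument};

// Sort addresses for cache-friendly order, then poll one future per
// document concurrently; grouping per segment/task is equally possible.
async fn fetch_all(
    searcher: &Searcher,
    mut addresses: Vec<DocAddress>,
) -> tantivy::Result<Vec<TantivyDocument>> {
    addresses.sort_unstable();
    let mut pending: FuturesUnordered<_> = addresses
        .into_iter()
        .map(|addr| searcher.doc_async(addr))
        .collect();
    let mut docs = Vec::new();
    while let Some(doc) = pending.next().await {
        docs.push(doc?);
    }
    Ok(docs)
}
```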
Not reviewing the PR, but just commenting on why I think there is a need for something in this area:
In the sync world, the best access pattern is pretty evident. Make the cache store one docstore block, and sort your doc ids. If multiple docs need the same block, every doc but the first in each block comes for free (no IO, no decompression, just some quick deserialization). No duplicated work, maximum cache hits.
Now enter the async world. If you sort your docs and fetch multiple docs concurrently, you might end up fetching multiple docs from the same block, concurrently. They will both see a cache miss, issue an IO, decompress the result, put it in the cache, and get the doc of their interest. In quickwit-oss/quickwit#3226 it was found that this very much happens. In this case, user code has pretty much no idea what the best order is. Random order might cause more cache misses; sorted order does cause more cache misses. Issuing requests in some predetermined order that maximizes the chance that the cache entry is already available, while not making accesses so far apart that it gets evicted, was judged an "obscure trick" (which I don't disagree with).
I don't know what the best solution is. Maybe it's a function to fetch multiple docs at once (originally my suggestion), but that makes tantivy choose a trade-off which might not be the best for your application. Maybe it's a function that gives insight into what order to use (something which, given a list of doc ids, tells you which can be fetched concurrently, and which should be fetched once another doc has already warmed the cache), letting you choose your trade-off with the full picture. But one way or another, the status quo is that, short of obscure tricks, you are leaving some performance on the table for no good reason.
I would say that the main problem is that user code lacks the information required to choose the best access pattern. For example, a function which informs user code about the block for a given document (opaquely, like `fn cache_key(DocId) -> usize`) and a way to clone a `StoreReader` (to avoid reopening from disk, but get independent block caches) might suffice to know which documents end up in the same block and to control which I/O is issued against the same cache instance.
(I guess a function which takes something like `&mut [(DocId, usize)]` and sorts that slice with the cache keys filled in might be nice for maximum computational efficiency, but I suspect the cost of repeatedly calling `block_checkpoint` is still dwarfed by the cost of I/O and decompression. Alternatively, `&mut BTreeMap<DocId, usize>` would be more in line with what is proposed here.)
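Assuming the hypothetical `cache_key` helper from this comment existed (it is not an existing tantivy API), the caller-side grouping could be as simple as:

```rust
use std::collections::HashMap;
use tantivy::DocId;

// Group doc ids by the (opaque) block they live in; each group can then
// be fetched against its own cache without evicting the other groups.
fn group_by_block(
    cache_key: impl Fn(DocId) -> usize,
    doc_ids: &[DocId],
) -> HashMap<usize, Vec<DocId>> {
    let mut groups: HashMap<usize, Vec<DocId>> = HashMap::new();
    for &doc_id in doc_ids {
        groups.entry(cache_key(doc_id)).or_default().push(doc_id);
    }
    groups
}
```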
> Maybe it's a function to fetch multiple docs at once (originally my suggestion), but that makes tantivy choose a trade-off which might not be the best for your application.

While I agree that it may be beneficial to have the fine-grained control to choose the trade-off, from an API design perspective I believe the default behavior should be that users don't have to know about these implementation details. This should be an opt-in API. Otherwise, it'd be nice if the non-fine-tuned path had sane, reasonably performant defaults.

> If you sort your docs and fetch multiple docs concurrently, you might end up fetching multiple docs from the same block, concurrently.

This is very good to know. At least in this change, I try to group all documents within the same block into one future to minimize this. Of course, multiple separate calls will not benefit from this grouping, but I'd want to see some cases where that is a common enough pattern to warrant further optimization.
> This is very good to know. At least in this change, I try to group all documents within the same block into one future to minimize this.

I think the point here is exactly that those futures run concurrently against the same cache and hence will evict each other's blocks. After all, fetching multiple blocks concurrently is sort of the point of concurrency here. Hence, I think we should start with cheaply cloning `StoreReader` to give programs control over caches (plus the function to opaquely inform user code about cache keys), and then maybe use this here if we can reach a compromise w.r.t. concurrency/parallelism/etc. for a default API.
(Concurrency, in contrast to parallelism, likely does little in a normal Tantivy local-disk-based setup, as opposed to a directory implementation calling into e.g. S3. Parallelism, on the other hand, would also benefit CPU-bound decompression, but usually needs integration with the runtime to spawn tasks.)
I proposed some preliminary API additions in #2276. Using these, I think fetching multiple documents should consist of:

- sorting all documents into groups with the same cache key, i.e. from the same block
- deciding how many independent block caches should be used and how large those should be
- having futures pull groups of documents and handle them using their private cache instance
- spawning these futures onto independent tasks (parallelism, e.g. `JoinSet`) or polling them concurrently (e.g. `FuturesUnordered`)
- ideally, reusing those store readers (and spawned tasks) to fetch multiple groups of documents

Obviously, there are quite a few trade-offs implied here, the big one being concurrency versus parallelism. Some can possibly be side-stepped, e.g. always using a size-1 block cache for each group and spawning as many futures/tasks as there are groups, which would always avoid conflicts at the cost of more in-memory data structures. Similarly, we could forgo reuse at the same cost of setting up in-memory data structures.
Maybe we can side-step the concurrency/parallelism trade-off here by providing an API that just yields the futures and requires the caller to poll them to completion. So a signature like

```rust
async fn docs_async<D: DocumentDeserialize>(
    &self,
    doc_addresses: impl IntoIterator<Item = DocAddress>,
) -> impl Iterator<Item = impl Future<Output = crate::Result<Vec<D>>>>;
```

which internally sets up the per-segment, per-block groups (e.g. sorting them into `HashMap<SegmentOrdinal, HashMap<CacheKey, Vec<DocId>>>`) with independent size-1 block caches (using e.g. `self.store_readers[segment_ord as usize].fork_cache(1)`) and yields a future for each group, which the caller can then push into a `FuturesUnordered` or spawn onto a `JoinSet` as they see fit.
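Consuming that proposal might look like this on the caller's side (a hedged sketch against the proposed signature above, which is not part of tantivy's released API):

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Each yielded future fetches one block-aligned group of documents
// against its own size-1 cache; the caller picks the polling strategy.
let mut groups: FuturesUnordered<_> = searcher
    .docs_async(doc_addresses)
    .await
    .collect();
let mut docs: Vec<TantivyDocument> = Vec::new();
while let Some(group) = groups.next().await {
    docs.extend(group?);
}
```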
Sorry for somewhat hijacking this PR, but I pushed an implementation of the above to #2276. As discussed, it is still a specific trade-off, but it should avoid cache thrashing while still reusing the existing block cache via `StoreReader::get_async`, and it leaves the decision of whether to use concurrency or parallelism to the user code (which is very little syntactical overhead if it just ends up being `FuturesUnordered`, as the modified test case shows).
This attempts to address #2011 by implementing the following:

- `Searcher::docs()`
- `Searcher::docs_async()`
- `StoreReader::get_many()`
- `StoreReader::get_many_async()`
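Based on the diff excerpts above, usage might look roughly like this (a sketch; the parameter and return types are inferred from the review discussion, e.g. the `BTreeSet` mentioned there, and may differ from the final PR):

```rust
use std::collections::BTreeSet;
use tantivy::{DocAddress, TantivyDocument};

// Collect the addresses to fetch (kept sorted by the BTreeSet), then
// batch the lookups so each block is read and decompressed only once.
let addresses: BTreeSet<DocAddress> =
    top_docs.iter().map(|&(_score, addr)| addr).collect();
let docs: Vec<TantivyDocument> = searcher.docs(&addresses)?;
```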