
Support fetching multiple documents #2252

Open
GodTamIt wants to merge 3 commits into main

Conversation

GodTamIt (Contributor)

This attempts to address #2011 by implementing the following (usage sketched after the list):

  • Searcher::docs()
  • Searcher::docs_async()
  • StoreReader::get_many()
  • StoreReader::get_many_async()
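
A rough sketch of the intended call site, to make the shape of the batched API concrete. The parameter and return types here are assumptions based on the review discussion below (which mentions passing a BTreeSet of DocAddresses), not the final signature:

use std::collections::BTreeSet;

use tantivy::{DocAddress, Searcher, TantivyDocument};

// Hypothetical usage of the proposed `Searcher::docs`.
fn fetch_hits(searcher: &Searcher, hits: &[DocAddress]) -> tantivy::Result<Vec<TantivyDocument>> {
    // A BTreeSet keeps the addresses sorted by (segment_ord, doc_id),
    // which is what allows batching per segment and per block.
    let addresses: BTreeSet<DocAddress> = hits.iter().copied().collect();
    searcher.docs::<TantivyDocument>(addresses)
}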

(Four outdated review threads on src/core/searcher.rs, resolved.)
@GodTamIt (Contributor, Author)

I went with a new approach that eliminates the need for any unsafe code and removes the need to create the nested structures. Let me know what y'all think.

@fulmicoton (Collaborator)

Don't we have a block cache? @PSeitz, why is this useful?

@PSeitz (Contributor) commented Nov 15, 2023

With the current API, there are two issues.

  1. To avoid eviction issues in the block cache, the docs need to be sorted, but this is not obvious to the user (a cache-friendly pattern is sketched below).
  2. Skipping inside a block is potentially done multiple times rather than incrementally (not sure if that really matters).
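
For illustration, here is the cache-friendly pattern that issue 1 currently expects users to discover on their own, as a minimal sketch against the existing API:

use tantivy::{DocAddress, Searcher, TantivyDocument};

// Sort the addresses so that documents from the same docstore block are
// fetched back to back: each block is read and decompressed once, and
// the remaining lookups in that block hit the block cache.
fn fetch_sorted(
    searcher: &Searcher,
    mut hits: Vec<DocAddress>,
) -> tantivy::Result<Vec<TantivyDocument>> {
    hits.sort(); // DocAddress orders by (segment_ord, doc_id)
    hits.into_iter()
        .map(|address| searcher.doc::<TantivyDocument>(address))
        .collect()
}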

///
/// This method is more efficient than calling [`doc`](Self::doc) multiple times, as it batches
/// overlapping requests to segments and blocks.
pub fn docs<D: DocumentDeserialize>(
Contributor:

This function could be simplified if we rely on the BlockCache. Since the doc_addresses are already sorted, we can just call doc and always hit the cache.

The downside would be that this only works if the BlockCache parameter cache_num_blocks is not zero.
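
A minimal sketch of that simplification, assuming the addresses arrive pre-sorted (e.g. as a BTreeSet<DocAddress>) and that the body lives in impl Searcher with its existing imports:

// Delegating to `doc` means every lookup after the first one in a block
// is served from the block cache, provided cache_num_blocks > 0.
pub fn docs<D: DocumentDeserialize>(
    &self,
    doc_addresses: BTreeSet<DocAddress>,
) -> crate::Result<Vec<D>> {
    doc_addresses
        .into_iter()
        .map(|doc_address| self.doc::<D>(doc_address))
        .collect()
}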

Contributor (Author):

I don't love relying on this invariant, since the dependency is non-obvious, but I've left a comment. It certainly does simplify the code here, though.

In the future, we could also use a thread pool to parallelize access to blocks, which may require us to revive the previous code, but until then this is fine.

Collaborator:

As written, I don't think this pulls its weight as an API addition, especially since it enforces a data structure (BTreeSet) that stays sorted under modification instead of, for example, just sorting a slice of DocAddresses.

Maybe a performance hint on Searcher::doc, suggesting that multiple documents be accessed in address order where possible to improve cache hit rates, would serve the same purpose?
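
For instance, such a hint could be a doc comment on the existing method (a sketch, not actual documentation):

/// Fetches a document from tantivy's store given a [`DocAddress`].
///
/// Performance hint: when fetching several documents, access them in
/// address order (sorted by segment ordinal, then doc id) where
/// possible. Documents within the same docstore block then hit the
/// block cache, so each block is read and decompressed only once.
pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
    // ... existing implementation ...
}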

/// Calling [`get_many`](Self::get_many) is more efficient than calling [`get`](Self::get)
/// multiple times, as it will only read and decompress a block once for all documents within a
/// block.
pub fn get_many<D: DocumentDeserialize>(
Collaborator:

This does not seem to be used any more?

/// This method is more efficient than calling [`doc_async`](Self::doc_async) multiple times, as
/// it batches overlapping requests to segments and blocks.
#[cfg(feature = "quickwit")]
pub async fn docs_async<D: DocumentDeserialize>(
@adamreichold (Collaborator) commented Nov 26, 2023

I understand that this is not a particularly technical objection, but I do wonder whether it is helpful to add this as additional API surface to Tantivy itself. User code can already sort DocAddress and dispatch multiple tasks and/or futures for each segment. And user code usually has the best understanding of which trade-off to take w.r.t. parallelism, performance, and locality.

For example, a service serving multiple users would probably aim for increased locality by handling a single user request as a single task accessing documents in address order to improve cache efficiency. Such a service would handle parallelism and concurrency by handling multiple users in multiple tasks.

Furthermore, I have a hard time seeing a situation where additional within-segment concurrency is better than serial cache-friendly access patterns. Also, the additional data structures required to track this additional concurrency seem rife with trade-offs which user code is just in a better position to make, e.g. using FuturesUnordered for concurrency but not parallelism, or spawning multiple tasks into a JoinSet to actually load multiple hardware threads.

Long story short, personally, I am not convinced this API is pulling its weight in terms of complexity and implied trade-offs.

Contributor:

Not reviewing the PR, but just commenting on why I think there is a need for something in that area:

In the sync world, the best access pattern is pretty evident. Make the cache store one docstore block, and sort your doc ids. If multiple docs need the same block, every doc but the first of each block comes for free (no IO, no decompression, just some quick deserialization). No duplicated work, maximum cache hits.

Now enter the async world. If you sort your docs and fetch multiple docs concurrently, you might end up fetching multiple docs from the same block, concurrently. They will each see a cache miss, issue an IO, decompress the result, put it in the cache, and get the doc of interest. In quickwit-oss/quickwit#3226 it was found that this very much happens. In this case, user code has pretty much no idea what the best order is. A random order might cause more cache misses; a sorted order does cause more cache misses. Issuing requests in some predetermined order that maximizes the chance that the block is already cached, while not spacing accesses so far apart that it gets evicted, was judged an "obscure trick" (which I don't disagree with).

I don't know what the best solution is. Maybe it's a function to fetch multiple docs at once (originally my suggestion), but that makes tantivy choose a trade-off which might not be the best for your application. Maybe it's having a function that gives insight into what order to use (something which, given a list of doc ids, tells you which can be fetched concurrently and which should be fetched once another doc has already warmed the cache), letting you choose your trade-off with the full picture. But one way or another, the status quo is that, short of obscure tricks, you are leaving some performance on the table for no good reason.

Collaborator:

I would say that the main problem is that user code lacks the information required to choose the best access pattern. For example, a function which informs user code about the block for a given document (opaquely, like fn cache_key(DocId) -> usize) and a way to clone a StoreReader (to avoid reopening from disk, but with independent block caches) might suffice to know which documents end up in the same block and to control which I/O is issued against the same cache instance.

(I guess a function which takes something like &mut [(DocId, usize)], fills in the cache keys, and sorts that slice might be nice to get the most computational efficiency, but I suspect the cost of repeatedly calling block_checkpoint is still dwarfed by the cost of I/O and decompression. Alternatively, &mut BTreeMap<DocId, usize> would be more in line with what is proposed here.)
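
A sketch of how user code might use such an API; cache_key is the hypothetical function proposed above, not an existing tantivy method:

use std::collections::BTreeMap;

use tantivy::store::StoreReader;
use tantivy::DocId;

// Group doc ids by the opaque key of the block containing them; docs
// sharing a key can then be fetched against the same cache instance.
fn group_by_block(reader: &StoreReader, doc_ids: &[DocId]) -> BTreeMap<usize, Vec<DocId>> {
    let mut groups: BTreeMap<usize, Vec<DocId>> = BTreeMap::new();
    for &doc_id in doc_ids {
        let key = reader.cache_key(doc_id); // hypothetical API
        groups.entry(key).or_default().push(doc_id);
    }
    groups
}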

@GodTamIt (Contributor, Author) commented Dec 10, 2023

> Maybe it's a function to fetch multiple docs at once (originally my suggestion), but that makes tantivy choose a trade-off which might not be the best for your application.

While I agree that it may be beneficial to have the fine-grained control to choose the trade-off, from an API design perspective I believe the default behavior should be that users don't have to know about these implementation details; fine-grained control should be an opt-in API. Otherwise, it'd be nice if the non-fine-tuned path had sane, reasonably performant defaults.

> If you sort your docs and fetch multiple docs concurrently, you might end up fetching multiple docs from the same block, concurrently.

This is very good to know. At least in this change, I try to group all documents within the same block into one future to minimize this. Of course, multiple separate calls will not benefit from this grouping, but I'd want to see some cases where that is a common-enough pattern to warrant further optimization.

Collaborator:

> This is very good to know. At least in this change, I try to group all documents within the same block into one future to minimize this. Of course, multiple separate calls will not benefit from this grouping, but I'd want to see some cases where that is a common-enough pattern to warrant further optimization.

I think the point here is exactly that those futures run concurrently against the same cache and hence will evict each other's blocks. After all, fetching multiple blocks concurrently is sort of the point of concurrency here. Hence, I think we should start with cheaply cloning StoreReader to give programs control over caches (and the function to opaquely inform user code about cache keys) and then maybe use this here if we can reach a compromise w.r.t. concurrency/parallelism/etc. for a default API.

(Concurrency, in contrast to parallelism, likely does little in a normal Tantivy local-disk-based setup, as opposed to a directory implementation calling into e.g. S3. Parallelism, on the other hand, would also benefit CPU-bound decompression, but usually needs integration with the runtime to spawn tasks.)
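
A sketch of the cheap-cloning idea; fork_cache matches the name floated for #2276 further down, but the method and its signature are assumptions here:

use tantivy::store::StoreReader;

// Hypothetical: each forked reader shares the underlying data but owns
// an independent block cache (here of a single block), so concurrent
// tasks stop evicting each other's blocks.
fn reader_per_task(reader: &StoreReader, num_tasks: usize) -> Vec<StoreReader> {
    (0..num_tasks).map(|_| reader.fork_cache(1)).collect()
}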

Collaborator:

I proposed some preliminary API additions in #2276. Using these, I think fetching multiple documents should consist of:

  • sorting all documents into groups with the same cache key, i.e. from the same block (sketched after this comment)
  • deciding how many independent block caches should be used and how large those should be
  • having futures pull groups of documents and handle them using their private cache instance
  • spawning these futures onto independent tasks (parallelism, e.g. JoinSet) or polling them concurrently (e.g. FuturesUnordered)
  • ideally, making it possible to reuse those store readers (and spawned tasks) to fetch multiple groups of documents

Obviously there are quite a few trade-offs implied here, the big one being concurrency versus parallelism. Some can possibly be side-stepped, e.g. always using a size-1 block cache for each group and spawning as many futures/tasks as there are groups, which would always avoid conflicts at the cost of more in-memory data structures. Similarly, we could forgo reuse at the same cost of setting up in-memory data structures.
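
The first step might look like the following sketch, where block_key stands in for the hypothetical opaque cache-key lookup from #2276:

use std::collections::HashMap;

use tantivy::{DocAddress, DocId, SegmentOrdinal};

// Bucket addresses per segment and per docstore block so that each
// group can later be handled by one future with its own block cache.
fn group_addresses(
    doc_addresses: impl IntoIterator<Item = DocAddress>,
    block_key: impl Fn(DocAddress) -> usize, // hypothetical cache-key lookup
) -> HashMap<SegmentOrdinal, HashMap<usize, Vec<DocId>>> {
    let mut groups: HashMap<SegmentOrdinal, HashMap<usize, Vec<DocId>>> = HashMap::new();
    for address in doc_addresses {
        groups
            .entry(address.segment_ord)
            .or_default()
            .entry(block_key(address))
            .or_default()
            .push(address.doc_id);
    }
    groups
}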

@adamreichold (Collaborator) commented Dec 11, 2023

Maybe we can side-step the concurrency/parallelism trade-off here by providing an API that just yields the futures and requires the caller to poll them to completion, so a signature like:

async fn docs_async(&self, doc_addresses: impl IntoIterator<Item=DocAddress>) -> impl Iterator<Item=impl Future<Output=crate::Result<Vec<D>>>>;

which internally sets up the per-segment-per-block groups (e.g. sorting them into HashMap<SegmentOrdinal, HashMap<CacheKey, Vec<DocId>>>) with independent size-1 block caches (using e.g. self.store_readers[segment_ord as usize].fork_cache(1)) and yields a future for each group, which the caller can then push into a FuturesUnordered or spawn onto a JoinSet as they see fit.
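
Caller-side usage of that signature might then look like this sketch (docs_async is the proposed API above, not an existing method; the futures crate and the concrete document type are assumptions):

use futures::stream::{FuturesUnordered, StreamExt};
use tantivy::{DocAddress, Searcher, TantivyDocument};

// Poll the per-segment-per-block futures concurrently; spawning them
// onto a tokio::task::JoinSet instead would buy parallelism.
async fn fetch_all(
    searcher: &Searcher,
    addresses: Vec<DocAddress>,
) -> tantivy::Result<Vec<TantivyDocument>> {
    let mut pending: FuturesUnordered<_> = searcher
        .docs_async::<TantivyDocument>(addresses) // proposed API, see above
        .await
        .collect();
    let mut docs = Vec::new();
    while let Some(group) = pending.next().await {
        docs.extend(group?); // each group resolves to Result<Vec<D>>
    }
    Ok(docs)
}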

Collaborator:

Sorry for somewhat hijacking this change, but I pushed an implementation of the above to #2276 which, as discussed, is still a specific trade-off but should avoid cache thrashing while still reusing the existing block cache via StoreReader::get_async, leaving the decision of whether to use concurrency or parallelism to the user code (which adds very little syntactical overhead if it just ends up being FuturesUnordered, as the modified test case shows).
