-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement deletion vector handling in index scan #958
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking fine so far. But definitely seems like we'll want to be benchmarking this. Let me get a PR up for that.
TL;DR:
after picking in #960 the difference of index scan performance pre/post this PR is negligible. Any perf impact introduced is smaller in magnitude than noise. The impact maybe become more pronunciated in a environment where IO latency is high, say, EC2 instance with EBS, which is a remote disk. Since this change introduces more IO, I'd hypothesis that there could be a hit to the query throughput |
is this for scanning or the ANN search? |
|
9c85d52
to
eb32e03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to handle the diskann implementation.
rust/src/index/vector/pq.rs
Outdated
// Precollecting everything to a hash map requires async closure. This way we avoid that | ||
let mut deletion_vec_cache: HashMap<u64, DeletionVector> = HashMap::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need to duplicate this for other indices, should we encapsulate this into it's own struct? I think that also let's us not bring the manifest and object store down into here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this cache is only valid within each look up. Let me add deletion handling in diskANN. What you think about adding session caching for deletion vector in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignore the last comment. it's actually easier to just add a DeletionLookup
in the session than modify two index
rust/src/index/vector/pq.rs
Outdated
if dvec.is_err() { | ||
return Err(Error::Index { | ||
message: format!( | ||
"Failed to read deletion file for fragment {}", | ||
frag_id | ||
), | ||
}); | ||
} | ||
dvec.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic masks the underlying error.
Could we just do:
if dvec.is_err() { | |
return Err(Error::Index { | |
message: format!( | |
"Failed to read deletion file for fragment {}", | |
frag_id | |
), | |
}); | |
} | |
dvec.unwrap() | |
dvec? |
28f66b9
to
f53d98b
Compare
@@ -341,6 +343,74 @@ pub(crate) async fn read_deletion_file( | |||
} | |||
} | |||
|
|||
pub struct LruDeletionVectorStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. In the future, we might have the reads make concurrent queries on is_deleted()
, in which case we might change this to:
- Use Rwlock instead of Mutex (unless we find writer starvation is an issue)
- Make the IO happen separately, to prevent the situation where two concurrent calls are making the same IO call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. I tried to use RWLock
but LruCache
requires mut ref for getting items too (updating the recently used state probably)
I need to find a way to go around that
f53d98b
to
def6183
Compare
closes #916
Implement deletion vector handling in
PQIndex
, where all vector index are stored.TODO: