[Rust] DiskANN search #798
Conversation
Force-pushed from c0ce839 to 28463a5.
Force-pushed from 595b90f to f6b4ffe.
Force-pushed from f564ce3 to 2a0b11e.
I still have quite a bit to learn about how our indexes work, but this mostly seems reasonable. Had a few questions.
  pub fn new(vertices: &[V], data: MatrixView, metric_type: MetricType) -> Self {
      Self {
          nodes: vertices
              .iter()
              .map(|v| Node {
                  vertex: v.clone(),
-                 neighbors: Vec::new(),
+                 neighbors: Arc::new(UInt32Array::from(vec![] as Vec<u32>)),
This seems like a nontrivial amount of allocations, right? Maybe this should be wrapped in an `Option` instead?
Yeah, I can do that. It makes `Graph::neighbors()` harder to work with, though.
Is there a way we can make `Graph::neighbors()` return a `&[f32]` slice?
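For illustration, here is a minimal sketch of what an `Option`-wrapped neighbor list could look like, with a slice-returning accessor so a `Graph::neighbors()`-style API stays easy to use. This is a hypothetical sketch, not the PR's code: the `Node` shape follows the diff above, but the `neighbor_ids` accessor and the arrow call used are assumptions.

```rust
use std::sync::Arc;

use arrow_array::UInt32Array; // arrow-array crate; module path may differ by arrow version

// Hypothetical sketch: allocate the neighbor array lazily so that building the
// node list does not allocate one Arc<UInt32Array> per vertex up front.
struct Node<V> {
    vertex: V,
    neighbors: Option<Arc<UInt32Array>>,
}

impl<V> Node<V> {
    fn new(vertex: V) -> Self {
        Self {
            vertex,
            neighbors: None, // no allocation until neighbors are actually set
        }
    }

    // Returning an empty slice when nothing is set keeps callers simple,
    // even though the field itself is an Option.
    fn neighbor_ids(&self) -> &[u32] {
        match &self.neighbors {
            Some(arr) => arr.values(),
            None => &[],
        }
    }
}
```

Note that the question above mentions `&[f32]`, while this sketch returns `&[u32]`, since the neighbor list in the diff is a `UInt32Array` of row ids.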
@@ -136,20 +155,37 @@ impl<V: Vertex> PersistedGraph<V> {
                  return Ok(vertex.clone());
              }
          }
-         let prefetch_size = self.params.prefetch_byte_size / self.vertex_size + 1;
-         let end = std::cmp::min(self.len(), id as usize + prefetch_size);
+         let end = (id + 1) as usize;
Why did we drop using the prefetch size here? Maybe that's worth a comment?
Here I want the most naive implementation as a baseline and to build up from there.
The prefetch only fetched the row ids ahead of time, and that causes more random I/Os later to read the raw vectors back before filling the cache. I don't have tracing data yet to show that such a prefetch helps. I'll work on optimizations next.
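For context, a rough sketch of the two read windows being compared, reusing the names from the diff (`prefetch_byte_size`, `vertex_size`); the free-standing function and its parameter list are simplified and hypothetical, not the actual `PersistedGraph` code.

```rust
use std::ops::Range;

/// Hypothetical helper: which vertex rows to read when looking up `id`.
fn read_window(
    id: u32,
    total_vertices: usize,
    vertex_size: usize,        // serialized size of one vertex, in bytes
    prefetch_byte_size: usize, // prefetch budget, in bytes
    prefetch: bool,
) -> Range<usize> {
    let start = id as usize;
    if prefetch {
        // Old behavior: extend the window so roughly `prefetch_byte_size` worth
        // of rows is read in one request.
        let prefetch_size = prefetch_byte_size / vertex_size + 1;
        start..std::cmp::min(total_vertices, start + prefetch_size)
    } else {
        // Naive baseline in this PR: read only the requested row.
        start..start + 1
    }
}
```

The tradeoff described above is that the old window only covered the row ids, so the raw vectors still had to be fetched with extra random reads; the single-row baseline makes each lookup one predictable read that can be profiled before reintroducing prefetching.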
  /// Open the Vector index on dataset, specified by the `uuid`.
- pub(crate) async fn open_index<'a>(
-     dataset: &'a Dataset,
+ pub(crate) async fn open_index(
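As an aside on this hunk: dropping the explicit `'a` changes nothing semantically when the return value does not borrow from the inputs, because lifetime elision covers the reference parameter. The hunk does not show the new parameter list or return type, so the sketch below is an invented illustration of that general Rust point, not the actual `open_index` signature.

```rust
use std::sync::Arc;

struct Dataset;
struct VectorIndex;

// Explicit lifetime: legal, but `'a` buys nothing when the return type is owned.
async fn open_explicit<'a>(dataset: &'a Dataset) -> Arc<VectorIndex> {
    let _ = dataset;
    Arc::new(VectorIndex)
}

// Elided form: the compiler infers the same lifetimes, so the meaning is
// unchanged and the signature is easier to read.
async fn open_elided(dataset: &Dataset) -> Arc<VectorIndex> {
    let _ = dataset;
    Arc::new(VectorIndex)
}
```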
Should the docs be updated for the new index?
This index is not ready for public usage yet. After this issue, we still need some profiling and optimization work.
This PR just makes it ready for the team to start benchmarking end to end.