Is ArrayLake a VectorDB? #6

Open
alxmrs opened this issue Jan 14, 2024 · 5 comments

@alxmrs
alxmrs commented Jan 14, 2024

Not that EarthMover would necessarily want to be known as yet another startup within this very competitive space. However, I think it would be cool to see a comparison between other Vector DBs and what a managed Zarr dataset could do. It seems like it would be easy to put a proof of concept together with faiss or annoy.

I think approximate similarity search algorithms could be interesting for scientific use cases (can they provide better lookups than metadata-based search?). Further, I like that ArrayLake + Zarr address the cloud- and state-management-shaped problems while stepping aside so the ML practitioner can choose their preferred tool for similarity search.

@rabernat
Contributor

Thanks for the suggestion Alex! You're correct that it's fairly easy to create a vector search interface on top of Xarray + Zarr. Here's an example: https://gist.github.com/rabernat/40f53bba3a81aeb420e14872388c6fc1

In contrast to most vector DBs on the market today, all of the index building and search happen on the client side; Arraylake doesn't provide any server-side implementation of either. So I'd be hesitant to characterize Arraylake as a VectorDB.
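To make "client-side search" concrete, here is a minimal sketch of a brute-force cosine-similarity lookup. It is not the code from the gist linked above: the function name `top_k_cosine` is hypothetical, and the in-memory NumPy array is an assumption standing in for embeddings that would be loaded from a Zarr store (for example via Xarray).

```python
import numpy as np


def top_k_cosine(embeddings, query, k=3):
    """Return (indices, scores) of the k rows most similar to `query`.

    `embeddings` is an (n, d) array; in practice it would be read from a
    Zarr-backed array, here it is plain NumPy (an assumption of this sketch).
    """
    # Normalise rows and query so a dot product equals cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = emb @ q

    # argpartition selects the k best candidates without a full sort;
    # only those k are then sorted by descending similarity.
    k = min(k, len(scores))
    best = np.argpartition(-scores, k - 1)[:k]
    best = best[np.argsort(-scores[best])]
    return best, scores[best]


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64)).astype("float32")
query = embeddings[42] + 0.01 * rng.normal(size=64).astype("float32")

idx, sims = top_k_cosine(embeddings, query, k=3)
```

Everything here runs in the client process, which is the distinction being drawn: the storage layer only serves array chunks, and the choice of exact search, faiss, annoy, or anything else stays with the user.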

@alxmrs
Author

alxmrs commented Jan 17, 2024 via email

@rabernat
Contributor

😍

Would you like to help turn my gist into a proper Python package? Could be a good project for your sabbatical? 😉

@alxmrs
Author

alxmrs commented Jan 17, 2024 via email

@ljstrnadiii

That would be a cool package. Lots to figure out, like how many chunks to use per (potentially distributed/sharded) index and how we would reduce the results. I have had great success with the ResultHeap class in faiss to "reduce" searches over sharded indexes. I have thought, though, that xarray could be well suited to this type of problem.
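The "reduce" step over sharded indexes can be sketched in plain Python. This is not faiss's own class, only a stand-in for the same idea: each shard returns its local top-k, and the global top-k is selected from the union of those candidates. The function name `merge_shard_results` and the sample distances are hypothetical, and the sketch assumes IDs are already globally unique across shards.

```python
import heapq


def merge_shard_results(shard_results, k):
    """Merge per-shard lists of (distance, id) into the global k nearest.

    Each shard is assumed to have returned its own k best hits, so the
    global k best are guaranteed to be among the combined candidates.
    """
    candidates = (hit for hits in shard_results for hit in hits)
    # nsmallest keeps only k items in memory and returns them sorted
    # by ascending distance.
    return heapq.nsmallest(k, candidates)


# Made-up per-shard results: (distance, global id), smaller is closer.
shard_a = [(0.10, 7), (0.40, 3), (0.90, 12)]
shard_b = [(0.05, 101), (0.35, 150), (0.70, 133)]
shard_c = [(0.20, 205), (0.25, 210), (0.80, 222)]

merged = merge_shard_results([shard_a, shard_b, shard_c], k=3)
print(merged)
```

Because the merge only needs each shard's top-k, the shards can be searched independently (for instance one per group of chunks) and combined afterwards.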

3 participants