Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What's the best way to insert ordered u32s in a RoaringBitmap? #301

Closed
Kerollmops opened this issue Dec 3, 2024 · 7 comments · Fixed by #306
Closed

What's the best way to insert ordered u32s in a RoaringBitmap? #301

Kerollmops opened this issue Dec 3, 2024 · 7 comments · Fixed by #306

Comments

@Kerollmops
Copy link
Member

Kerollmops commented Dec 3, 2024

This is an open question about this library.

We are extensively using RoaringBitmaps at Meilisearch, and we have very special usage of bitmaps mixed with bit-packing. Bitpacking is primarily used to compress ordered lists of u32s and consume less memory. But when merging multiple bit-packed lists, we retrieve a bunch of ordered (and unique) u32s that we want to insert in roaring bitmaps efficiently.

The following lines show the way we union the ordered u32s into an existing bitmap:

while let Some(block) = iter.next_block() {
    let iter = block.iter().copied();
    let block = RoaringBitmap::from_sorted_iter(iter).unwrap();
    *del |= block;
}

What could be the best and most efficient way to insert a bunch of ordered u32s into an existing bitmap?
I wonder if #288 is related to this and if we should just convert this u32s list into a bitmap.

@lemolatoon
Copy link
Contributor

The implementation of RoaringBitmap::from_sorted_iter seems that it continuously call push_unchecked of each container, store. In case BitmapStore, it will write a bit for each call of push_unchecked. It has the overhead of bitmask and bitshifting. In case ArrayStore, just push to the Vec<u16> their is almost no overheads (except the gradual allocation of vec). If all the stores are bitmap, converting the indices into bitmask in advance would be good for the performance. If they have some ArrayStore, we have another overheads indicies -> bitmask -> indicies.

The solution for this might be calling BitmapStore::from_lsb0_bytes within RoaringBitmap::from_sorted_iter, I guess.

@Kerollmops
Copy link
Member Author

Hey @lemolatoon 👋

The solution for this might be calling BitmapStore::from_lsb0_bytes within RoaringBitmap::from_sorted_iter, I guess.

Thank you very much for your PR again. I'll also ask the help of Daniel Lemire on that just to see what he thinks is the best solution for that before diving into the code. I am still wondering if it doesn't cost too much to convert integers to bitmaps to a union just after. Isn't there a more direct way?

Have a nice day 🦃

@Kerollmops
Copy link
Member Author

Hey @lemire 👋

I'm curious if you can assist me with this topic. I came across the Bitmap::lazy_batch method in the croaring wrapper, allowing multiple operations on bitmaps efficiently without converting containers to arrays. However, this method is most effective when working with pre-existing bitmaps. In my scenario, I have ordered slices of u32 data, aligned in memory, and typically sized at 32/128/256.

Implementing a lazy batch system to insert these ordered u32 blocks into a single roaring bitmap would be beneficial. Do you have insights on the optimal approach for creating bitmaps from sorted u32s?

Have a nice day 🌮

@lemire
Copy link
Member

lemire commented Dec 9, 2024

As what is best, it is not possible to answer this question in the abstract. If you have a specific workload, you can tell what works best, but if you just tell me that you have a sorted array of integers... it is not possible to know very much.

@Kerollmops
Copy link
Member Author

Kerollmops commented Dec 9, 2024

Thank you very much @lemire. I'll take a look at these functions and will probably take inspiration from that to implement the same in this library.

About the workload. In the new Meilisearch indexer we decided to compress integers with bit packing (with the bit packing library, inspired by your work) to reduce the memory usage of the engine when indexing and sequentially, in parallel, union multiple bit packed (using Bbbul) ordered sets of integers into roaring bitmaps (original issue comment) to finally serialize them to disk.

@lemire
Copy link
Member

lemire commented Dec 9, 2024

with the bit packing library,

That's interesting. (Update: I knew about it at some point in the past.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants