-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement a B-tree #4992
Comments
Hi, I'd like to give this an attempt, is this going to be treemap.rs reimplemented as a B-tree? |
@duckmatt: yeah, it would just expose the same interface as treemap it could start off as another module and eventually replaced the treemap module |
Just to give an update, I haven't disappeared. I'm hoping to get the btree fully functional and uploaded to my github this weekend or next, then the week or so after adapt to the treemap interface, been slow progress as I'm learning the language at the same time :p |
I tried to get a btree working with owned boxes for the nodes, but found it can't work as I need to maintain a list of mutable nodes whilst walking down the tree. Seems the way around this is to either restrict the tree to use an even order, which I guess wouldn't be a problem if the order is always going to be fixed, or use reference counting for the nodes, which i assume will make it a bit less efficient. What would you go for between these? |
@duckmatt the general trend for containers in the standard library has been to avoid |
I'm working on a b-tree implementation as well. @duckmatt Perhaps we can combine our efforts and work together on the implementation? My implementation is nowhere near complete. At the moment, the insert method does split the node when the node is full. However, the node is not properly inserted into its parent node. See I think I chose a proper data structure layout to start with:
At the moment, Comments are more than welcome. I started this b-tree implementation as an exercise to learn Rust, but I think we could turn our efforts into a usable b-tree implementation suitable for the Rust standard library. Source: https://github.com/smvv/rust-btree/blob/master/btree.rs @duckmatt Btw, my nickname is |
@smvv Looks good, that's pretty much where I got to so I figure I'll just make contributions to your code from now on and we can try to get that working. Inserting into a parent is where I found the problem to be, because if the parent is full too then you need to split the parent, then insert into the parent of the parent, and so on until you reach the root or a node that's not full. So to implement this it seems like you need to build a list of mutable nodes as you walk down the tree, you'll add to this every time you come across a full node and clear it each time you come across a not full node. Problem is with this approach you then end up with two mutable reference to the same owned box. The alternative is to either use some interesting flow of data with recursion if it's possible, or to do the splits while walking down the tree (which is only possible with an even order, but I see you're asserting for that already anyway, so this may be the way to go) |
TreeMap uses a recursive approach: https://github.com/mozilla/rust/blob/master/src/libextra/treemap.rs#L589 I think a top-down algorithm would be useful indeed. |
The b-tree successfully inserts 100,000 randomly generated uints, and checks if the key is in the tree at the end. Inserting 100,000 uints followed by 100,000 lookups takes 50 ms on an intel i7 950 at 3.07 GHz. Without the lookups, inserting 100,000 uints takes 40ms. At the moment, removing keys is not implemented, and the b-tree will fail if a duplicated key is inserted (see below). It is important to note that the insert() method does not return true / false for a new / old key. The reason is that the top-down algorithm cannot determine if a key does already exist. If a key is stored in a leaf, the top-down algorithm will discover the uniqueness once it's at the leaf node. However, in order to insert something in a leaf, the top-down algorithm has to split full nodes before it is at the leaf. The question is whether it is useful to return true for a new key that is not inserted before. If it is important, insertion has to perform a find() prior to the splitting. How about defining two methods: one that performs a find() operation for uniqueness prior to insertion, and one for inserting only unique values? @thestinger @duckmatt What is your opinion about this? I chose 20 as minimal degree for the b-tree. It gave a good result for inserting 100,000 and 1,000,000 random uints. I compared it with 40 and 50, but insertion took longer (plus ~10ms for 100,000) for those values. I'm not sure what the optimal value will be. This top-down algorithm is used: http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap19.htm Source: https://github.com/smvv/rust-btree/blob/master/btree.rs |
@smvv What I was doing in my implementation was the search to find where in the current node the position of the child to walk down to was before splitting, I did this with a binary search that returned an enum of either Match(position) or NoMatch(position), position = closest key if it's not a match, if it's a match then I knew it already existed. The thing is, the way I see it you would have to do this search anyway, as you would need to know where in the split nodes to walk down to/insert the new key/value into Here's the bsearch I made if it's of any use: https://github.com/duckmatt/rustbinarysearch/blob/master/bsearch.rs |
I'll give it a try, because I think that would be possible indeed. While walking down, you can find the key in a internal node or a leaf. There should be no need for a second find operation. Although, it will cause a split operation on a node when the node is full, and the key exists already in the b-tree. This is not really an issue because the properties of the b-tree still hold. Your comment gave me another idea: The b-tree uses binary search for finding a leaf, but it does a linear search on the keys in a node. As an experiment, I used binary search on the keys in a node to see if binary search is faster than linear search. The idea behind the experiment is that the run-time increases when the number of keys per node increases. Unfortunately, using binary search on the keys did not reduce the run-time. I tried larger number of keys as well (# of keys I tried is {20, 40, 50}). You can find the disabled binary search here: https://github.com/smvv/rust-btree/blob/master/btree.rs#L142 And important to note is: Using btree.rs as a library does not work: see #8044. |
@smvv: You could try using a galloping search instead. You search by stepping forwards by N where N doubles every time (accessing the indices 0, 1, 3, 7, 15, ...) until you overshoot, and then do a binary or linear search over the range from the last step. |
Also, a linear growth pattern might end up being faster (accessing 0, 1, 3, 6, 10, 15) since the fanout size is a known, small constant. |
@smvv Your structure doesn't seem right to me at a glance. A B-Tree typically stores both values and children in internal nodes. A B+ Tree stores either values or children but requires copying keys and you have claimed that you are not doing so. I haven't analyzed this in great detail so it may or may not work and may or may not have good properties but it doesn't seem to be textbook. I'd also like to give this a try and have a very different representation in mind. Would anyone object if I gave this a try in parallel? I don't want to steal anyone's thunder. The representation I have in mind is here. Nice things about this are:
The main cons I see are:
|
@davidhalperin The problem I see with the b+tree here is that you're going to have two references to an owned box, the right pointer will be pointing to a node that's already owned by a vector. Or are those nodes in the vectors not owned boxes? I'm getting a feeling that's where I went wrong |
@duckmatt You could use a borrowed pointer with a lifetime but then it wouldn't be sendable or you could use some unsafe code. I think DList uses unsafe pointers for backpointers so it'll be sendable. I'm not sure what we want is a B+ Tree, though. I suspect @thestinger was inspired to open this issue by Google's C++ btree implementation which is faster than the normal C++ std::map. Their implementation stores an array of key value pairs in every node. Other that Google's implementation, I'm not aware of B-Trees being used for in memory data structures so I don't know if there's any data on whether a B+ Tree would be advantageous for this. If you have any please correct me. Either way, I think it would be better to start out with something simple, get it working, then try to optimize the representation after we have some tests and benchmarks in place as a baseline. |
If you draw the b-tree with the current implementation, you'll get something like this:
http://www.asciiflow.com/#Draw7058515904656888666/126940731 Perhaps this is not a textbook b-tree, but it does not require copying the keys. If the key is not present, the key is stored in its parent. If the direct parent does not have the key, the key is stored in the parent of the parent. This is a 2-3-4 b-tree (BTREE_MIN_DEGREE = 2). If I understand generics and Rust's enums correctly, a Leaf points directly to its value in the current implementation. There's no pointer between the node's pointer and the actual value. Thus |
I'm taking a look at this issue right now, sort of starting from scratch, but keeping in mind everything that has been posted so far with hopes of having a skeleton implementation to present before too long. I am reading carefully over past contributions to this issue and trying to take everything into account. I'm still a bit of a Rust beginner so any advice would be appreciated! Perhaps this is an ambitious first project but I am excited to see what I come up with. I feel fairly confident with data structures, so this is a place where I felt comfortable starting. |
@niftynif Great! Honestly, I suspect starting from scratch would be a faster way to get started than trying to revive the existing code (of course, reading over it doesn't hurt). Rust has changed somewhat in the past three months. I don't mean to slight the people who worked on this issue before; it's a pretty normal thing for somebody to start fresh when another contributor isn't able to land their patch for any reason. Drop by IRC if you have questions! |
@jruderman Not yet. While @niftynif will hopefully be working on it more, #10303 represents work in progress. |
@niftynif You might have a look at https://github.com/berleon/libtrees . I working on implementing a Lehmann and Yao concurrent B-Tree. As my plan is a disk based implementation, I have to consider locking and reading/writing nodes from/to hard disk. This all adds a lot of overhead. If you have any questions or suggestions, feel free to talk to me. |
I have been working on a btree in rust independently of @niftynif and I am wondering if it could be of some help with this issue. In design it is much closer to @smvv implementation then to the design that is being implemented by niftynif. That is that leaf nodes are the only nodes that hold data, and branch nodes are key and pointer data only. It's much more complete then either, as it implements all of MutableMap and Iterator interface. https://github.com/csherratt/cow-rs/blob/master/src/cow/btree.rs There is a few problem areas, mostly around the fixed sized arrays inside of the nodes which in the current state of rust is difficult to avoid (std::mem::sizeof would need to be compile time, not run time evaluable). |
Great, it's really good to know that others are working on this issue currently. I definitely am not as far along as you are, as you can see from the work I've completed so far. It looks like you're implementing a B+-tree rather than the typical B-tree paradigm, which is pretty cool. |
Well, a B+tree requires next pointers which my implementation lacks. There are some advantages to a real B-tree over what I have done. The largest being that you should not have to clone any of the keys which is a real improvement if the keys are a string or any other allocated data structure. I'm interested to see how performance would compare. In a b+tree the internal nodes are more compact since they lack data values. This should make it faster to search for trees that do not fit in the L1 cache. On the other hand, having data in the internal nodes means that a b-tree will have many opportunities for early outs. Comparing search performance of my btree to a binary tree like treemap I found that treemap was faster up until the tree size grew to about 1000 keys. |
Need not block 1.0. Assigning P-low. |
is this solved? we have std::collections::btree:BTree in tree. |
It's not actually in a useful state yet. |
cc @gankro |
We now have a decent BTreeMap! It should absolutely be optimized and cleaned up more, but it's been designed with that future work in mind. |
Rustup to rust-lang#67853 Specifically caused by rust-lang#67786 changelog: none
A B-tree stores sorted arrays of a chosen size instead of lone nodes, so it can be much more cache-friendly than a red-black tree when used with small key/value types. The size will just have to be hardcoded for now since it can't be part of parametrizing it.
The text was updated successfully, but these errors were encountered: