lmdb: Support integer flags (MDB_INTEGERKEY/MDB_INTEGERDUP) #11
Comments
Comment imported from my #1 regarding design considerations:
One solution may be to expose an interface and one(?) variant of *.Put that uses interface values.
Questions:
Edit: In addition to the Put variants, a Get variant must also be included. Edit: In addition, there could potentially be UintMulti and UintptrMulti functions for batching integer values, but they would require being copied into …
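A sketch of the shape that proposal could take (the names Value, Uint, Uintptr, and PutValue appear in the benchmarks below; the signatures here are assumptions, not the final API):

```go
package lmdb

import "unsafe"

// Sketch only: a Value abstracts anything that can expose its memory
// to LMDB, letting integer wrappers avoid an intermediate []byte.
type Value interface {
	Bytes() []byte // the raw memory LMDB should see
}

type uintValue uint

// Bytes returns the native in-memory representation of the uint,
// which is the format a DBI opened with MDB_INTEGERKEY expects.
func (v *uintValue) Bytes() []byte {
	return (*[unsafe.Sizeof(uint(0))]byte)(unsafe.Pointer(v))[:]
}

// Uint wraps a native unsigned int as a Value.  A single PutValue
// variant, e.g. Txn.PutValue(dbi, key, val Value, flags uint), would
// accept it (hypothetical signature).
func Uint(x uint) Value { v := uintValue(x); return &v }
```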
My branch, bmatsuo/value-interfaces, experiments with the interface proposed above. However, the results are not encouraging. Writing byte slices containing big-endian encoded uint64 values seems to be as good as or better than writing size_t values. Writing unsigned int values was consistently faster, but the difference in performance is close to negligible (< 200ns per Put operation, ~17%). When you go on to consider that Put operation time is small compared to txn commit time, the benefits are not clear. The following table shows the speed of different Put operations: PutValue with Uint (u), PutValue with Uintptr (z), and Put with []byte (b). The uppercase benchmarks U, Z, and B are "optimized" versions of their lowercase counterparts which do not allocate in excess.
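For reference, a minimal sketch of the big-endian []byte baseline that those wrappers are being compared against (standard lmdb-go API, no experimental branch needed):

```go
package sketch

import (
	"encoding/binary"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

// putUint64BE stores a uint64 key as 8 big-endian bytes.  Big-endian
// order makes LMDB's default lexicographic comparison agree with
// numeric order, so no MDB_INTEGERKEY flag is needed and the encoding
// is portable across architectures.
func putUint64BE(txn *lmdb.Txn, dbi lmdb.DBI, key uint64, val []byte) error {
	var k [8]byte
	binary.BigEndian.PutUint64(k[:], key)
	return txn.Put(dbi, k[:], val, 0)
}
```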
Hi Bryan, it seems to me that you are mostly considering write speed for this issue. I have a use case that I think could benefit from INTEGERKEY on the read side. Specifically, I want to store update logs for transactions in LMDB itself. It seems like a DBI with INTEGERKEY and APPEND when doing the put would give me fast writes (only one comparison with APPEND anyway) and faster scans when reading. Re portability of LMDB: did you ever look into that?
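A minimal sketch of that append-only log pattern with today's []byte API (lmdb.IntegerKey and lmdb.Append are existing lmdb-go flags; the native-order key bytes are the non-portable, "unsafe" part this issue is about):

```go
package sketch

import (
	"unsafe"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

// appendLog writes entries under monotonically increasing integer
// keys.  With lmdb.Append, LMDB only verifies ordering rather than
// searching the tree for each put.
func appendLog(env *lmdb.Env, entries [][]byte) error {
	return env.Update(func(txn *lmdb.Txn) error {
		// MDB_INTEGERKEY expects keys to be native unsigned ints.
		dbi, err := txn.OpenDBI("log", lmdb.Create|lmdb.IntegerKey)
		if err != nil {
			return err
		}
		cur, err := txn.OpenCursor(dbi)
		if err != nil {
			return err
		}
		defer cur.Close()
		for i, e := range entries {
			k := uint(i)
			// Reinterpret the uint's memory as bytes: this is the
			// architecture-dependent step.
			kb := (*[unsafe.Sizeof(k)]byte)(unsafe.Pointer(&k))[:]
			if err := cur.Put(kb, e, lmdb.Append); err != nil {
				return err
			}
		}
		return nil
	})
}
```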
@lmb Thanks for the input. Honestly, I hadn't really considered the benefits for a read-heavy workload. Do you have a benchmark program that I could look at and maybe use to help validate the performance? A C program would be OK if that is what you have. Can you tell me about the read access pattern you had in mind? I suppose the performance benefits of the optimization would be greatest with heavy random access. Large range scans may benefit too, but it seems like there are fewer total key comparisons per read in that case, unless I'm mistaken. As far as portability goes, I haven't thought too much about it. I don't know the internals very well. It would be possible to write code that will compile on all supported architectures without os/arch dependent code, so portability in that sense would not be a problem. But in terms of shipping code/data to different machines, I'm not sure. If the libc implementation is the same then I think it should be OK. =\ Edit: I don't think shipping the data between different os/arch combinations is safe. I think you could only ship data between two linux/amd64 machines, or two linux/arm machines.
I don't have code unfortunately; I'm still thinking about the whole issue. I found this from hyc: https://gist.github.com/hyc/9810a14488c71e58038c (note that INTEGERKEY has only 31 of N keys found; not sure what's going on). It seems like there is a nice speed-up for the 10M-key case (~23 comparisons per insertion/lookup, since log2(10^7) ≈ 23.3). I was thinking that maybe it's possible to ship between amd64 and arm64, but I haven't gotten around to testing that. If it works it's a hack and relies on struct alignment... (Maybe this should be a separate issue?)
Put me on the list for wanting integer keys, specifically for a read-heavy workload that often includes range scans. Portability of the database file is not a concern for me.
Thanks for the input, @mranney. Indeed, portability is not a show-stopper for this issue. I have gotten the branch with this experimentation, bmatsuo/value-interfaces, working again. The tests and benchmarks are passing. Please go ahead and try it out if there is a particular dataset or access pattern you would like to benchmark. I will work on putting together some read-heavy benchmarks of my own. The relevant bits are the new Value type and related functions for that type (including …). Edit: I also just added the Txn.GetValue() and Cursor.GetValue() functions, which allow you to perform reads using integers.
@lmb I'm a little confused about what I'm seeing benchmarked.
I realized that the experimental raft backend I have worked on uses a database with uint64 keys. On a 32-bit machine LMDB cannot store a 64-bit integer key/value (so a naive implementation isn't portable, in fact). But I figured I would take a shot at switching it to lmdb.IntegerKey to see how performance changed. In my measurements the numbers really didn't change, which is disappointing. Note that in order to get things to work you need to merge the value-interfaces branch with a tentative fix I have, from #57.
I have added some benchmarks for Txn.GetValue and they look somewhat promising (ignoring the above for now). You see essentially the same gains in wall-time duration as in the Txn.PutValue benchmarks. But reads don't have as much overhead for initialization and termination, so the overall benefit should appear greater. It is just that (now remembering the above table) the benefits are not visible. So I am still on the fence about how much real benefit a Go program can see from this change. I will try to write a few more benchmarks. But I am worried that everything gets washed out by overhead in the library and in cgo (which is increased now with the built-in argument checking). But I have seen the pattern employed by my raft-mdb fork elsewhere for db interface implementations. It is a potentially read-heavy application, but each read is executed in a different read transaction (hyperbolic but true in seemingly many cases). I think this pattern may be responsible for washing out any performance change.
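To illustrate the alternative to that transaction-per-read pattern, a sketch of amortizing the overhead by reusing a single read-only transaction (Txn.Reset and Txn.Renew are existing lmdb-go methods):

```go
package sketch

import (
	"runtime"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

// readMany performs many point reads inside one read-only
// transaction, instead of paying Begin/Abort overhead per read.
func readMany(env *lmdb.Env, dbi lmdb.DBI, keys [][]byte) error {
	// LMDB transactions are tied to OS threads.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	txn, err := env.BeginTxn(nil, lmdb.Readonly)
	if err != nil {
		return err
	}
	defer txn.Abort()

	for _, k := range keys {
		if _, err := txn.Get(dbi, k); err != nil && !lmdb.IsNotFound(err) {
			return err
		}
		// Between batches, Reset+Renew releases the read snapshot
		// without reallocating the transaction:
		//   txn.Reset(); err = txn.Renew()
	}
	return nil
}
```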
I've become aware of a development that is related to all of this: golang/go#6907. It looks like it will be possible to get a pointer to the raw bytes backing a string in go1.7. I remarked on the inability to do this earlier in the thread. The ability to push strings into LMDB would be pretty big (particularly for keys; PutReserve essentially solved the problem for string values already). Along with the optimizations that @lmb has been kicking ass helping out with, I think the interface approach here is looking more attractive.
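For anyone following along, a sketch of how Txn.PutReserve already avoids the []byte(s) copy-then-allocate dance for string values (LMDB reserves the space and the string is copied into it directly); keys have no such workaround, which is why the raw-string-bytes development above matters:

```go
package sketch

import "github.com/bmatsuo/lmdb-go/lmdb"

// putString writes a string value without first converting it to a
// []byte: PutReserve returns a writable buffer inside the database,
// and copy() moves the string bytes straight into it.
func putString(txn *lmdb.Txn, dbi lmdb.DBI, key []byte, s string) error {
	buf, err := txn.PutReserve(dbi, key, len(s), 0)
	if err != nil {
		return err
	}
	copy(buf, s)
	return nil
}
```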
The issue I mentioned before, golang/go#6907, didn't make it into go1.7, and by the look of things I have doubts that it will make the go1.8 release. However, go1.8beta1 was just released, and significant improvements have been made to cgo (https://beta.golang.org/doc/go1.8#performance). The reduced overhead may make this feature finally worth implementing/maintaining. My initial benchmarks from the branch bmatsuo/value-interface are promising. The benefit to reads seems the most significant, and the speedup of uint values over bytes is most clear. I will continue to monitor the situation as the go1.8 release cycle continues. The following compares the bmatsuo/value-interface benchmarks on go1.7.4 vs go1.8beta1.
I have come very far along in my implementation of this enhancement. My current implementation shows the consistent performance improvements you would expect from this optimization. I probably won't get the code merged until after go1.8 has been released, but that will probably be good because of the cgo optimizations it will contain. The GitHub branch bmatsuo/values-interfaces is up to date if anyone wants to see it.
@mranney @lmb I am planning on merging support for unsigned integer values this weekend (approx. 2017-02-19). If you wanted to take a look at #99 and give any feedback I'd appreciate it (obviously, I don't mind if you don't want to or don't have time). I just wanted to give fair warning since this has been in the works for a while.
Sorry for the late reply. Do you have a high-level description of what the new API would look like? It's quite fiddly to go through all the commits.
Forgive me, @lmb. The proposed API has indeed changed significantly since its first inception, and the branch carries that lineage with it. The change is also rather significant in size, which adds to its unwieldiness. I wish there were an easy way to generate a diff at the godoc level; that would make illustrating the changes somewhat easier. There is a high-level example that basically gives an overview of the new API features. Perhaps that would be one of the most useful things to look at. https://github.com/bmatsuo/lmdb-go/pull/99/files#diff-61105a4fb9b9f0a342528e599475968fR165

To support IntegerKey and IntegerDup I defined two types, CUint and CSizet, which are implemented as byte arrays. The byte-array implementation means it's easy to construct slices which can be passed to Put, Get, etc. Because the normal Get and Put methods are still used, I had to define functions to extract unsigned integers from their return values. To do this I added a set of functions, ValueTT, which are discussed in the main godoc: https://github.com/bmatsuo/lmdb-go/pull/99/files#diff-afbc8465ec7e9f5f191d2ac1ca79f551R84

I also added two minor types, MultiCUint and MultiCSizet, which allow convenient use of unsigned integers with the Cursor.PutMultiple method.
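A self-contained sketch of the byte-array idea described there (the names mirror #99; the 4-byte size and little-endian order below are simplifying assumptions standing in for cgo's C.uint and native byte order):

```go
package sketch

import "encoding/binary"

// CUint mimics the #99 design: an integer key represented as a byte
// array.  The real type sizes itself from C.uint; 4 bytes is assumed
// here for brevity.
type CUint [4]byte

// NewCUint encodes x in the byte order MDB_INTEGERKEY expects; the
// real code uses the machine's native order, assumed little-endian
// in this sketch.
func NewCUint(x uint) CUint {
	var k CUint
	binary.LittleEndian.PutUint32(k[:], uint32(x))
	return k
}

// Because CUint is a byte array, slicing it feeds the unchanged API:
//
//	k := NewCUint(42)
//	err := txn.Put(dbi, k[:], val, 0)
```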
I would like to attempt support for MDB_INTEGERKEY and MDB_INTEGERDUP (see mdb_dbi_open). Any user attempting to benefit from them would be in tricky "unsafe" territory without more direct support. But the two flags actually open up a wide variety of database schemata (and query structures). The API extension to support the flags needs to be thought out carefully.
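As one example of the schemata these flags enable, a sketch of a duplicate-sorted integer database (a posting-list layout; the flag constants already exist in lmdb-go, and the database name is illustrative):

```go
package sketch

import "github.com/bmatsuo/lmdb-go/lmdb"

// openPostings opens a DBI where each key maps to a sorted set of
// fixed-size native integers, the layout MDB_INTEGERDUP is designed
// for (it is used together with MDB_DUPSORT and MDB_DUPFIXED).
func openPostings(txn *lmdb.Txn) (lmdb.DBI, error) {
	return txn.OpenDBI("postings", lmdb.Create|lmdb.DupSort|lmdb.DupFixed|lmdb.IntegerDup)
}
```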