more efficient state storage #254
Testnet-1.13.2 was killed by the OOM-killer (Out Of Memory) after about 3000 pixel-painting messages. The last kernel checkpoint was 218_878_899 bytes long, and the mailbox state was 10_107_407 bytes long. Node.js defaults to a 1.5GB heap, and the process was running on a host with 2GB of RAM. The last mailbox took 2.5s to serialize, and the kernel+mailbox checkpoint took 10s to write to disk. Unfortunately the kernel checkpoint was truncated by a non-atomic write-file routine, so we don't know exactly how long the transcript was, but I expect it was some small multiple of Dean's 3000 messages. I don't know why the mailbox was so big: Dean's client should have been ACKing everything, so I have to imagine there was another client configured at the beginning which then disconnected, and the mailbox was accumulating gallery-state updates sent to the unresponsive client. The new pubsub technique (#64) should fix that.
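For reference, the usual way to avoid that kind of truncation is to write to a temporary file and atomically rename it over the old one. A minimal sketch in Node.js (the filenames are just illustrative):

```js
const fs = require('fs');
const path = require('path');

// Write `data` to `filePath` atomically: a crash mid-write leaves the
// old file intact, because rename() within a filesystem is atomic.
function atomicWriteFileSync(filePath, data) {
  const tmpPath = `${filePath}.tmp.${process.pid}`;
  const fd = fs.openSync(tmpPath, 'w');
  try {
    fs.writeSync(fd, data);
    fs.fsyncSync(fd); // make sure the bytes hit the disk before the rename
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmpPath, filePath);
}

atomicWriteFileSync(path.join('.', 'swingstate.json'), '{"kernel": "..."}');
```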
Capnproto is kind of aimed at parsing data out of a serialized buffer with minimal copies, which isn't easy to take advantage of in something like Javascript. It seems better suited for communication than for state persistence. I proposed using SQLite for storage: it has strong transaction semantics, a schema, and stores everything in a single statefile. @dtribble pushed back against the overhead of creating string-ish SQL statements. There are several overall performance improvements to (somehow) implement.
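For what it's worth, the string-construction overhead can mostly be paid once with prepared statements. A sketch using the `better-sqlite3` package (my assumption, not something the thread settled on; the `kv` table layout is also just for illustration):

```js
const Database = require('better-sqlite3');

const db = new Database('swing-state.sqlite');
db.exec('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)');

// The SQL string is parsed once; each call afterwards just binds parameters.
const put = db.prepare('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)');
const get = db.prepare('SELECT value FROM kv WHERE key = ?');

// Wrap a whole block's worth of writes in one transaction.
const commitBlock = db.transaction((pairs) => {
  for (const [key, value] of pairs) put.run(key, value);
});

commitBlock([['kernel.checkpoint', '...'], ['mailbox.state', '...']]);
console.log(get.get('mailbox.state').value);
```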
I suggest having a look at https://github.com/Level/level. Note especially that the storage-like interface is optimisable with an atomic `batch()`.
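A minimal sketch of what that batch API looks like (assuming the `level` package with its promise-style API; the exact constructor varies by version, and the keys here are illustrative):

```js
const level = require('level');

const db = level('./swing-state');

// A batch is applied atomically: either every operation in it is
// durably written, or none of them are.
async function commitDeltas(deltas) {
  const ops = deltas.map(([key, value]) =>
    value === undefined ? { type: 'del', key } : { type: 'put', key, value });
  await db.batch(ops);
}

commitDeltas([
  ['kernel.run-queue', '[...]'],
  ['vat.v1.transcript.3', '{"d": "..."}'],
  ['mailbox.oldAck', undefined], // deletion
]).then(() => console.log('committed'));
```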
I don't know if this is correct but I came across it today and it seems applicable: https://twitter.com/el33th4xor/status/1164215472784584709?s=20 Emin Gun Sirer says that "LevelDB does at most a few hundred tps, sustained."
I'm not sure how Emin is counting things, but it looks like people are pushing back: https://twitter.com/DiarioBitcoin/status/1164259484505559040?s=20 Emin says: "Indeed, there's a difference between DB benchmarks and crypto benchmarks. The latter exhibit poor locality." Tony Arcieri has some recommendations here: https://twitter.com/bascule/status/1164319302515691520
For the way SwingSet manages state, I think we only need one transaction per block (and blocks are created about once every 5 seconds). We shouldn't be committing anything smaller than a block at a time. Reads are another question, but I imagine it's easier to get higher throughput on reads than on writes.
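A sketch of that commit discipline, assuming a simple in-memory buffer flushed once per block (the class and method names are hypothetical):

```js
// Hypothetical block-scoped write buffer: individual kernel writes go to
// an in-memory map, and only endBlock() touches the database, so the
// store sees exactly one transaction per ~5-second block.
class BlockBuffer {
  constructor(db) {
    this.db = db; // e.g. a `level` instance, as sketched above
    this.pending = new Map();
  }
  set(key, value) { this.pending.set(key, value); }
  delete(key) { this.pending.set(key, undefined); }
  async endBlock() {
    const ops = [...this.pending].map(([key, value]) =>
      value === undefined ? { type: 'del', key } : { type: 'put', key, value });
    await this.db.batch(ops); // one atomic commit per block
    this.pending.clear();
  }
}
```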
Dean and I came up with a plan.
David Schwartz talked a little about XRPLedger's DB usage. They tried RocksDB first, and it seems to be a big improvement over LevelDB. (LevelDB was really client-focused, and we need something more server-focused.) But they needed more because of scale, and so developed a DB optimized for their ledger use case: https://github.com/vinniefalco/NuDB. Some discussion about it here: https://www.reddit.com/r/cpp/comments/60px64/nudb_100_released_a_keyvalue_database_for_ssds/df8i6rk/?context=8&depth=9 Given that RocksDB has transactions and a few other relevant features, it's worth looking at it first.
RocksDB has a feature that appears very relevant. Support for optimistic transactions, and both sync and async modes in the Node integration (https://github.com/dberesford/rocksdb-node), also seem relevant.
I think I'm going to start with LevelDB, since the wrapper package (`level`) is readily available. I don't think we need super-duper performance or those extended features, at least for now. We'll have one transaction every 5 seconds (one per block). But I'm happy to revisit this once we've written some performance benchmarks and have some concrete data.
After Agoric/cosmic-swingset#109, the next lowest-hanging fruit is probably to special-case vat transcripts. We need to read them at startup, but after that we only ever append to them, so they don't need to be rewritten on every checkpoint. But going all the way to a proper synchronous DB would be better.
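A sketch of that special case, under the assumption that transcripts are append-only after startup (the file layout and names are hypothetical):

```js
const fs = require('fs');

// Hypothetical transcript store: read the whole log once at startup,
// then append one line per delivery instead of rewriting the file.
class TranscriptFile {
  constructor(path) {
    this.path = path;
    this.fd = fs.openSync(path, 'a+');
  }
  readAll() {
    const text = fs.readFileSync(this.path, 'utf8');
    return text.split('\n').filter((l) => l.length > 0).map(JSON.parse);
  }
  append(entry) {
    fs.writeSync(this.fd, JSON.stringify(entry) + '\n');
  }
  commit() {
    fs.fsyncSync(this.fd); // called once per block, not per entry
  }
}

const t = new TranscriptFile('./vat-v1.transcript');
t.append({ deliveryNum: t.readAll().length, d: ['message', '...'] });
t.commit();
```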
In the old repo, this was cosmic-swingset issue 65.
Dean managed to kill a local testnet with a 165MB `swingstate.json`, which caused a V8 out-of-memory error during `JSON.parse` (provoked by a loop that kept modifying pixels every 10 seconds for about 3000 iterations). There are three things we should do to improve this.

I don't know that CBOR is the way to go, but we're also looking for a serialization scheme for the message arguments that can accommodate binary data (hashes, etc) without the goofy "base64 all the things" that a lot of projects use as a fallback. Given our community, CapnProto or Protobufs is plausible, but we rarely have schemas for the data, so we'll need their self-describing modes, and I don't know how well they handle that style.
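For a sense of what a binary-friendly encoding buys, here is a sketch using the `cbor` npm package (my choice for illustration; the issue doesn't commit to a library), showing binary data surviving a round-trip without base64:

```js
const cbor = require('cbor');

// A message argument containing raw binary (e.g. a 32-byte hash).
const args = { method: 'verify', hash: Buffer.alloc(32, 7) };

// CBOR carries the bytes natively; JSON would force base64 (or worse).
const encoded = cbor.encode(args);
const decoded = cbor.decodeFirstSync(encoded);

console.log(Buffer.isBuffer(decoded.hash)); // true: bytes come back as bytes
console.log(encoded.length);                // compact binary frame
console.log(JSON.stringify(args).length);   // larger, and loses Buffer-ness
```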
Another issue is that writing out the whole state at any point seems like a large operation, and it'd be nice to do something more incremental. We originally tried storing the state in cosmos-sdk "keepers", which nominally made smaller changes, but the overall efficiency hit of recording lots of little internal changes (hundreds per turn) cratered the performance (fixed in https://github.com/Agoric/SwingSet/issues/94 by switching back to one big `JSON.stringify` per big-turn).
Once our `kernelKeeper`/`vatKeeper` data formats settle down a bit (https://github.com/Agoric/SwingSet/issues/88 is rewriting them), we might be able to do something more sophisticated: record deltas in a JS array as well as making changes to a big table, and then save-state becomes "write the deltas to a separate file" until we've written enough of them to warrant rewriting the main file instead. Assuming we don't mess up the consistency of the two representations, that should allow the incremental writes to be pretty small.
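A minimal sketch of that delta-log scheme, assuming a `Map` as the "big table" and JSON-lines files (the names and the compaction threshold are hypothetical):

```js
const fs = require('fs');

// Hypothetical delta-log state store: every mutation updates the
// in-memory table AND is recorded in a delta array. Saving state
// appends just the deltas; once the log gets long enough, we compact
// by rewriting the main file and truncating the log.
class DeltaStore {
  constructor(mainPath, deltaPath, compactAfter = 1000) {
    this.mainPath = mainPath;
    this.deltaPath = deltaPath;
    this.compactAfter = compactAfter;
    this.table = new Map();
    this.deltas = [];
    this.logLength = 0;
  }
  set(key, value) {
    this.table.set(key, value);
    this.deltas.push([key, value]);
  }
  saveState() {
    if (this.deltas.length === 0) return;
    if (this.logLength + this.deltas.length > this.compactAfter) {
      // Rewrite the main file from the table; the log starts over.
      fs.writeFileSync(this.mainPath, JSON.stringify([...this.table]));
      fs.writeFileSync(this.deltaPath, '');
      this.logLength = 0;
    } else {
      // The common case: a small append instead of a full rewrite.
      const lines = this.deltas.map((d) => JSON.stringify(d)).join('\n');
      fs.appendFileSync(this.deltaPath, lines + '\n');
      this.logLength += this.deltas.length;
    }
    this.deltas = [];
  }
}
```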