-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What to do about IAVL? #7100
Comments
I think there at two separate issues:
|
Depending on the use-case it would be good to have closer look at Merkle Trees alternatives for storage commitments. I will review the following papers and write a summary next week.
|
I think we're explicitly avoiding RocksDB for various reasons. I do agree we should explore alternatives in logical storage. MMR, sparse merkle trees, and urkle trees all look like good contenders. One thing to keep in mind is that we need:
|
IAVL is very problematic in it's IO patterns. After Amino, it's probably the largest Cosmos SDK performance bottleneck. It also uses a lot of indirection that makes look ups very inefficient. |
@alexanderbez thanks for comment. Is there any resource summarizing your decision about RocksDB? |
Prefix iteration is used extensively throughout the SDK. It's how we do index scans. |
@marbar3778 mentioned the other day that our aim to be database-agnostic is also causing an additional burden on state tree update optimisation. may want to consider limiting database support in conjunction with this issue to leverage database-native efficiencies. cc @musalbas if you have thoughts on use of jellyfish here |
rocksdb is supported. I would not say we are avoiding it because we have done next to zero database configuration testing. It is still unclear if goleveldb is better for the sdk than others (badgerdb, boltdb, rocksdb). There are a couple users who do use rocksdb. Cosmos-SDK, IAVL, and Tendermint has support for 5 databases. Being that the Cosmos-SDK and hopefully Tendermint are considered frameworks for building blockchains, an agnostic approach to supported databases was taken. This is nice but causes some headaches in optimizations. We can provide sane defaults for various dbs but this leads to more testing of which we have none of. I do agree with:
Efficient storage should be a separate issue and have its own discussion. I opened an issue (#6406) to discuss supporting fewer dbs based on testing that needs to be done. This can be a start. |
Ohh I never said it wasn't supported, just from my experience, there are much better alternatives (e.g. LevelDB, Badger, etc..). One thing to keep in mind as well is that there are libraries out there that are very well supported, tested, and benchmarked. Continuing to have to maintain IAVL especially when it's sub-optimal at best, is probably not the best strategy. I would be great if someone would take lead on exploring alternatives. |
Generally, it seems that the blockchain space is heading towards Sparse Merkle trees (SMT) for state trees - including Ethereum 2.0 (although it seems now they're leaning towards polynomial commitments), Libra (jellyfish), LazyLedger plus also Google's Trillian. However, all of these projects are using slightly different designs for the tree, and so these implementations are not necessarily interchangeable. Libra and LazyLedger use a compute-optimised SMT by replacing subtrees with only default values with a single node; Ethereum 2.0 plans to use a compute-optimised tree by using a more efficient hash function with a standard SMT; and Trillian uses a standard unoptimised SMT (afaik). At LazyLedger we have settled on using the Libra SMT, and implemented it in Golang. Libra has an implementation in rust called jellyfish, but it is slightly different as it operates on "nibbles" (where each key is split into sections of 4-bytes each). In LazyLedger's SMT implementation the storage code is abstracted away as a "MapStore" interface, so devs can define their own storage backend. I'm not sure about the implementation complexity of IAVL trees, but I think SMTs are relatively simple and shouldn't be difficult to maintain. The core of the LazyLedger SMT implementation is 400 lines of code. Although we store each node in the tree in the DB, so I'm not sure if that addresses the IO performance concerns. |
I think Ethereum 2.0 big bet is on polynomial commitments. |
@musalbas - why did you settled on jellyfish? |
We're not using jellyfish, we're using our own SMT library, that adopts the same SMT design as specified in the Libra whitepaper. We chose this as it's the simplest tree design we're aware of with a O(log(k)) get/update cost. Vitalik's scheme using an optimised hash function is simpler, but there were some concerns around the security of this, and it still requires a hash operation every 16 levels in the tree. |
Does Jellyfish or LL's compute-optimized SMT allow for prefix iteration/range queries? If so, that seems like a solid bet. |
No I don't think so, because the keys are hashed, and so they should appear in random places in the tree. A range query would thus simply consist of multiple Merkle proofs. |
Prefix iteration can be done separately. It doesn't need to be done on the storage commitment unless you need proofs / commitments while doing the iteration. Range queries can be implemented with prefix iteration. Where the the range proofs are used? What's the use-case for them? Maybe this can be solved with aggregateable commitments (like the aforementioned pointproofs or RSA accumulators). |
@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries. However, I don't know what you mean by |
@alexanderbez , @musalbas - braking a range query into single queries for merkle proofs (assuming this is needed) sounds very inefficient. |
Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree. If you want to prove membership of multiple keys in the tree, that would require an individual Merkle proof per key, regardless if they are within the same "range". It would be helpful to know where range proofs are being used for the state tree though. I understand why range proofs may be helpful for the Merkle tree of transactions, but not sure why they would be needed for the state tree. |
I see. I guess this isn't a problem with IAVL because it's self-balancing... Since SMT hash keys prior to insertion like merkle-patricia trees do, I don't see how we'd have prefix iteration unless we can figure out how to leverage the underlying physical storage somehow :( |
At present, we do not support range proofs (of more than 1 key) in the SDK. Since the subspace queries do not support returning proofs. I brought it up in an issue a while ago here: #5241 And tried to do some work on it, but i couldn't get it to work and there were other more pressing blockers for IBC proofs so I dropped it and worked on those |
I believe if you use the optimised hash function optimisation for SMTs instead of the Libra optimisation, you can have free choice over where keys are stored in the tree, and can thus support range proofs. cc @adlerjohn |
@erikgrinaker is on vacation, but when he's back it would be good to understand more about the current status of IAVL and what's needed to ensure the correctness of that implementation. |
Just adding another option here: It uses RocksDB, supports checkpointing, is still an AVL tree so supports range queries, and has the pieces to support state syncing. |
As far as I know, IAVL is correct - it's just slow and inefficient, and not concurrency-safe. I think this is in part because it tries to do too much itself (e.g. versioning) which should ideally be offloaded to the underlying database engine. The access patterns are also suboptimal, requiring lots of random accesses - I'm not familiar with the state of the art in Merkle trees, but there's a reason databases use e.g. B+trees instead of binary trees. Building a performant data store is non-trivial, and not something we should spend engineering resources on if we can avoid it in my opinion. I'd be happy to help explore alternatives if necessary. |
@erikgrinaker the advantage of B+trees is that in general you have faster lookup (they are more compact). This is especially important for the storage layers, which are not very fast on random access, or where you can't efficiently use hardware caches. |
Exactly. I suspect an AVL tree is inherently inefficient given block devices and CPU caches, but I don't know how Merkelization affects these tradeoffs. |
I see this issue's description already links to libra-jellyfish-merkle, but see also Compact Sparse Merkle Trees, which are a very similar idea: |
I guess github references from discussions → issues don't work just yet, but there is some discussion of this wrt parallelism here: #8134 (comment) Also here's an ethresearch thread on the compact SMT paper tony posted above https://ethresear.ch/t/compact-sparse-merkle-trees/3741 |
I think under the propose separation of concerns (State storage as separate from storage commitments), it would be highly beneficial to consider storage solutions other than simple key/value stores. In particular classical SQL database systems. For my own work I've been coming to the conclusion that it would be REALLY nice to have the state stored in an SQL database, which would allow for the use of multiple indexes and join queries in transaction processing. My understanding is that the mature SQL databases are extremely well supported, highly efficient, with built in support for caching, indexes, ACID compliance, complex queries, etc. They may not be as fast for raw lookups as a simple key/value store, but I would bet any production grade SQL db would out perform IAVL/goleveldb (or any other Merkle tree on top of a kv store) for most modules. Especially any that make use of iterators, since AFAIK each step of the iterator requires (at minimum) another lookup (log(n)) in the nodedb. Whereas getting a series of results from an SQL table is done in a single query and is almost certainly O(log(n)) if the iterated key has an index. An SQL DB would also trivialize the whole "MultiStore" process, by just using multiple tables, rather than a single tree with key prefixes. Table names could easily be prefixed by module name to create the same sort of module based isolation within a shared store. The problem of storage commitments/proofs could then be dealt with separately, with the committed data being something like: I am currently working on a python based proof of concept for implementing an SMT (with batch inserts) on top of sqlite. If that experiment goes well, and there is at least some support for this path, I will likely try to build such a storage service. It would probably be in Haskell, as I'm hoping to use this along with kepler. I also think it's worth planning ahead of time for cross language usage. ie whatever storage system / storage commitment scheme the Cosmos-SDK ends up using should be accessible through a separate service (such as gRPC) so that other language SDK implementations can reuse the same service, and so that replacement services could be constructed later which satisfy the same interface. |
We might just be the first blockchain framework that uses RDBMS 😆 . How would merkle proofs work? Would those come from the storage commitment layer? And if so, how is that tied to the relational/logical storage layer? i.e. how do you prove what's in SQL is actually there and part of a structure? |
Yes, the idea would be for the service to keep the storage commitment layer in sync with what is stored in the database tables. I'm thinking of the storage commitment layer as an abstract cryptographic accumulator. ie something which acts like a Set, with insert/remove operations and a way to construct proofs of membership/non-membership which can be checked against either the accumulator itself, or some digest of it (root hash in the case of a merkle tree style accumulator).
To keep them in sync, at the end of every block, the storage layer would need to send a set of additions/deletions to update the accumulator with. These would be hashes of the serializations of table rows (ie what a CSV file stores), prefixed with the name of the table they were from. So for example:
Would be the result of a full transfer from address2 to address1. So proving that some value is in the accumulator would be equivalent to a proof that the specified table has a row with that value. It's basically like having two stores, one which is fast for inserting data and lookups (sql). And another one which doesn't have to be able to return the inserted data at all, but which is fast for creating/verifying proofs that a value was inserted (accumulator). |
@alexanderbez Chain Core was using Postgres circa 2016, FWIW. Seemed to work pretty well there, and had some nice cloud deployment options (e.g. Amazon Aurora) |
Who would like to take a first stab at an ADR? There are a lot of stakeholders, but I suggest one person own the ADR and solicit further design criteria and feedback from those who have contributed to this thread. If we are indeed going with SMT, perhaps someone from LazyLedger since they already have an initial implementation? |
Yeah, we can surely take the first stab on an ADR and collect the input and feedback from the different stakeholders. @tzdybal started looking into replacing iavl with an SMT in the SDK and is interested in taking the lead on this :-) A shared slack or discord channel with all parties involved could also be helpful. We'd like to understand the different use-cases and priorities better. Should we go ahead and create a shared slack/discord, too? |
Let's start a channel in Discord. |
Great! I can't start a discord channel myself but looking forward to it 👍🏼 BTW regarding that SQL / RDBMS discussion above:
it might not be a "classical" blockchain but google's trillian also comes with a mysql based storage backend (among others). As far as I remember it is not meant to be used in production as it is rather slow but it works fine for testing and developing (and everything is authenticated as with the other backends). Ideally, the APIs / interfaces we settle on would even allow RDBMS backends (although I'm skeptical about whether we'd want to recommend that). |
@hxrts - when doing a report I went through that thread. IIRC, LL SMT is a realization of that idea. |
@hxrts , @liamsi -- I spent good amount of time analyzing this issue and looking for a good solution. So let me outline the thoughts coming from researching this subject. Happy to contribute. If you want I can kick off and port the design described below into ADR. With @aaronc we were thinking about SQL data base with regards to: #7099 For the usecase linked above (#7099) we started with an approach where each message / data will have a corresponding schema (#7097 and ADR-038) to properly assign a merkle proof to an off-chain data. I was thinking about an alternative design, influenced by a TurboGeth work I inspected few months ago with respect to the report I created for this task (#7100 (comment)) and the general approach (#7100 (comment)) Decoupling state commitment from storage
Canonical IDsToday, only a module knows about the ID (key) for every state object. The ADR-038 tries to solve it. I am trying to find another way to solve it by using canonical IDs - an ID constructed directly from a The idea is to use a
I'm not 100% sure if this work, but if not we fall back into the If we stick with the "current" key construction, then I would strongly advocate to hash the key before storing it. SMT for state commitments.This part is more obvious:
Finally we store a Data StoreHere we need few things:
Storage engineKV store are order of magnitude faster than RDBMS. TurboGeth is using KV based indexing (so at the end they have 3 stores). Based on all the existing blockchain implementations, I would suggest to start with porting SMT to Cosmos, update the existing KV interface, and only later consider experimenting with RDBMS. For more sophisticated use-cases, as explained above, we will need a separate RDBMS anyway. We don't want to kill node with analytics operations. |
Algorand is using SQLite as their main state storage (not sure if they are using anything else in fact). I outlined this in overview of storage backends. |
Okay, I can totally buy that a KV store would be an order of magnitude faster than RDBMS for looking up an object by hash. And for modules where that is the access pattern that fully makes sense. For example in my project I have Utxo style transactions, where a transaction specifies a set of hashes of the existing Utxos it will spend, and a set of new Utxos to be created. Storing existing Utxos by hash in a KV store would certainly be faster than using an SQL table with a hash primary key. (I would be curious to know exactly how much faster it be, say to compare in process sqlite with an in process KV store) My desire to use something which supports full SQL queries was about modules/access patterns which require multiple indexes or joins. For example, an order book based DEX, might have a storage schema like: This can of course be replicated on a KV store by manually constructing extra indexes, and using iterators / repeated lookups to simulate joins. AFAIK this is what Cosmos-SDK is currently doing in modules like staking, and is the intention of #7098. This is fine, but it's a lot of extra engineering work to basically replicate what a RDBMS is designed to do. P.S: |
The most important thing is to Decoupling state commitment from storage . Once this will be done, we can have different storage backends (or even 2 at the same time) and experiment further. |
There was a need to use chat / discussion for getting into more details. Let's move here if we need to discuss anything, and keep this issue only for important notes related to decisions. So, here is the Github discussion |
For datastore I’m partial to LMDB or BadgerDB for all the reasons mentioned in the review doc. Faster, feature rich, and MVCC. Having a hard time distinguishing between those two, it looks like BadgerDB has the performance edge. It’s interesting to me that BadgerDB appears to be more popular and battle tested outside of the blockchain space but it appears no popular blockchain has decided to use it (as their default, anyways)? I’m curious if turbo-geth compared those two. Turbo-geth also uses an ETL preprocessing approach to sort keys before writing to the DB in order reduce write amplification, without this I am curious how much more performant LMDB actually is vs LevelDB. I have sort of knee-jerk reaction to using an SQL DB as the primary storage due to performance reasons. An interface (e.g. tm-db) designed for operating on-top of kvstores will "work" for SQL databases but I don't think it will be performant enough for practical use. I can say from experience that substituting Postgres for LevelDB in go-etheruem does not work without a major access pattern overhaul, although SQLite should be significantly faster than Postgres for simple operations. I think the primary DB being embedded is also crucial, that would disqualify most SQL DBs but not SQLite. SQL seems "less risky" as a secondary database, populated by means of ADR 038 etc. For state commitments, I agree SMT looks like the best option! One question, why wouldn't the value in the SMT be the object itself instead of the hash of the object? Still thinking about the Canonical IDs. |
Note that several SQL databases expose some sort of high-performance key/value API which skips the SQL query engine. For example MySQL implements the Postgres has hstore. |
This is linked to meta-issue #7096.
Summary
This issue discusses potential approaches to the ongoing maintenance or replacement of IAVL.
Problem Description
It our understanding that IAVL has effectively become an orphaned project within the Cosmos ecosystem. I don't have a good understanding of all of the issues related to IAVL, but welcome those with more context to share. /cc @ethanfrey @erikgrinaker @alexanderbez
Proposals
Overhaul IAVL
A complete overhaul has been proposed as one solution. I don't have the context fully for why this is needed, but would appreciate those who do to share those details here.
Replace IAVL
Replacing IAVL with a more mature alternative has been proposed as well. I'll just list out the alternatives that have been mentioned here:
The text was updated successfully, but these errors were encountered: