Blockstore: Migrate ShredIndex type to more efficient data structure #3570
Comments
I have an implementation ready to go. The final remaining bit is migrating the blockstore to the new data structure. I chatted with @alessandrod about this; the proposed solution would write the new data structure to a new column while continuing to write the existing data structure to its current column for backwards compatibility. Importantly, the serializing and writing of the existing column would be batched and executed outside of the main execution path (in a separate thread, for example). Ser/deserialization is the most expensive part of the current implementation, so continuing to serialize and write in the main path is counterproductive here. @steviez, any thoughts on this?
A new column could definitely work, and yes, still writing to the existing column would be important to allow someone to downgrade from 2.2 to 2.1 (assuming this lands in 2.2) without issue.
I don't think this approach will work. If we do the serialization and write to the DB in another thread, this implies that this work would occur in a separate write batch from the one that persists the SlotMeta and shreds. In this scenario, the SlotMeta and shreds will have been persisted, but the index in the legacy column will not have been.
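To make the atomicity concern concrete, here is a minimal sketch using the rust-rocksdb crate (column family names are made up for illustration; this is not the actual agave write path):

```rust
use rocksdb::{WriteBatch, DB};

/// Sketch: today, the SlotMeta, shred, and index puts can share one
/// atomic WriteBatch, so they are persisted together or not at all.
fn persist_slot(db: &DB, key: &[u8], meta: &[u8], shred: &[u8], index: &[u8]) {
    let cf_meta = db.cf_handle("slot_meta").unwrap();
    let cf_shred = db.cf_handle("data_shred").unwrap();
    let cf_index = db.cf_handle("index").unwrap();

    let mut batch = WriteBatch::default();
    batch.put_cf(cf_meta, key, meta);
    batch.put_cf(cf_shred, key, shred);
    // Deferring this put to another thread would move it into a
    // *separate* batch: a crash between the two batches leaves the
    // SlotMeta and shreds persisted but the legacy index missing.
    batch.put_cf(cf_index, key, index);
    db.write(batch).expect("atomic write");
}
```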
I still think one of these approaches might be the right idea. Another option could be splitting the PR up a bit and backporting some of it. Namely:

- Backport the ability to read (but not write) the new column to v2.1.
- In v2.2, write only the new column, retaining the ability to read the legacy column.

This would avoid the extra writing + serialization at the cost of some minimal extra reading in the transient period when we upgrade from v2.1 to v2.2.
Recapitulating to make sure I understand correctly. The idea here is to split the column migration across three releases such that v2.1 (via a backport) learns to read the new column, v2.2 writes only the new column, and a later release drops the legacy column. In particular:

- v2.2 never writes the legacy column; instead, reads fall back to the legacy column for slots written before the upgrade.
- On a downgrade, the backported v2.1 similarly falls back to reading the new column for slots written by v2.2.

I think this is an elegant solution. We obviate the overhead of serializing and writing two columns while supporting a downgrade strategy. If this sounds good to everyone, I will move forward with this strategy.
Yep, what you repeated back is consistent with what I'm thinking
Correct, these two bullets repeat back the key difference in the scheme I outlined. Instead of eating the overhead of writing both columns for an extended period, we only incur the (much smaller, by comparison) overhead of an extra read for a couple of slots when an operator toggles versions (upgrade or downgrade).
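In code terms, the transient read path under this scheme might look something like the following sketch (the types and method bodies are stand-ins, not the actual Blockstore API):

```rust
use std::collections::HashMap;

type Slot = u64;

/// Stand-ins for the two RocksDB column families.
struct Blockstore {
    new_index_cf: HashMap<Slot, Vec<u8>>,
    legacy_index_cf: HashMap<Slot, Vec<u8>>,
}

impl Blockstore {
    /// v2.2 read path: prefer the new column; fall back to the legacy
    /// column for slots written before the upgrade. Only those few
    /// transient slots pay for the second read.
    fn read_index_bytes(&self, slot: Slot) -> Option<&Vec<u8>> {
        self.new_index_cf
            .get(&slot)
            .or_else(|| self.legacy_index_cf.get(&slot))
    }
}
```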
Great, I'll move forward with this!
Problem
In the Blockstore, we have a column called index. The index column is used for tracking which shreds we have / haven't received for any given slot. The data structure to track this is Index:

agave/ledger/src/blockstore_meta.rs, lines 109 to 113 (at 7629a15)

which is a BTreeSet<Slot> for both data shreds and coding shreds:

agave/ledger/src/blockstore_meta.rs, lines 116 to 119 (at 7629a15)
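For context, the referenced definitions look roughly like this (paraphrased, with derives and details omitted; see the permalinks above for the exact code):

```rust
use std::collections::BTreeSet;

type Slot = u64;

/// Per-slot index of received shreds, stored in the `index` column.
pub struct Index {
    pub slot: Slot,
    data: ShredIndex,
    coding: ShredIndex,
}

/// Presence/absence of shreds, one set entry per received shred index.
pub struct ShredIndex {
    index: BTreeSet<Slot>,
}
```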
Using the BTreeSet here is expensive due to the allocations, especially considering that this type is constantly being serialized & deserialized as we push updates into RocksDB.

Proposed Solution
Instead of a BTreeSet, we can use a simple bit vector of fixed size. This is because we know that a block can have at most 32,768 data shreds and 32,768 coding shreds:

agave/ledger/src/blockstore.rs, line 107 (at 7629a15)
agave/ledger/src/shred/shred_code.rs, line 14 (at 7629a15)
Thus, we can instead use a [u8; 4096] for each of the data and coding shred sets. Compared to the current BTreeSet, the break-even point in disk footprint would be 512 elements (if we ignore the cost of pointers). That is, 512 entries × 8 bytes per u64 comes to 4,096 bytes, the size of the bit vector.

Current MNB (mainnet-beta) usage exceeds this, so we'll easily break even on disk footprint, but the real winnings will be reduced allocations and time to serialize / deserialize.
One tricky thing will be rolling this change out. We can obviously just start writing to a new column in some new version (say v2.2). However, if an operator has to roll back to v2.1 (which is unaware that the index is being written to the new column), we're likely to hit consistency issues. This will probably need more thought, but some possible ideas are:

- Write both the new and legacy columns for a transition period, so that v2.1 can still read the legacy column after a downgrade.
- Backport read support for the new column to v2.1, so that reads can fall back across columns instead of double-writing.
Previous Attempt
Lastly, here is an attempt I took at this a few years ago: solana-labs#18575
As the PR shows, the actual change to introduce the new column isn't bad at all. I think we just got to discussing the support-upgrade-and-downgrade problem, and then the PR went stale in favor of likely higher-priority issues.