Blockstore: Migrate ShredIndex type to more efficient data structure #3570

Open · steviez opened this issue Nov 11, 2024 · 7 comments · May be fixed by #3900

steviez commented Nov 11, 2024

Problem

In the Blockstore, we have a column called index. The index column is used for tracking which shreds we have and haven't received for any given slot. The data structure that tracks this is Index:

pub struct Index {
    pub slot: Slot,
    data: ShredIndex,
    coding: ShredIndex,
}

where ShredIndex wraps a BTreeSet<u64> of shred indices, used for both data shreds and coding shreds:
pub struct ShredIndex {
    /// Map representing presence/absence of shreds
    index: BTreeSet<u64>,
}

Using the BTreeSet here is expensive due to the allocations, especially considering that this type is constantly being serialized and deserialized as we push updates into RocksDB.

Proposed Solution

Instead of a BTreeSet, we can use a simple fixed-size bit vector, because we know that a block can have at most 32,768 data shreds and 32,768 coding shreds:

pub const MAX_DATA_SHREDS_PER_SLOT: usize = 32_768;

pub const MAX_CODE_SHREDS_PER_SLOT: usize = MAX_DATA_SHREDS_PER_SLOT;

Thus, we can instead use a [u8; 4096] for each of the data and coding shred indexes. Compared to the current BTreeSet, the break-even point in disk footprint would be 512 elements (ignoring the cost of pointers). That is,

4096 bytes == 8 bytes / u64 * 512

Current MNB (mainnet-beta) usage exceeds this, so we'll easily break even on disk footprint, but the real wins will be reduced allocations and faster serialization / deserialization.
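
As a rough sketch (not the actual implementation; the type, field, and method names here are hypothetical), the fixed-size bit vector could look something like this, with one bit per possible shred index:

pub const MAX_DATA_SHREDS_PER_SLOT: usize = 32_768;

/// Hypothetical fixed-size bit vector: one bit per possible shred index,
/// so 32,768 bits == 4096 bytes per slot.
pub struct ShredBitIndex {
    bits: [u8; MAX_DATA_SHREDS_PER_SLOT / 8],
}

impl ShredBitIndex {
    pub fn new() -> Self {
        Self { bits: [0u8; MAX_DATA_SHREDS_PER_SLOT / 8] }
    }

    /// Mark a shred index as received.
    pub fn insert(&mut self, index: u64) {
        let index = index as usize;
        assert!(index < MAX_DATA_SHREDS_PER_SLOT);
        self.bits[index / 8] |= 1 << (index % 8);
    }

    /// Check whether a shred index has been received.
    pub fn contains(&self, index: u64) -> bool {
        let index = index as usize;
        index < MAX_DATA_SHREDS_PER_SLOT && ((self.bits[index / 8] >> (index % 8)) & 1) == 1
    }
}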

One tricky thing will be rolling this change out. We can obviously just start writing to a new column in some new version (say v2.2). However, if an operator has to roll back to v2.1 (which is unaware that the index is being written to a new column), we're likely to hit consistency issues. This will probably need more thought, but some possible ideas are:

  • Write both columns in the version that this change is introduced in
    • Maybe have CLI flag to disable writing to the old column so that we can avoid the extra I/O once the cluster has fully progressed to new version
  • Write a "converter" that runs at startup to populate old index column from new index column
    • This would obviously need to live in the old version (v2.1 in above example)
    • Workload could be reduced here since the index column is not accessed once a slot is full, though it is probably safer to wait until the slot has been rooted / is older than the latest root

Previous Attempt

Lastly, here is an attempt I took at this a few years ago: solana-labs#18575

As the PR shows, the actual change to introduce the new column isn't bad at all. I think we just got to discussing the support-upgrade-and-downgrade problem, and then the PR went stale, likely in favor of higher-priority issues.


steviez commented Nov 11, 2024

Also, one quick gotcha: if you let serde handle a large array in the default manner, it will serialize it element by element, which is horribly inefficient.

  • If we use a BitVec crate, we should ensure that it serializes efficiently and upstream a change if it does not
  • If we just do our own, something like annotating with Bytes (e.g., serde_bytes) instead of the default serde impl should do the trick:
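
For example, a minimal sketch of that annotation using the serde_bytes crate (the struct and field names are hypothetical, and a Vec<u8> is used here just for illustration):

use serde::{Deserialize, Serialize};

/// Hypothetical wrapper: #[serde(with = "serde_bytes")] makes serde treat the
/// buffer as a single contiguous byte string rather than serializing it
/// element by element.
#[derive(Serialize, Deserialize)]
pub struct ShredBitIndex {
    #[serde(with = "serde_bytes")]
    bits: Vec<u8>, // 4096 bytes covers all 32,768 possible shred indices
}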

cpubot self-assigned this Nov 26, 2024

cpubot commented Nov 28, 2024

I have an implementation ready to go. The final remaining bit is migrating the blockstore to the new data structure. I chatted with @alessandrod on this; the proposed solution would write the new data structure to a new column while continuing to write the existing data structure to its current column for backwards compatibility. Importantly, the serialization and writing of the existing column would be batched and executed outside of the main execution path (in a separate thread, for example). Serialization/deserialization is the most expensive part of the current implementation, so continuing to serialize and write in the main path is counterproductive here. @steviez, any thoughts on this?

steviez changed the title from "Migrate Blockstore ShredIndex type to more efficient data structure" to "Blockstore: Migrate ShredIndex type to more efficient data structure" on Dec 1, 2024

steviez commented Dec 2, 2024

the proposed solution would write the new data structure to a new column, continuing to write the existing data structure to its same column for backwards compatibility.

New column could definitely work, and yes, still writing to the existing column would be important to allow someone to downgrade from 2.2 to 2.1 (assuming this lands in 2.2) without issue.

Importantly, the serializing and writing of the existing column would be batched and executed outside of the main execution path (in a separate thread, for example).

I don't think this approach will work. If we do the serialization and write to the DB in another thread, this implies that this work would occur in a separate WriteBatch. If it occurs in a separate WriteBatch, we break the atomicity-across-column-families guarantee that is one of the main reasons for using the WriteBatch (in addition to amortizing overhead). So, the following scenario could potentially make the node unrecoverable:

  • Commit some new shreds along with an update to the new (BitVec) index
  • Send the work for the old (BTreeSet) index to some other thread
  • Node stops for some reason (unrelated panic, operator stopped it, OOM, etc)
  • Operator restarts node with old version to troubleshoot

In this scenario, the SlotMeta and shreds will have been persisted, but the index in the legacy column will not have been.
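
To illustrate the atomicity concern, here is a rough standalone sketch using the rust rocksdb crate (the column family names, key, and values are hypothetical; this is not the actual Blockstore code). Updates across column families are only atomic when they land in the same WriteBatch; moving one of the puts into a separate batch on another thread loses that guarantee.

use rocksdb::{Options, WriteBatch, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    // Hypothetical column family names for the legacy and new index formats.
    let db = DB::open_cf(&opts, "/tmp/example-db", ["index_legacy", "index_v2"])?;
    let legacy_cf = db.cf_handle("index_legacy").expect("cf exists");
    let v2_cf = db.cf_handle("index_v2").expect("cf exists");

    // Both puts go into one WriteBatch, so they are committed (or lost) together,
    // along with any other column family updates added to the same batch.
    let mut batch = WriteBatch::default();
    batch.put_cf(legacy_cf, b"slot-42", b"serialized BTreeSet index");
    batch.put_cf(v2_cf, b"slot-42", b"serialized bit-vector index");
    db.write(batch)?;

    Ok(())
}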


steviez commented Dec 2, 2024

  • Write both columns in the version that this change is introduced in
    • Maybe have CLI flag to disable writing to the old column so that we can avoid the extra I/O once the cluster has fully progressed to new version
  • Write a "converter" that runs at startup to populate old index column from new index column
    • This would obviously need to live in the old version (v2.1 in above example)
    • Workload could be reduced here since the index column is not accessed once a slot is full, though it is probably safer to wait until the slot has been rooted / is older than the latest root

I still think one of these approaches might be the right idea. Another option could be splitting the PR up a bit and backporting some of it. Namely:

  • Create a PR that introduces the new column and can read from it, and BP (backport) it to v2.1
    • We'd probably try to read the old column first, but fall back to trying to read from the new column
  • Create a subsequent PR that writes only the new format; this is NOT BP'd to v2.1
    • We could swap the order to try to read the new column first and then fall back to the old column
  • Operators upgrade to v2.2 eventually and start writing + reading the new format
  • We can then remove the read-from-new-column-and-fallback-to-old-column mechanism in v2.3 or whatever

This would avoid the extra writing + serialization at the cost of some minimal extra reading in the transient period when we upgrade from v2.1 to v2.2.
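
For illustration, here is a self-contained sketch of the v2.2-style read path (try the new column first, fall back to the legacy column); the names and the on-the-fly conversion are hypothetical, and closures stand in for the actual column reads:

use std::collections::BTreeSet;

// Hypothetical stand-ins for the two on-disk formats.
type LegacyShredIndex = BTreeSet<u64>; // current BTreeSet-based column
type ShredBitIndex = Vec<u8>; // proposed 4096-byte bit-vector column

/// Prefer the new column; fall back to the legacy column for slots written by
/// an older version, converting them to the bit-vector form on the fly.
fn read_index(
    slot: u64,
    read_new: impl Fn(u64) -> Option<ShredBitIndex>,
    read_legacy: impl Fn(u64) -> Option<LegacyShredIndex>,
) -> Option<ShredBitIndex> {
    if let Some(index) = read_new(slot) {
        return Some(index);
    }
    read_legacy(slot).map(|legacy| {
        let mut bits = vec![0u8; 4096];
        for shred_index in legacy {
            // Assumes shred_index < 32,768 (MAX_DATA_SHREDS_PER_SLOT).
            bits[(shred_index / 8) as usize] |= 1 << (shred_index % 8);
        }
        bits
    })
}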


cpubot commented Dec 2, 2024

I still think one of these approaches might be the right idea. Another option could be splitting the PR up a bit and backporting some of it. Namely:

  • Create a PR that introduces the new column and can read from it, and BP (backport) it to v2.1
    • We'd probably try to read the old column first, but fall back to trying to read from the new column
  • Create a subsequent PR that writes only the new format; this is NOT BP'd to v2.1
    • We could swap the order to try to read the new column first and then fall back to the old column
  • Operators upgrade to v2.2 eventually and start writing + reading the new format
  • We can then remove the read-from-new-column-and-fallback-to-old-column mechanism in v2.3 or whatever

This would avoid the extra writing + serialization at the cost of some minimal extra reading in the transient period when we upgrade from v2.1 to v2.2.

Recapitulating to make sure I understand correctly.

The idea here is to split the column migration across three releases such that:

  1. Initial release simply adds support for reading the new column as a fallback case, and does no writing of the new column.
    • This lays the foundation for a downgrade. For example, assume operators have upgraded to release 2/3 in the chain (bullet point 2), and as such have been solely writing to the new column. In the event of a downgrade, release 1/3 still understands how to read the new column, while continuing to read and write the legacy version.
    • This ensures release 1/3 doesn't incur the overhead of serializing the new column, but can still understand and use it in the event of a downgrade.
  2. This release reads and writes the new column as its primary target, yet still understands the legacy column for fallback reads. It does no writing of the legacy column.
    • This initiates the migration. We can safely downgrade to release 1 because it understands how to read the new column written in release 2.
    • This ensures release 2/3 doesn't incur the overhead of serializing the legacy column, but can still understand and use it, which is necessary during the partial migration period (i.e., blockstore still contains data in the legacy column format).
  3. Once the release is considered stable and we don't anticipate a downgrade, we can remove the legacy column and its associated fallback reads.

I think this is an elegant solution. We obviate the overhead of serializing and writing two columns while supporting a downgrade strategy. If this sounds good to everyone, I will move forward with this strategy.


steviez commented Dec 2, 2024

The idea here is to split the column migration across three releases such that:

  1. Initial release simply adds support for reading the new column as a fallback case, and does no writing of the new column.
  2. This release reads and writes the new column as its primary target, yet still understands the legacy column for fallback reads. It does no writing of the legacy column.
  3. Once the release is considered stable and we don't anticipate a downgrade, we can remove the legacy column and its associated fallback reads.

Yep, what you repeated back is consistent with what I'm thinking.

This ensures release 1/3 doesn't incur the overhead of serializing the new column, but can still understand and use it in the event of a downgrade.
This ensures release 2/3 doesn't incur the overhead of serializing the legacy column, but can still understand and use it, which is necessary during the partial migration period (i.e., blockstore still contains data in the legacy column format).

Correct, these two bullets repeat back the key difference in the scheme I outlined. Instead of eating the overhead of writing both columns for an extended period, we only incur the (much smaller) overhead of an extra read for a couple of slots when an operator toggles versions (upgrade or downgrade).


cpubot commented Dec 2, 2024

Great, I'll move forward with this!
