Blockstore: Migrate ShredIndex type to more efficient data structure #3570

Open · steviez opened this issue Nov 11, 2024 · 7 comments · May be fixed by #3900

steviez commented Nov 11, 2024

Problem

In the Blockstore, we have a column called index. The index column is used for tracking which shreds we have and haven't received for any given slot. The data structure that tracks this is Index:

pub struct Index {
    pub slot: Slot,
    data: ShredIndex,
    coding: ShredIndex,
}

where ShredIndex wraps a BTreeSet<u64> of shred indices, used for both data shreds and coding shreds:
pub struct ShredIndex {
    /// Map representing presence/absence of shreds
    index: BTreeSet<u64>,
}

Using the BTreeSet here is expensive due to the allocations, especially considering that this type is constantly being serialized and deserialized as we push updates into RocksDB.

Proposed Solution

Instead of a BTreeSet, we can use a simple fixed-size bit vector, because we know that a block can have at most 32,768 data shreds and 32,768 coding shreds:

pub const MAX_DATA_SHREDS_PER_SLOT: usize = 32_768;

pub const MAX_CODE_SHREDS_PER_SLOT: usize = MAX_DATA_SHREDS_PER_SLOT;

Thus, we can instead use a [u8; 4096] for each of the data and coding shred indexes. Compared to the current BTreeSet, the break-even point in disk footprint would be 512 elements (ignoring the cost of pointers). That is,

4096 bytes == 8 bytes / u64 * 512

Current MNB (mainnet-beta) usage exceeds this, so we'll easily break even on disk footprint, but the real wins will be reduced allocations and faster serialization / deserialization.
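
As a rough sketch (not the actual implementation; the type, field, and method names here are hypothetical), the fixed-size bit vector could look something like this, with one bit per possible shred index:

pub const MAX_DATA_SHREDS_PER_SLOT: usize = 32_768;

/// Hypothetical fixed-size bit vector: one bit per possible shred index,
/// so 32,768 bits == 4096 bytes per slot.
pub struct ShredBitIndex {
    bits: [u8; MAX_DATA_SHREDS_PER_SLOT / 8],
}

impl ShredBitIndex {
    pub fn new() -> Self {
        Self { bits: [0u8; MAX_DATA_SHREDS_PER_SLOT / 8] }
    }

    /// Mark a shred index as received.
    pub fn insert(&mut self, index: u64) {
        let index = index as usize;
        assert!(index < MAX_DATA_SHREDS_PER_SLOT);
        self.bits[index / 8] |= 1 << (index % 8);
    }

    /// Check whether a shred index has been received.
    pub fn contains(&self, index: u64) -> bool {
        let index = index as usize;
        index < MAX_DATA_SHREDS_PER_SLOT && ((self.bits[index / 8] >> (index % 8)) & 1) == 1
    }
}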

One tricky thing will be rolling this change out. We can obviously just start writing to a new column in some new version (say v2.2). However, if an operator has to roll back to v2.1 (which is unaware that the index is being written to a new column), we're likely to hit consistency issues. This will probably need more thought, but some possible ideas are:

  • Write both columns in the version that this change is introduced in
    • Maybe have CLI flag to disable writing to the old column so that we can avoid the extra I/O once the cluster has fully progressed to new version
  • Write a "converter" that runs at startup to populate old index column from new index column
    • This would obviously need to live in the old version (v2.1 in above example)
    • Workload could be reduced here since the index column is not accessed once a slot is full, though it is probably safer to wait until the slot has been rooted / is older than the latest root

Previous Attempt

Lastly, here is an attempt I took at this a few years ago: solana-labs#18575

As the PR shows, the actual change to introduce the new column isn't bad at all. I think we just got to discussing the support-upgrade-and-downgrade problem, and then the PR went stale, likely in favor of higher-priority issues.


steviez commented Nov 11, 2024

Also, one quick gotcha: if you let serde handle a large array in the default manner, it will serialize it element by element, which is horribly inefficient.

  • If we use a BitVec crate, we should ensure that it serializes efficiently and upstream a change if it does not
  • If we just do our own, something like annotating with Bytes (e.g., serde_bytes) instead of the default serde impl should do the trick:
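
For example, a minimal sketch of that annotation using the serde_bytes crate (the struct and field names are hypothetical, and a Vec<u8> is used here just for illustration):

use serde::{Deserialize, Serialize};

/// Hypothetical wrapper: #[serde(with = "serde_bytes")] makes serde treat the
/// buffer as a single contiguous byte string rather than serializing it
/// element by element.
#[derive(Serialize, Deserialize)]
pub struct ShredBitIndex {
    #[serde(with = "serde_bytes")]
    bits: Vec<u8>, // 4096 bytes covers all 32,768 possible shred indices
}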

cpubot self-assigned this Nov 26, 2024

cpubot commented Nov 28, 2024

I have an implementation ready to go. The final remaining bit is migrating the blockstore to the new data structure. I chatted with @alessandrod on this; the proposed solution would write the new data structure to a new column while continuing to write the existing data structure to its current column for backwards compatibility. Importantly, the serialization and writing of the existing column would be batched and executed outside of the main execution path (in a separate thread, for example). Serialization/deserialization is the most expensive part of the current implementation, so continuing to serialize and write in the main path is counterproductive here. @steviez, any thoughts on this?

steviez changed the title from "Migrate Blockstore ShredIndex type to more efficient data structure" to "Blockstore: Migrate ShredIndex type to more efficient data structure" on Dec 1, 2024

steviez commented Dec 2, 2024

the proposed solution would write the new data structure to a new column, continuing to write the existing data structure to its same column for backwards compatibility.

New column could definitely work, and yes, still writing to the existing column would be important to allow someone to downgrade from 2.2 to 2.1 (assuming this lands in 2.2) without issue.

Importantly, the serializing and writing of the existing column would be batched and executed outside of the main execution path (in a separate thread, for example).

I don't think this approach will work. If we do the serialization and write to the DB in another thread, this implies that this work would occur in a separate WriteBatch. If it occurs in a separate WriteBatch, we break the atomicity-across-column-families guarantee that is one of the main reasons for using the WriteBatch (in addition to amortizing overhead). So, the following scenario could potentially make the node unrecoverable:

  • Commit some new shreds along with an update to the new (BitVec) index
  • Send the work for the old (BTreeSet) index to some other thread
  • Node stops for some reason (unrelated panic, operator stopped it, OOM, etc)
  • Operator restarts node with old version to troubleshoot

In this scenario, the SlotMeta and shreds will have been persisted, but the index in the legacy column will not have been.
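
To illustrate the atomicity concern, here is a rough standalone sketch using the rust rocksdb crate (the column family names, key, and values are hypothetical; this is not the actual Blockstore code). Updates across column families are only atomic when they land in the same WriteBatch; moving one of the puts into a separate batch on another thread loses that guarantee.

use rocksdb::{Options, WriteBatch, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    // Hypothetical column family names for the legacy and new index formats.
    let db = DB::open_cf(&opts, "/tmp/example-db", ["index_legacy", "index_v2"])?;
    let legacy_cf = db.cf_handle("index_legacy").expect("cf exists");
    let v2_cf = db.cf_handle("index_v2").expect("cf exists");

    // Both puts go into one WriteBatch, so they are committed (or lost) together,
    // along with any other column family updates added to the same batch.
    let mut batch = WriteBatch::default();
    batch.put_cf(legacy_cf, b"slot-42", b"serialized BTreeSet index");
    batch.put_cf(v2_cf, b"slot-42", b"serialized bit-vector index");
    db.write(batch)?;

    Ok(())
}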


steviez commented Dec 2, 2024

  • Write both columns in the version that this change is introduced in
    • Maybe have CLI flag to disable writing to the old column so that we can avoid the extra I/O once the cluster has fully progressed to new version
  • Write a "converter" that runs at startup to populate old index column from new index column
    • This would obviously need to live in the old version (v2.1 in above example)
    • Workload could be reduced here since the index column is not accessed once a slot is full, though it is probably safer to wait until the slot has been rooted / is older than the latest root

I still think one of these approaches might be the right idea. Another option could be splitting the PR up a bit and backporting some of it. Namely:

  • Create a PR that introduces the new column and can read from it, and BP (backport) it to v2.1
    • We'd probably try to read the old column first, but fall back to trying to read from the new column
  • Create a subsequent PR that writes only the new format; this is NOT BP'd to v2.1
    • We could swap the order to try to read the new column first and then fall back to the old column
  • Operators upgrade to v2.2 eventually and start writing + reading the new format
  • We can then remove the read-from-new-column-and-fallback-to-old-column mechanism in v2.3 or whatever

This would avoid the extra writing + serialization at the cost of some minimal extra reading in the transient period when we upgrade from v2.1 to v2.2.
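
For illustration, here is a self-contained sketch of the v2.2-style read path (try the new column first, fall back to the legacy column); the names and the on-the-fly conversion are hypothetical, and closures stand in for the actual column reads:

use std::collections::BTreeSet;

// Hypothetical stand-ins for the two on-disk formats.
type LegacyShredIndex = BTreeSet<u64>; // current BTreeSet-based column
type ShredBitIndex = Vec<u8>; // proposed 4096-byte bit-vector column

/// Prefer the new column; fall back to the legacy column for slots written by
/// an older version, converting them to the bit-vector form on the fly.
fn read_index(
    slot: u64,
    read_new: impl Fn(u64) -> Option<ShredBitIndex>,
    read_legacy: impl Fn(u64) -> Option<LegacyShredIndex>,
) -> Option<ShredBitIndex> {
    if let Some(index) = read_new(slot) {
        return Some(index);
    }
    read_legacy(slot).map(|legacy| {
        let mut bits = vec![0u8; 4096];
        for shred_index in legacy {
            // Assumes shred_index < 32,768 (MAX_DATA_SHREDS_PER_SLOT).
            bits[(shred_index / 8) as usize] |= 1 << (shred_index % 8);
        }
        bits
    })
}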


cpubot commented Dec 2, 2024

I still think one of these approaches might be the right idea. Another option could be splitting the PR up a bit and backporting some of it. Namely:

  • Create a PR that introduces the new column and can read from it, and BP (backport) it to v2.1
    • We'd probably try to read the old column first, but fall back to trying to read from the new column
  • Create a subsequent PR that writes only the new format; this is NOT BP'd to v2.1
    • We could swap the order to try to read the new column first and then fall back to the old column
  • Operators upgrade to v2.2 eventually and start writing + reading the new format
  • We can then remove the read-from-new-column-and-fallback-to-old-column mechanism in v2.3 or whatever

This would avoid the extra writing + serialization at the cost of some minimal extra reading in the transient period when we upgrade from v2.1 to v2.2.

Recapitulating to make sure I understand correctly.

The idea here is to split the column migration across three releases such that:

  1. Initial release simply adds support for reading the new column as a fallback case, and does no writing of the new column.
    • This lays the foundation for a downgrade. For example, assume operators have upgraded to release 2/3 in the chain (bullet point 2), and as such have been solely writing to the new column. In the event of a downgrade, release 1/3 still understands how to read the new column, while continuing to read and write the legacy version.
    • This ensures release 1/3 doesn't incur the overhead of serializing the new column, but can still understand and use it in the event of a downgrade.
  2. This release reads and writes the new column as its primary target, yet still understands the legacy column for fallback reads. It does no writing of the legacy column.
    • This initiates the migration. We can safely downgrade to release 1 because it understands how to read the new column written in release 2.
    • This ensures release 2/3 doesn't incur the overhead of serializing the legacy column, but can still understand and use it, which is necessary during the partial migration period (i.e., blockstore still contains data in the legacy column format).
  3. Once the release is considered stable and we don't anticipate a downgrade, we can remove the legacy column and its associated fallback reads.

I think this is an elegant solution. We obviate the overhead of serializing and writing two columns while supporting a downgrade strategy. If this sounds good to everyone, I will move forward with this strategy.


steviez commented Dec 2, 2024

The idea here is to split the column migration across three releases such that:

  1. Initial release simply adds support for reading the new column as a fallback case, and does no writing of the new column.
  2. This release reads and writes the new column as its primary target, yet still understands the legacy column for fallback reads. It does no writing of the legacy column.
  3. Once the release is considered stable and we don't anticipate a downgrade, we can remove the legacy column and its associated fallback reads.

Yep, what you repeated back is consistent with what I'm thinking.

This ensures release 1/3 doesn't incur the overhead of serializing the new column, but can still understand and use it in the event of a downgrade.
This ensures release 2/3 doesn't incur the overhead of serializing the legacy column, but can still understand and use it, which is necessary during the partial migration period (i.e., blockstore still contains data in the legacy column format).

Correct, these two bullets repeat back the key difference in the scheme I outlined. Instead of eating the overhead of writing both columns for an extended period, we only incur the (much smaller) overhead of an extra read for a couple of slots when an operator toggles versions (upgrade or downgrade).


cpubot commented Dec 2, 2024

Great, I'll move forward with this!
