Conversation
Can you explain what "storage chain" means? Why is IPFS involved? What purpose does it serve, i.e. what is the threat model?
The idea is to maximize on-chain storage efficiency. Here's Gav's writeup on the feature: v1.0: Finite history direct storage chain:
So, to be sure, the PR introduces:
- an alternate block storage mode config (not switchable) that allows addressing extrinsics directly by their hash.
- pruning for block bodies (in both modes).
Wondering a bit about addressing by something like 'BlockID ++ Extrinsic index' instead of the extrinsic hash.
But I guess it doesn't make sense if the extrinsic/transaction hash is going to be replaced with an IPFS address and support ref counting?
The use of 'transaction' in naming and descriptions instead of 'block extrinsic' confused me a little, but it is probably related to the targeted functionality (which I am unaware of).
/// State pruning settings.
pub state_pruning: PruningMode,
/// Block pruning settings.
pub keep_blocks: KeepBlocks,
'block_pruning'? 'body_pruning'? For naming consistency.
Unlike state pruning, the only setting here is the number of blocks to keep, so I went with the more straightforward name.
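For reference, a minimal sketch of what the setting looks like (the Some variant shows up in the pruning code further down; the All variant and the comments are my reading of it):

/// Block body pruning setting (sketch; the exact definition may differ).
pub enum KeepBlocks {
	/// Keep bodies of all blocks (no pruning).
	All,
	/// Keep only bodies of the last N finalized blocks.
	Some(u32),
}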
let mut hashes = Vec::with_capacity(body.len());
for extrinsic in body {
	let extrinsic = extrinsic.encode();
	let hash = HashFor::<Block>::hash(&extrinsic);
Is there some collision possible here? Two identically encoded extrinsics in two different blocks.
Edit: just viewed the 'ref counted' comment in the PR description; just not 100% sure it is related.
Collisions will be handled by reference counting, indeed. ParityDb should use its internal reference counting mechanism, and for RocksDB there will be an explicit counter.
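As a self-contained illustration of the explicit-counter idea (a toy model, not the PR's RocksDB code): inserting an already-present transaction bumps its counter, and a removal only deletes the data once the counter drops to zero.

use std::collections::HashMap;

/// Toy reference-counted transaction store.
struct RefCountedStore {
	/// content hash -> (data, reference count)
	entries: HashMap<Vec<u8>, (Vec<u8>, u32)>,
}

impl RefCountedStore {
	fn new() -> Self {
		Self { entries: HashMap::new() }
	}

	/// Re-inserting the same hash increments the counter instead of
	/// duplicating the data.
	fn insert(&mut self, hash: Vec<u8>, data: Vec<u8>) {
		self.entries
			.entry(hash)
			.and_modify(|(_, count)| *count += 1)
			.or_insert((data, 1));
	}

	/// Removing decrements the counter; the data is only dropped when
	/// no block references it any more.
	fn remove(&mut self, hash: &[u8]) {
		let drop_entry = match self.entries.get_mut(hash) {
			Some((_, count)) => {
				*count -= 1;
				*count == 0
			}
			None => false,
		};
		if drop_entry {
			self.entries.remove(hash);
		}
	}
}

This is why two identically encoded extrinsics in different blocks are fine: pruning one block decrements the count without destroying the copy the other block still references.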
Where is this explicit counter?
Or do you mean this counter is added automatically by kvdb?
Not implemented yet. This will require a new extrinsic format. Will be added in a future PR.
Why does this require a new extrinsic format?
The native code will have to distinguish between extrinsics that actually contain new data and the ones that reference a previous extrinsic. This will probably require some special version of OpaqueExtrinsic.
Quoting Gav:
so one way would be introducing a new kind of extrinsic which is like a "by-reference" extrinsic.
and assumes that the node either already stores the preimage or that it can easily find it from other nodes.
this might also be useful for block propagation and alleviate the need to gossip the entire block if you know most of your peers already have the transactions in it
Also according to Gav, this should not be handled by the runtime, but rather by the database/networking layer.
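For concreteness, a hypothetical shape for such an extrinsic (illustration only; the actual format was deferred to a future PR):

use parity_scale_codec::{Decode, Encode};

/// Hypothetical storage-chain extrinsic format (not from this PR).
#[derive(Encode, Decode)]
enum StorageChainExtrinsic<Hash> {
	/// Carries new user data; stored in the database under its content hash.
	Data(Vec<u8>),
	/// References previously stored data by hash, e.g. a renewal that bumps
	/// the reference counter instead of resubmitting the bytes.
	ByReference(Hash),
}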
Ahh okay, makes sense.
client/db/src/lib.rs
Outdated
-fn new(db: Arc<dyn Database<DbHash>>) -> ClientResult<Self> {
+fn new(
+	db: Arc<dyn Database<DbHash>>,
+	transaction_storage: TransactionStorage
Suggested change (add the trailing comma):
transaction_storage: TransactionStorage,
The 'TransactionStorage' name confused me a little; my first thought was that it was a specific storage instance.
Maybe a 'Mode' or 'Config' naming variant, or 'StoreTransaction' (similar to 'KeepBlocks'), could be an alternative.
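For readers following along, the name that appears further down in the diff is TransactionStorageMode; the comments here are my interpretation of the two variants:

pub enum TransactionStorageMode {
	/// Classic layout: the block body is stored as one encoded blob.
	BlockBody,
	/// Storage-chain layout: the body is a list of transaction hashes; the
	/// transactions themselves live in a separate, reference-counted column.
	StorageChain,
}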
client/db/src/lib.rs
Outdated
if finalized >= keep.into() {
	let number = finalized.saturating_sub(keep.into());
	match read_db(&*self.storage.db, columns::KEY_LOOKUP, columns::BODY, BlockId::<Block>::number(number))? {
		Some(body) => {
Is there a use case where we want to remove the transactions but not the body reference? (From the description, not really.)
Don't think so
Right, IPFS content is addressable by content hash. And the idea with renewal transactions is that they put the same content on the chain by increasing the reference counter.
My understanding is that we use the term "transaction" for user-generated pieces of the block body, and that "extrinsic" should not be user facing. So I've used "transaction" in the database layer to underline the fact that this is intended for storing blocks of user data, addressable by data hash.
This makes me a bit curious about the purpose of this 'Finite history direct storage chain'.
I am not really sold on using a different name (transaction) in the context of this storage mode.
Anyway, the code looks good to me.
What does it store? And why? Ignoring parachains' internal architecture, there are three storage security models proposed for Polkadot:
If I understand, you've designed an archival-ish storage system with zero security guarantees, meaning PVFs cannot ever depend upon data stored here, yes? In other words, anytime a PVF wants data stored here, it must slurp said data back into its own block, which then makes the data available for validators. If so, why should this require Polkadot integration? Also, you expire data after 7 days. Why archive for such a short time? You're doing so as a parachain. Why? As a parachain, this pushes its data into the availability store, meaning this costs a parachain exactly what placing data into its own blocks costs.
What does this mean?
So there is zero user data in these blocks? Only IPFS addresses, for which storage nodes store the data? If so, yes, this makes sense as a storage model. :) Why would this require any special polkadot integration though? We cannot use this data from polkadot directly, so why not do everything inside the parachain's runtime? Is the 7 days just an accounting feature?
What does this mean?
I'm confused: A priori, I'd envision this being some parathread, so surely paying the parathread fee mostly suffices, no? Are we talking huge amounts of data here? Like what?
Is the block author hosting all the user data? If not, they're not doing much work, so no reason to pay them. We've no mechanism to enforce they serve the data obviously, but that's fine since we'll never depend upon the data.
Who are these? Not Polkadot full nodes obviously. It's whoever joins this parachain?
That's useful. Sorry that got long. In short, I'm unsure why this requires any polkadot integration, given that polkadot cannot depend upon archival-like guarantees. Just fyi, we'll later expand availability store usage for required data: at present, anytime a parachain calls set_code we require that all polkadot nodes see its code immediately. In theory, parachains could post_code with their own parachain candidate block, which then stores the new code into the erasure-coded availability store. After we finalize the relay chain block including the post_code parachain candidate, the parachain could switch_code into the new code. At this point, we've perhaps only one polkadot validator who possesses the code. Yet anyone who requires this code could fetch it like approval checkers fetch the blocks they check, thanks to the availability store. We'll require this for hierarchical/multiple relay chain operation, and maybe it'd make polkadot more efficient even now, but currently we're hoping code churns slowly enough to make it worth storing parachain code on the relay chain.
@burdges This PR does not mention polkadot or parachains anywhere; it adds generic support for such chains in substrate. The idea is to store user data not in the state merkle tree, but rather as block bodies. The runtime only keeps track of data hashes but has no access to the data itself. This way storage would be far less expensive. @gavofyork can probably explain the rationale better than myself. Users send data as transaction payload, which is added to the block database (hosted by all full nodes) and may be retrieved by CID (data hash) over IPFS.
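To make the division of labour concrete (a sketch, mirroring the HashFor usage quoted earlier in this thread): the state trie keeps only the content hash, which doubles as the retrieval key for the payload sitting in the block database.

use sp_runtime::traits::{Block as BlockT, Hash, HashFor};

/// The content address of a piece of user data. Runtime state stores only
/// this hash; the payload itself lives in the block body and can later be
/// fetched by the same hash (as a CID, over IPFS).
fn content_address<Block: BlockT>(payload: &[u8]) -> Block::Hash {
	HashFor::<Block>::hash(payload)
}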
Alright, sounds fine then. I'm still unclear on why this needs special support, but anyways it breaks nothing afaik. :)
What purpose does the PoW serve here? Any block production constraint suffices, no?
@burdges right. The initial implementation will use existing consensus, at least. Not sure why @gavofyork mentioned PoW there, but it is out of scope for now.
Special support would be in the database/networking layer, to enable querying and reference-counting individual transactions.
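Concretely, querying an individual transaction then becomes a single key-value lookup (sketch; the column index is hypothetical and the Database trait bounds are assumed):

use std::sync::Arc;
use sp_database::Database;

/// Hypothetical index of the transactions column added by this PR.
const TRANSACTION_COLUMN: u32 = 11;

/// Fetch one transaction by content hash, without decoding any block body.
fn transaction_by_hash<H: Clone + AsRef<[u8]>>(
	db: &Arc<dyn Database<H>>,
	hash: &H,
) -> Option<Vec<u8>> {
	db.get(TRANSACTION_COLUMN, hash.as_ref())
}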
Looks good to me. Mostly nitpicks.
We should make sure that this appears in the next polkadot release notes, as information for users that the db format changes.
client/db/src/utils.rs
Outdated
@@ -327,6 +327,23 @@ pub fn read_db<Block>(
	})
}

/// Remove database column entry for the given block.
pub fn remove_db<Block>(
I see that this naming follows the read_db function above, but it is really confusing to me IMHO. Isn't this more a remove_block_from_db or similar? If I understand this correctly, this will remove a block entry from the db?
It is generic and may remove anything that is referenced with col_index, e.g. justifications. I'll rename it to remove_from_db.
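The generic shape in miniature (a toy stand-in, not the actual utils.rs code): resolve the block id to its lookup key, then delete from whichever column is targeted: bodies, justifications, and so on.

use std::collections::HashMap;

/// Toy column-oriented database: column index -> (key -> value).
type Columns = HashMap<u32, HashMap<Vec<u8>, Vec<u8>>>;

/// Generic removal: resolve the lookup key for `block_key` from the
/// key-lookup column, then remove the matching entry from the target column.
fn remove_from_db(db: &mut Columns, lookup_col: u32, col: u32, block_key: &[u8]) {
	let key = db
		.get(&lookup_col)
		.and_then(|lookup| lookup.get(block_key).cloned());
	if let Some(key) = key {
		if let Some(column) = db.get_mut(&col) {
			column.remove(&key);
		}
	}
}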
client/db/src/lib.rs
Outdated
if let KeepBlocks::Some(keep_blocks) = self.keep_blocks {
	// Always keep the last finalized block
	let keep = std::cmp::max(keep_blocks, 1);
	if finalized >= keep.into() {
Suggested change (invert the check and return early):
if finalized < keep.into() {
	return Ok(())
}
Then we don't require that much indentation :D
client/db/src/lib.rs
Outdated
	columns::BODY,
	BlockId::<Block>::number(number)
)?;
debug!(target: "db", "Removing block #{}", number);
Should be moved above the remove_db call, because that one could fail and we would not see this log message.
client/db/src/lib.rs
Outdated
match self.transaction_storage {
	TransactionStorageMode::BlockBody => {},
	TransactionStorageMode::StorageChain => {
		match Vec::<Block::Hash>::decode(&mut &body[..]) {
			Ok(hashes) => {
				for h in hashes {
					transaction.remove(columns::TRANSACTION, h.as_ref());
				}
			}
			Err(err) => return Err(sp_blockchain::Error::Backend(
				format!("Error decoding body list: {}", err)
			)),
		}
	}
}
Suggested change (replace the match with an if let):
if let TransactionStorageMode::StorageChain = self.transaction_storage {
	match Vec::<Block::Hash>::decode(&mut &body[..]) {
		Ok(hashes) => {
			for h in hashes {
				transaction.remove(columns::TRANSACTION, h.as_ref());
			}
		}
		Err(err) => return Err(sp_blockchain::Error::Backend(
			format!("Error decoding body list: {}", err)
		)),
	}
}
Would really prefer to keep match here, for the same reason match is required to be exhaustive in Rust: a code update will be required in case a new variant is added to the enum.
Co-authored-by: Bastian Köcher <[email protected]>
bot merge
Trying merge.
Is there a general-level tracking issue for the storage chain? I think it's very helpful for our use case: we need to store a series of "snapshots" of the encrypted confidential contract states and do re-encryption periodically. Since they are just snapshots, only the recent ones are actually useful, and therefore we would like to just prune the older snapshots by default for all full nodes. It can make our blockchain much, much lighter and more efficient.
Initial groundwork towards supporting storage chains in substrate. This includes transactions addressable in the database and block/transaction pruning.
Future PRs will include:
polkadot companion: paritytech/polkadot#2253