ClusterSlots uses too much memory #17789
Comments
@carllin @behzadnouri What do you guys think?
Yeah, let me look into this, since it was brought up in #14366 (comment) as well.
But the bit-vec idea is promising. I will give it a shot.
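For reference, a minimal sketch of what the bit-vec idea could look like; the names NodeSlots and ClusterSlotsCompact are hypothetical, not the actual ClusterSlots types. The idea is one bit vector per node, indexed by offset from the current root, so each advertised slot costs a single bit per node instead of an entry in a per-slot hash map.

```rust
use std::collections::HashMap;

/// Hypothetical per-node slot set: a bit vector indexed by (slot - root),
/// so memory grows with the slot range, not with the number of
/// (slot, pubkey) pairs.
struct NodeSlots {
    root: u64,      // slots below this are discarded
    bits: Vec<u64>, // bit i set => node advertised slot (root + i)
}

impl NodeSlots {
    fn new(root: u64) -> Self {
        Self { root, bits: Vec::new() }
    }

    fn insert(&mut self, slot: u64) {
        if slot < self.root {
            return; // older than root: nothing to record
        }
        let offset = (slot - self.root) as usize;
        let (word, bit) = (offset / 64, offset % 64);
        if word >= self.bits.len() {
            self.bits.resize(word + 1, 0);
        }
        self.bits[word] |= 1u64 << bit;
    }

    fn contains(&self, slot: u64) -> bool {
        if slot < self.root {
            return false;
        }
        let offset = (slot - self.root) as usize;
        let (word, bit) = (offset / 64, offset % 64);
        self.bits.get(word).map_or(false, |w| w & (1u64 << bit) != 0)
    }
}

// One bit vector per node instead of one pubkey -> stake map per slot.
type Pubkey = [u8; 32]; // stand-in for solana_sdk::pubkey::Pubkey
type ClusterSlotsCompact = HashMap<Pubkey, NodeSlots>;
```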
Yeah, I was also thinking you could probably page out: […]
Testing on TDS, it seems like dropping unstaked nodes does not help much. Cluster slots still has […], and with the lowest number of staked nodes (from the same snapshot): […]
This seems to be because of crds values from nodes with different shred-versions.
excludes epoch-slots from nodes with unknown or different shred version (#17899): Inspecting the TDS gossip table shows that crds values of nodes with different shred-versions are creeping in. Their epoch-slots accumulate in ClusterSlots, causing bogus slots very far from the current root which are not purged, and so ClusterSlots keeps consuming more memory: #17789, #14366 (comment), #14366 (comment). This commit updates ClusterInfo::get_epoch_slots and discards entries from nodes with an unknown or different shred-version. Follow-up commits will patch gossip not to waste bandwidth and memory on crds values of nodes with a different shred-version.
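Roughly, the filtering step this commit describes could look like the sketch below; the types and field names are simplified stand-ins for the crds/ClusterInfo structures, not the actual API.

```rust
use std::collections::HashMap;

type Pubkey = [u8; 32]; // stand-in for solana_sdk::pubkey::Pubkey

struct EpochSlots {
    from: Pubkey, // origin node of this epoch-slots value
    // compressed slot ranges elided
}

/// Keep only epoch-slots whose origin advertises the same shred-version as
/// this node; entries from unknown or mismatched nodes are discarded so
/// they never reach ClusterSlots.
fn filter_epoch_slots(
    epoch_slots: Vec<EpochSlots>,
    shred_versions: &HashMap<Pubkey, u16>, // origin -> advertised shred-version
    self_shred_version: u16,
) -> Vec<EpochSlots> {
    epoch_slots
        .into_iter()
        .filter(|es| shred_versions.get(&es.from) == Some(&self_shred_version))
        .collect()
}
```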
Reopening this since […]
Checking on testnet, all the erroneous epoch-slots are from: […]
Probably the node does not halt because […]. However […]
On testnet: […]
On mainnet: […]
#19190 should mitigate the above situation, where a node is using the same identity key for the testnet and mainnet clusters. Also announced on Discord for those nodes to change their identity keys.
Cross-cluster gossip contamination is causing the cluster-slots hash map to contain a lot of bogus values and consume too much memory: #17789. If a node is using the same identity key across clusters, then these erroneous values might not be filtered out by the shred-version check, because one of the variants of the contact-info will have a matching shred-version: #17789 (comment). The cluster-slots hash map is bounded and trimmed at the lower end by the current root. This commit also discards slots that are epochs ahead of the root.
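A sketch of the trimming described above, with stand-in types rather than the real ClusterSlots internals: bound the map below by the root and drop anything more than an epoch ahead of it, so far-future slots from cross-cluster contamination cannot accumulate.

```rust
use std::collections::{BTreeMap, HashMap};

type Slot = u64;
type Pubkey = [u8; 32]; // stand-in for solana_sdk::pubkey::Pubkey
type SlotPubkeys = HashMap<Pubkey, u64>; // nodes that have the slot -> their stake

/// Keep only slots in (root, root + slots_in_epoch]: anything at or below
/// the root is already finalized, and anything a whole epoch ahead is
/// almost certainly bogus (e.g. cross-cluster gossip contamination).
fn trim_cluster_slots(
    cluster_slots: &mut BTreeMap<Slot, SlotPubkeys>,
    root: Slot,
    slots_in_epoch: Slot,
) {
    cluster_slots
        .retain(|&slot, _| root < slot && slot <= root.saturating_add(slots_in_epoch));
}
```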
Problem
Proposed Solution
Can we reduce CLUSTER_SLOTS_TRIM_SIZE? Or find a more efficient way to store this information, e.g. use a map of pubkey -> BitVec or compressed BitVec?
Page out to disk in files or blockstore?
Consumers:
replay stage -> propagation stats
repair service -> pruning slots
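For context on these consumers, a hedged sketch of the kind of query they run against this structure (stand-in types, not the actual replay or repair code): given the per-slot pubkey -> stake map, compute what fraction of stake has advertised a slot, which replay can use for propagation stats and repair can use to decide which slots still need repairing.

```rust
use std::collections::{BTreeMap, HashMap};

type Slot = u64;
type Pubkey = [u8; 32]; // stand-in for solana_sdk::pubkey::Pubkey

/// Stake-weighted fraction of the cluster that has advertised `slot`
/// through epoch-slots gossip.
fn propagated_stake_fraction(
    cluster_slots: &BTreeMap<Slot, HashMap<Pubkey, u64>>,
    slot: Slot,
    total_stake: u64,
) -> f64 {
    let confirmed: u64 = cluster_slots
        .get(&slot)
        .map(|nodes| nodes.values().sum())
        .unwrap_or(0);
    confirmed as f64 / total_stake.max(1) as f64
}
```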