[BUG] - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker? #4228
Comments
@papacarp I note that pooltool.io/realtime is showing a surprisingly small number of reporting nodes for certain blocks (i.e. fewer than ten) - this seems unusual. Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred when the superseded block had already been received by the about-to-forge node? My working hypothesis would be some sort of connectivity partition; I'm looking for evidence that that assumption is incorrect. |
@gitmachtl - can I ask, are all these 'V7' systems deployed using the same scripts? Just seeing if there could be a common configuration issue that underlies this: |
@njd42 as far as I can tell - and I have spoken to a bunch of those SPOs - they are all running completely different systems: different deploy scripts, different locations, different kinds of hardware (VM, dedicated server, ...). The nodes shown are only the ones actively reporting to pooltool, but the height battles can be confirmed by others too. |
For what it's worth: we are running our own compiled node in Docker (own Dockerfile) on AWS ECS and migrated to 1.35.1 (too) early. I haven't done the amount of investigation that @gitmachtl has, so I can only share sparse findings for now, but what I could see is that our BP would mint a block omitting the preceding block (thus getting dropped by the next leader) even though there were sometimes 10+ slots in between. Our relays never received that preceding block (checked logs), so naturally the BP didn't either. Will keep an eye on this issue and report when I've done more checks. |
Example: block 7530726. Block 7530725 arrived at our BP:
Then our block was minted and reached our relays (three of them, each on 1.35.1, one with p2p enabled). Then 18 slots later the block from 4f3410f074e7363091a1cc1c21b128ae423d1cce897bd19478e534bb arrived with hash 3150a0d4502f1d0995c4feae3fcb93dd505671cf8dc84bbf0b761e2ee64d70dc and our relays switched fork. The number of incoming connections on our relays is healthy and our block propagation seems otherwise fine. |
@rene84 what if you have two relays, one on 1.34.x and one on 1.35.x? Does your BP receive that preceding block then? |
I think that a connectivity partition is a much more likely root cause than an issue with chain selection, given the evidence at hand. However, such a partition is highly unlikely to occur through pure random chance, which suggests some sort of trigger. If you scan through your logs looking for, say |
Has this behaviour been observed on non P2P 1.35.2 nodes, or is this limited to P2P only nodes? |
Great work @gitmachtl. I'm wondering if the issue you're reporting is somehow related to #4226. In my case, given the importance of the upgrade, I only bumped up the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. The only way to restore the connection was to bounce the relay on 1.35.1. It looks like 1.35.1 nodes eventually stop talking to 1.34.1 nodes and maybe start generating height battles? |
As far as I know the SPOs were running 1.35.1 nodes in non-P2P mode on mainnet. It is really strange that 1.35.* nodes keep sticking with the block of other 1.35.* nodes instead of picking up the resolved 1.34.* block. 1.35.2 was not used, I guess; it's just too fresh out of the box. |
Could be, but the same relays/BPs later on are producing normal blocks. |
I am not sure if we're mixing up two different issues here. I would love to hear @coot's opinion on the height-battle issues on mainnet; it's really outstanding and not just a glitch. |
We haven't seen such long height battles before; it certainly deserves attention. |
The strange thing is that, in those double height battles, the second 1.35.* node builds only on the lost block of the previous 1.35.* node. Normally it should of course pick up the block from the previous winner. And the fact that it then happened again is really bad - it creates mini forks all the time. I was on a full 1.35.1 setup at the time on mainnet, relays and BPs. I reverted back to a full 1.34.1 setup and have had no issues like that since then. I know that this is hard to simulate, because on the testnet we are now in the Babbage era and all nodes are 1.35.*. On mainnet we have the mixed situation. But SPOs are now aware of it, and that could lead to SPOs only updating at the very last moment to stay clear of such bugs. That's not a good way to go into this HF, I would say. |
What is consistent with what I see in the logs is that some blocks just never make it, even after 10 or more seconds. I've seen that both incoming and outgoing. It seems more likely to be a propagation issue in the network (e.g. 1.34.x and 1.35.x not talking) rather than a bug in the fork selection mechanism in the node itself. |
Can I ask for logs from a block producer which was connected to a |
Sure. Afk atm so I can share that in about 5 hours from now. Note that I can only help you with BP on 1.35.1 connected to a relay on 1.35.1 |
I am not sure if this will help, but please send. It's more interesting to see |
I searched for |
@renesecur you need to enable |
We would like to get logs from a |
One more favour to ask, those of you who are running |
Thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. Downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment. |
I have a day job as well... sorry... I wish I could go 100% Cardano ;-) This is clearly a "just in case" rather than an "emergency" downgrade, and I've been running it for a month now, but I will get to it eventually. |
lool sorry, Alonzo! I meant Alonzo :disappear: |
That should be easy to test by configuring a 1.34 relay to only have a single 1.35 node as its upstream peer. |
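A minimal sketch of what that test could look like, assuming the 1.34 relay uses the legacy (non-P2P) topology file; the hostname and port below are placeholders, not values taken from any of the reports above:

```json
{
  "Producers": [
    {
      "addr": "relay-135.example.com",
      "port": 3001,
      "valency": 1
    }
  ]
}
```

With only one producer entry, every block the 1.34 relay adopts must have come through the 1.35 node, which makes missing or late blocks easy to attribute.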
Additionally, a new golden test for the alonzo fee calculation has been added, using the block from: IntersectMBO/cardano-node#4228 (comment)
I did write "resolves #2936" in the notes of the ledger PR that I think fixes the bug, but I did not realize that those magic words would close issues cross-repo. I think it's best to wait for more testing before this node issue is closed. Sorry. |
This resolves node #4228:
- update plutus to the tip of release/1.0.0 (this delays the SECP256k1 builtins)
- updated changelogs and cabal files with 1.35.3
- added changelogs and cabal files updates to version 1.35.3
I want to clarify that what I said here ☝️ is incorrect. This was actually an accidental hard fork. Blocks produced by nodes with the bug would potentially not be valid according to a 1.34 node. If the bug had instead been that a higher fee was demanded, then it would have been a soft fork, since 1.34 nodes would still validate all the blocks from nodes with the bug. |
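To make the distinction concrete, here is a minimal sketch (not the actual ledger code, and with made-up fee numbers) of why the direction of the fee bug decides hard fork versus soft fork:

```haskell
-- Hypothetical fee check, standing in for the ledger's real minimum-fee rule.
validTxFee :: Integer  -- minimum fee the validating node computes
           -> Integer  -- fee the transaction actually pays
           -> Bool
validTxFee minFee paidFee = paidFee >= minFee

-- Assume the correct minimum fee for some transaction is 200000 lovelace.

-- A buggy node that under-computes the minimum (say 180000) accepts a tx paying
-- 185000 and forges a block containing it. A correct 1.34 node re-validates with
-- 200000, rejects the tx and therefore the whole block: the chain splits,
-- i.e. an accidental hard fork.
underBugAccepts, oldNodeAccepts :: Bool
underBugAccepts = validTxFee 180000 185000  -- True  (buggy forger)
oldNodeAccepts  = validTxFee 200000 185000  -- False (1.34 validator)

-- Had the bug over-computed the minimum instead (say 220000), anything the buggy
-- node accepts would still pass the old check, so 1.34 nodes would keep following
-- the chain: only a soft fork (a strict tightening of the rules).
overBugAccepts, oldNodeStillAccepts :: Bool
overBugAccepts      = validTxFee 220000 230000  -- True
oldNodeStillAccepts = validTxFee 200000 230000  -- True

main :: IO ()
main = print (underBugAccepts, oldNodeAccepts, overBugAccepts, oldNodeStillAccepts)
```

In short: a rule that accepts strictly less than before stays a soft fork; a rule that accepts something old nodes reject is a hard fork.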
A number of fixes have been added to 1.35.3; they have been added to the changelogs:
- update ledger to the tip of release/1.0.0 (this resolves node #4228)
- update plutus to the tip of release/1.0.0 (this delays the SECP256k1 builtins)
- updated changelogs and cabal files with 1.35.3
- added changelogs and cabal files updates to version 1.35.3
- bump block header protocol version: Alonzo will now broadcast 7.2 (even though we are actually moving to 7.0, this is to distinguish from other versions of the node that are broadcasting major version 7). Babbage will now broadcast 7.0, the version that we are actually moving to.
@gitmachtl is it ok to close this issue now? |
Yes, was resolved with 1.35.3. Thx. |
Strange Issue
So, I will post a bunch of pics here, screenshots taken from pooltool.io. Blockflow is from bottom->top. These are only a few examples, there are many more!
This is happening on mainnet right now with 1.35.* nodes running. There are many occasions of double & triple height battles where newer v7 1.35.* nodes pick up the wrong parent-block and try to build on another v7 block. So v6 1.34.* nodes are winning those all the time.
I personally lost 10 height battles within a 1-2 day window against v6 nodes. That was 100% of all the height battles I had.
It's always the same pattern: there is a height battle that a v7 node loses against a v6 node. If there are also two nodes scheduled for the next block and one of them is a v7 node, it picks up the wrong, lost block hash from the previous v7 node and builds on it. Of course it loses against the other v6 node, which is building on the correct block. But as you can see in the example below, this can span multiple slot heights/blocks ⚠️
This is a Vasil-HF blocker IMO, because it would lead to the situation that SPOs only upgrade to 1.35.* at the last possible moment before the HF, giving the ones staying on 1.34.1 an advantage. Not a good idea; it must be sorted out before then. QA team, please start an investigation on that asap, thx! 🙏
Here is a nightmare one, v7 built on top of another v7 node (green path):