-
Notifications
You must be signed in to change notification settings - Fork 870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing support for emptyBlockPeriodSeconds in QBFT (Issue #3810) #6965
Conversation
…ledger#3810) Signed-off-by: Antonio Mota <[email protected]>
Hi @siladu and @matthew1001 can youse please review this one? Cheers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more comments. Was there any follow up to my earlier comment about what the behaviour should be if a new TX arrives during an empty block interval? I'm not sure waiting the entire remaining empty block interval to include the TX in a block is the right thing to do.
config/src/main/java/org/hyperledger/besu/config/JsonBftConfigOptions.java
Show resolved
Hide resolved
.../common/src/main/java/org/hyperledger/besu/consensus/common/bft/MutableBftConfigOptions.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/hyperledger/besu/consensus/qbft/QbftBlockHeaderValidationRulesetFactory.java
Outdated
Show resolved
Hide resolved
consensus/qbft/src/main/java/org/hyperledger/besu/consensus/qbft/statemachine/QbftRound.java
Show resolved
Hide resolved
Thanks for updating the PR @amsmota I've left a few more comments, the main one being about what the behaviour is when a new TX arrives during the empty block period? |
Hi @matthew1001 , thanks for that. In relation to that comment above I did reply in the previous PR, as per my tests everithing looks OK, please look there the test log. I'll review your new comments ASAP too. Cheers. |
Signed-off-by: amsmota <[email protected]>
Signed-off-by: amsmota <[email protected]>
I've just pulled your latest commits on the PR @amsmota and I'm still seeing the behaviour of a new block not being produced if a TX arrives during the empty block period. E.g. if I have 2 second block interval, 30 second empty block interval the behaviour is:
I'm not sure that's the right behaviour. I think it probably ought to be something like:
It would be good to see what others think though. @siladu any thoughts? |
It is not the behaviour I see in my tests. Did you saw the comment I mentioned above?
Look at non-empty block 52, it was produced after 5 seconds of the the last empty block and then the next one 30 seconds after (an empty-block). This was with
Can you show me the genesis config you used? And if you can post some logs please do. Anyway, that test was 2 weeks ago, I'll repeat the test asap...
|
Signed-off-by: amsmota <[email protected]>
Signed-off-by: amsmota <[email protected]>
Signed-off-by: amsmota <[email protected]>
Signed-off-by: amsmota <[email protected]>
Sorry I couldn't find your previous comment, but that's great if that's the current behaviour. I think I pulled and re-built with your latest changes but I'll check. Here's the QBFT config I used:
I was using a single node to test though. It's possible that just using a single node gives different behaviour? |
Please note that I'm not dismissing your findings, as I said the tests I did (my original post is in the Issue itself) were done 2 weeks ago, there were some merges and commits by now. I will test again as soon as I can, and I will test the single-node scenario. I suspect that won't behave as intended, at least considering that having my 4 nodes network with 1 node not correctly configured gave me all those issues I mentioned above. My local network of 4 nodes was created following this page. The rest of your comments are now all implemented too. I see below those 4 workflows awaiting approval, what should I do to be able to run them? |
Just tested again the emnpty/non-empty block times, it seems good to me. It's based on
The block that contains 1 tx is the line 5: so the flow is 1: --> 30s --> 2: --> 30s --> 3: --> 30s --> 4: --> 10s --> 5: --> 30s --> 6: --> 30s --> 7: --> 30s --> 8:
Anyway, to be sure is better if somebody else confirms this data. |
One thing I notice in those logs above, in the previous logs I copied from tests all(most) blocks, profuced or imported, appeared on the QbftBesuControllerBuilder output, which is not the case now:
I wonder if this is expected ot it may indicate some issue... I actually added some BPs on QbftBesuControllerBuilder and PersistBlockTask to check the upflow but I couldn't see any logic on the diference... |
I've just approved them to run, so they're all going now |
I'll pull your latest commits today and try with a 1 node and 4 node network, but given your findings I expected it's working fine now. |
I'm afraid I found a issue when testing with one node only but unfortunately I couldn't replicate again yet. It seems it happens with 2 non-empty blocks in a row:
Block 74 is created correctly but the next one is not, it should take 10 seconds and it took 30... It's this what you observed @matthew1001 ? I'm testing with remix and metamask, and it's hard to send 2 transactions in a row, is there a more expedite way to do it? Cheers. |
Signed-off-by: amsmota <[email protected]>
I've just pulled your latest changes and re-tested and I do see the same issue still. I've captured some log snippets of Besu with timestamps and a hardhat script with timestamps to can see the timings. This is a 4 validator network with 2 second block period, 30 second empty block period: Here are the Besu logs:
Here is the hardhat script which deploys 10 contracts and prints the timestamp before and after:
So by |
Hi @matthew1001 , that seems different from what I saw in my one-node test, but it's hard for me to test correctly with only remix and metamask. I'm not very familiar with that hardhat, even if I;m installing right now.Can you send me that script so I can test the exact same way as you? Thanks.
|
Here's a test I did today with a one-node network based on a 5/10 seconds intervals. Block 3 (the first non-empty) took `10 seconds when it should take 5. However, all the other non-empty blocks seem OK.
I'll try again to check if that 1st non-empty block is reproducible... |
Here's my latest test with 5/10 seconds, one-node network. I logged some messages to try and understand the issue.
You'll notice that the 2 INCORRECT ones have one thing in common, the 2 QbftBlockHeightManager lines have the first saying
So after staring at the screen and the occasionally banging my head on the table, I remembered the phrase of that Vulcan guy (or was it Sherlock Holmes?) "When you eliminate the impossible, whatever remains, however improbable, must be the truth" and the only explanation I can find is this: AFAICS the way this works is:
The above is what's happening at the moment, with no emptyBlockPeriodSeconds... With it will be:
So, the only explanation for those INCORRECT blocks is that they got their transaction between the 5th and the 10th second after the previous block... It's only logical... If this explanation is correct, then it's very bad because there is no way to correct it because, well, there's nothing to correct, it's just the way it is... There may be some solution but nothing that comes to mind now... Or I may be wrong, of course... What do you guys think? P.S. I wonder if we can create a PendingTransactionAddedListener in the QbftBlockHeightManager, that may do the trick... |
Hi @siladu and @matthew1001 and everybody, I need some help here related to my last comment, at the moment I think this PR is working correctly (if my findings on my last post are correct), I have the idea of having a mempool listener in the QbftBlockHeightManager but I'm afraid that even if that works it'll add too much complexity. Just to recall, the only explanation for the INCORRECT blocks is that the non-empty-bock was created after 10 seconds and not 5 is because the block got the transaction(s) between the 5th and the 10th second after the previous block... Comment PLEASE... |
Thanks for doing some extra testing @amsmota At the moment I'm going to find it hard to put the time in to diagnose things in detail. I'm wondering if we want to consider putting the feature in documented as experimental since the basic behaviour seems to work and the exact behaviour in certain situations just needs sorting out. The main thing would be to ensure that we're confident nothing has regressed in the basic case (i.e. when the user hasn't set @siladu what are your thoughts? @non-fungible-nelson do you have a view on whether this could be released as experimental to get more people trying it out and more focus over the coming releases on ironing out any remaining issues? |
Yeah, I totally understand that, and I thank you for your valuable time. 💯
Well, I don't know what exactly implies being "experimental" but I agree with it...
I'm pretty sure that there is no problem there, but looking at the code I do see that there are no provision or validation for that, I may change it to make it explicitly, and anyway I'm going to test that scenario more exhaustively
That part of having more people trying it out I this it's crucial. Being tested by only a couple of persons is too limited, we really need a broader area of testing... Anyway, I'm going to keep on testing, going to create a "standard" test to send transactions in a predefined way, not randomly like I'm doing now, and improve my 4 nodes network because it's very difficult to aggregate meaningful info from all nodes... Thank youse all for your support. |
I don't have permission to push commits to your branch, but I've pushed your branch + a merge commit to my repo. If you take a look at c38970d you should be able to duplicate or cherry-pick that single commit into your branch. Also you currently have a bunch of commented-out tests in the PR? |
Yeah, I commented on one of your suggestions above. for you to review
|
Ah yes, apologies. I think if they're not needed they can be removed then. And any functions that are currently only called by tests can probably be removed as well. Regarding the IBFT branch, I don't see any changes in the IBFT-specific code. Personally I'm OK with just concentrating on QBFT for now as it's the more commonly used protocol, and will be enough for the feature to be tested out before changing it from experimental in the future. |
…radle props. Signed-off-by: amsmota <[email protected]>
Ok, just pushed with
Please review your change but I think it's good... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - excellent job on a challenging PR!
Could you add a CHANGELOG
entry please? Then I think it's good to merge. I've approved so if you add one feel free to merge the PR.
I just added it... 👍 There's still this message, can you check it?
Cheers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amsmota - just putting a blocker on this so it doesn't prematurely get merged before I get chance to re-review the latest changes, which I will do ASAP.
My most recent outstanding list was this:
#6965 (comment)
I believe you've resolved item (2) and (3), but I'm still not sure about item (1).
If this issue turns out to be limited to the experimental feature, I would be happy to merge the PR even if it requires a follow-up fix, however I just want to double check it's not a regression when the feature is disabled.
Also thanks @matthew1001 for taking the time to complete the review :)
@amsmota I approved the CI run and there are some NPEs in the unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you've resolved item (2) and (3), but I'm still not sure about item (1).
@amsmota I've retested the test case that was concerning me:
When no txs and round timer expires, node2 proposes
- missing QbftBlockHeightManager | Round change from 0x... log messages
and the latest code appears to be working as I would expect now 🎉
At least, I can't detect any difference in round change behaviour/logs compared to the main branch (other than the expected empty block timings).
I want to echo @matthew1001's congrats on sticking with this PR and getting it through! It has also been a weight from my shoulders as I wasn't able to complete this before I had to move on to others tasks :)
I would be happy to merge this in once the unit tests are passing.
...ensus/qbft/src/test/java/org/hyperledger/besu/consensus/qbft/statemachine/QbftRoundTest.java
Show resolved
Hide resolved
Signed-off-by: amsmota <[email protected]>
Signed-off-by: amsmota <[email protected]>
Signed-off-by: amsmota <[email protected]>
Hi @siladu, thanks for that, it was quite strange, I implemend that test back in April but for some reason one of the merges decided to drop one line (only one, line 160 below!!!) that caused this NPE now... I just hope there are no missing lines elsewhere... And thank you for your review and approval... 👍 |
@amsmota I have enabled auto-merge. I don't have permission to merge main into your branch so you'll have to do the final step please...and actually I'll have to reapprove the CI build |
I just did a merge from main, I hope nothing's wrong... Cheers. |
…ledger#3810) (hyperledger#6965) Implemented support for emptyBlockPeriodSeconds in QBFT (Issue hyperledger#3810) Introduces experimental xemptyblockperiodseconds genesis config option for producing empty blocks at a specific interval independently of the value of the existing blockperiodseconds setting. hyperledger#3810 --------- Signed-off-by: Antonio Mota <[email protected]> Signed-off-by: amsmota <[email protected]> Signed-off-by: Wolmin <[email protected]>
PR description
Implementing
xemptyblockperiodseconds
for producing empty blocks at a specific interval independently of the value of the existing blockperiodseconds settingFixed Issue(s)
#3810
Thanks for sending a pull request! Have you done the following?
doc-change-required
label to this PR if updates are required.Locally, you can run these tests to catch failures early:
./gradlew build
./gradlew acceptanceTest
./gradlew integrationTest
./gradlew ethereum:referenceTests:referenceTests