Besu Execution timeout/missed attestations #4400
Comments
Is this potentially the same issue as #4398? |
I'm seeing timeouts and missed attestations without any JSON-RPC errors or exceptions, fwiw |
Anecdotal evidence in Discord of users experiencing this issue with very high-specced machines. Likely only tangentially tied to resource usage. |
I'm missing attestations, but I see no indication of besu not having enough resources. I have dedicated hardware (6c Ryzen, 32GB, 2TB 970 Evo Plus) that is fast enough. I filtered out the following topics from the debug log to see if I could make sense of the rest: |
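For anyone doing similar triage, here is a minimal sketch of filtering noisy logger topics out of a Besu debug log. The topic names below are hypothetical placeholders, not the ones referenced in the comment above; substitute whatever loggers are flooding your own output.

```python
# Filter noisy logger topics out of a Besu debug log so the lines around
# a missed attestation are easier to read.
# NOISY_TOPICS contains placeholder names -- adjust for your own log.
import sys

NOISY_TOPICS = {"EthPeer", "PeerDiscovery", "PingPacketData"}  # hypothetical examples

def is_noisy(line: str) -> bool:
    # Drop any line that mentions one of the noisy topics.
    return any(topic in line for topic in NOISY_TOPICS)

with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        if not is_noisy(line):
            sys.stdout.write(line)
```

Run it as `python filter_besu_log.py besu-debug.log` and read whatever remains around the missed slot.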
Could this lead to Teku logging empty slots, as mentioned in Consensys/teku#6205 ? |
I'm not sure I'm in the right place, but I'll paste my logs. Things were executing normally until this happened:
And then it just repeats that last line over and over again, never moving on to a new head block. |
validator.log Thanks |
Agree this is probably a different issue to #4398 although maybe some ppl have both issues. Can you post your timeout logs please @ColinCampbell? |
This is the output from Teku. The only thing interesting from the Besu logs at that time is a valid fork-choice-update message, but I'm getting those once every so often without corresponding timeouts in the Teku logs:
Edit: I suppose it could have something to do with Teku asking Besu for data while Besu is busy dealing with the event causing the EngineForkchoiceUpdated log? |
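One hedged way to probe that hypothesis is to time a cheap JSON-RPC call against Besu and see whether slow responses coincide with the forkchoiceUpdated events in its log. This sketch assumes the default HTTP RPC endpoint on 127.0.0.1:8545 and an arbitrary 1-second threshold.

```python
# Poll Besu's public JSON-RPC with a cheap call (eth_blockNumber) and log
# any response that takes unusually long, to check whether slow responses
# line up with fork-choice-update / block-import events in the Besu log.
# The endpoint and threshold are assumptions; adjust as needed.
import json
import time
import urllib.request

RPC_URL = "http://127.0.0.1:8545"   # default Besu HTTP RPC port (assumed)
SLOW_THRESHOLD_S = 1.0              # arbitrary cut-off for "suspiciously slow"

payload = json.dumps(
    {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
).encode()

while True:
    req = urllib.request.Request(
        RPC_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=10) as resp:
        block_hex = json.loads(resp.read())["result"]
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD_S:
        print(f"{time.strftime('%H:%M:%S')} slow eth_blockNumber: "
              f"{elapsed:.2f}s at block {int(block_hex, 16)}")
    time.sleep(1)
```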
For what it's worth, I've experienced this with a Prysm+Besu combo, both with and without the JSON-RPC error reported. Initially post-merge, with the instance still running from pre-merge, I experienced this issue with no errors that I can remember. Upon restarting (post merge) Besu and the machine it was operating on, I started experiencing the same degradation in vote performance, this time with the JSON-RPC errors coming up constantly. Linux-reported loads never exceeded 1.0; plenty of CPUs, plenty of RAM. Not sure if that's relevant. For Prysm, the beacon chain's errors started with this for about the first hour after the merge:
and progressed to being these in the following hour:
Times are in UTC |
@schonex thanks for the report. What's your attestation performance % like? |
I have added the logs at the exact moment of the missed attestation, system config and other background: |
I am sad to say that at the time, after a few hours of this abysmal performance, I switched back to geth, which I kept on standby for exactly such a case. Here are my grafana stats for the period just before the merge, post merge, and my switchover to geth (meaning it's not the system resources). Times are in UTC. |
Any update from the team? |
Here is some update from the team. We are also still working on engine_newPayloadV1 call performance. |
Is there a rough timeframe for an update? |
Unfortunately, the PR ahamlat created did not alleviate my missed attestations. Besu/Prysm over here. I hate to say it, but there may be more than one bug. But at least this first fix will narrow down the behavior to find the deeper problems. |
Same here, been running from source with #4410 for ~12hrs now, no effect on missed attestations (Besu/Nimbus). |
Hi everyone - we have narrowed down the root cause to block import performance. We are now working on improvements to that specific pipeline that should alleviate this issue. Thanks for the patience; we will potentially have some PRs to test ASAP, thank you for bearing with us. We are still targeting a fix for our scheduled Wednesday release, but will have more to share soon. |
Not to beat a dead horse, but wanted to chime in that post-22.7.3, on a new Intel i7 w/ 32GB RAM and 2TB M.2 NVMe, I'm still missing attestations. |
You’re not the only one @EvilJordan, |
I'm on 22.7.4 now and missing more attestations than I did with 22.7.2 and 22.7.3.
Does this bug also affect the sync committee? I'm missing about 1/4 of all slots. Not seeing anything particularly interesting in the logs. 10-core AMD, 40 GB RAM, VM running rocketpool. |
Still missing attestations on a daily basis. Looking at logs with the Lighthouse folks, we found that observation times were high, which correlated with the import times reported in the Besu logs as well. Here's the Lighthouse log analysis: |
I’m experiencing exactly this. No errors/warnings in Besu logs, but can see import delays of up to 6 seconds which cause roughly 10 missed attestations per day. |
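A rough sketch of the log check several commenters describe: scan a Besu log for block-import lines and flag the slow ones. The `Imported ... in N.NNNs` pattern is an assumption about the log format and may need adjusting for a given Besu version.

```python
# Scan a Besu log for block-import lines and report any import that took
# longer than a threshold, to correlate with missed attestations.
# The "Imported ... in 4.513s" shape is assumed/approximate; adjust the
# regex to match the exact output of your Besu version.
import re
import sys

IMPORT_RE = re.compile(r"Imported\b.*?\bin (\d+(?:\.\d+)?)s", re.IGNORECASE)
THRESHOLD_S = 2.0  # anything slower than this is worth a closer look

slow = []
with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        match = IMPORT_RE.search(line)
        if match and float(match.group(1)) > THRESHOLD_S:
            slow.append((float(match.group(1)), line.rstrip()))

# Print the slowest imports first.
for seconds, line in sorted(slow, reverse=True):
    print(f"{seconds:6.2f}s  {line}")
```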
Post v22.10.0 I decided to give besu another chance; unfortunately I am still missing attestations daily. My besu version:
❯ besu --version
besu/v22.10.0/linux-x86_64/openjdk-java-17
2022-11-10 13:56:59.771-05:00 | main | INFO | Besu | Using jemalloc
Right after the merge I was facing this issue due to relatively long block import times: on blocks that were >250 tx (or high gas usage), some imports were taking north of 4-5 seconds, where there's a clear connection to the missed attestations. This doesn't seem to be the case anymore. I've collected all the logs of the missed slots from the last few days and, aside from one block which took 7 seconds, they all seem to have relatively decent import times, so I'm not sure why this continues to happen; the logs are otherwise squeaky clean:
My config:
sync-mode = "X_CHECKPOINT"
network = "MAINNET"
p2p-enabled = true
p2p-port = 30303
max-peers = 100
host-allowlist = [ "localhost", "127.0.0.1", "::1" ]
rpc-http-enabled = true
rpc-http-port = 8545
rpc-http-host = "127.0.0.1"
rpc-http-cors-origins = [ "*" ]
rpc-http-api=["ADMIN","ETH","NET","DEBUG","TXPOOL","WEB3"]
rpc-ws-enabled = true
rpc-ws-port = 8546
rpc-ws-host = "0.0.0.0"
engine-jwt-enabled = true
engine-jwt-secret = "<redacted>"
engine-rpc-enabled = true
engine-rpc-port = 8551
engine-host-allowlist = [ "localhost", "127.0.0.1", "::1" ]
graphql-http-enabled = true
graphql-http-host = "0.0.0.0"
graphql-http-port = 8547
graphql-http-cors-origins = [ "*" ]
#metrics-enabled = true
#metrics-port = 9545
#metrics-host = "0.0.0.0"
data-storage-format = "BONSAI"
data-path = "/var/lib/besu/mainnet"
Xplugin-rocksdb-high-spec-enabled = true |
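Since the metrics options in that config are commented out, one optional diagnostic is to enable them and inspect the Prometheus endpoint for block/import-related timers. The sketch below assumes the metrics port from the commented-out lines (9545); it does not assume any specific Besu metric names, it only filters whatever the node exposes.

```python
# Fetch Besu's Prometheus metrics endpoint (requires metrics-enabled = true)
# and print any metric lines whose names mention blocks or imports.
# No specific metric names are assumed; this just greps what is exposed.
import urllib.request

METRICS_URL = "http://127.0.0.1:9545/metrics"  # matches the commented-out metrics-port above
KEYWORDS = ("block", "import")

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    for raw in resp.read().decode("utf-8", errors="replace").splitlines():
        if raw.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if any(keyword in raw.lower() for keyword in KEYWORDS):
            print(raw)
```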
@ibhagwan Could you please share your hardware spec as well? |
I’m using an Optimized Cloud Compute instance from Vultr with 4 dedicated vCPUs, 16GB RAM and a 1TB dedicated NVMe block device. Edit: another thing worth mentioning is that the missed attestations seem to be getting worse with uptime. It starts at 97-99% effectiveness with a few misses per day and drops to about ~92% effectiveness; this has already happened twice, and after a restart the situation is clearly better. |
Closing for known performance work elsewhere. |
Description
Basic story, details to come - essentially Besu is constrained in execution time and is hitting CL timeouts that are causing it to miss attestations. This is our working theory, with supporting evidence. The team will update with more below.