Alastria Network · Full sync failed · GoQuorum v2.6.0 (and following) #1107
Comments
@alejandroalffer I see references to go1.15.2, which makes me presume you have been compiling from source. Could you please instead use either
Then retry and let us know if the issue persists. Thanks |
Thanks for the feedback, @nmvalera! In fact, the current geth version is compiled from source, in a Dockerized ubuntu:18.04
Downloading geth as proposed:
It returns the same issue:
I'll try:
|
Hi! The problem persists while full syncing using the provided binary :-( . Using fast mode, everything finishes correctly
This was a fresh database, after "geth removedb --datadir /root/alastria/data_DONOTCOPYPASTER" and "geth --datadir /root/alastria/data init /root/genesis.json", and restoring the original enode key.
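For reference, a minimal sketch of that clean re-initialisation sequence, assuming the datadir and genesis paths used in this thread and geth's default nodekey location inside the datadir:

```bash
#!/bin/sh
# Sketch only - paths are the ones used earlier in this thread,
# and the nodekey location assumes geth's default datadir layout.
DATADIR=/root/alastria/data
GENESIS=/root/genesis.json

# 1. Save the node key so the enode identity survives the wipe
cp "$DATADIR/geth/nodekey" /root/nodekey.bak

# 2. Drop the old chain database and re-create it from the genesis file
geth removedb --datadir "$DATADIR"
geth --datadir "$DATADIR" init "$GENESIS"

# 3. Restore the original node key before restarting the node
cp /root/nodekey.bak "$DATADIR/geth/nodekey"
```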
The binary file you provided:
The binary I compiled myself:
Could any of these files help with the solution?
In order to run the same test using the Docker image provided by Quorum... could I have access to the original Dockerfile used for https://hub.docker.com/r/quorumengineering/quorum? |
I'm pretty sure the problem is different: this one is about syncing a new node in full mode, while the @carlosho17 issue is related to the new storage model for the chain database |
Better debug output attached:
Geth arguments:
Log error after a fresh chaindb install |
Looking at the logs I notice that you haven't cleared the freezer db:
So you're getting the BAD BLOCK on the first block your node is trying to download during the sync (block 8597101). |
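For anyone hitting the same thing: in geth 1.9.x the freezer ("ancient") store holds the older blocks and has to be cleared along with the rest of the chain database. A sketch, assuming the default on-disk layout:

```bash
#!/bin/sh
DATADIR=/root/alastria/data

# Option 1: let geth do it - on 1.9.x `geth removedb` prompts separately for
# the chain database and the ancient (freezer) database; answer yes to both.
geth removedb --datadir "$DATADIR"

# Option 2: manual cleanup. By default the freezer lives inside chaindata
# (geth/chaindata/ancient), so removing chaindata removes it too; if the node
# was started with --datadir.ancient, that separate directory must go as well.
rm -rf "$DATADIR/geth/chaindata"
```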
Thank you for the answer, @SatpalSandhu61. The problem persists after a clean-up of the chaindb. The logs start at the wrong number because I restarted geth, only to keep the file smaller. |
I believe you may not have fully understood my comment regarding clearing the freezer db. |
Hi, just to recap on this issue for the Alastria Quorum Network. We all stumble upon a certain block when using geth 1.9.7 that yields this message: DEBUG[01-15|11:28:01.360] Downloaded item processing failed number=8597101 hash=e4a2d7…49b2e5 err="invalid merkle root (remote: 0f6d6606b447b6fd26392f999e84be08fdf8b71f956b83116017dbb371ea1f1a local: 8a6cab008e2572a774a3c1eadc36269fa65662471c088652853db94e38ff8e59)" We have spent the last few weeks trying all scenarios (fast and full sync, erasing the whole data directory and reinitializing with geth init while preserving the nodekey, fresh new installations, different Ubuntu versions, the Quorum tgz package, in-place compilation with Go 1.15 and 1.13, etc.). These tests have been performed not only by us in the Core Team but also by regular members. The result is always the same: it is block 8597101 where newer Quorum finds a bad merkle root and stops syncing. Our workaround is: install an older version, let it sync past block 8597101, and then switch to Quorum 20.10.x. There is a second workaround, which is to start fresh but with a copy of the chain from beyond the bad block. What we would like to know is whether the bad merkle root found by Quorum 20.10.x is a feature or a bug. Thank you |
The problem still persists in the new version 21.1.0: the sync process stops forever at block 8597100 when using full mode. We are using a new database, starting the sync from scratch. The problem is repeated in all cases: $ export PRIVATE_CONFIG=ignore
$ geth --datadir /root/alastria/data --networkid 83584648538 --identity BOT_DigitelTS_T_2_8_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --port 21000 --ethstats BOT_DigitelTS_T_2_8_00:[email protected]:80 --targetgaslimit 8000000 --syncmode full --nodiscover --metrics --metrics.expensive --pprof --pprofaddr 0.0.0.0 --pprofport 9545 --metrics.influxdb --metrics.influxdb.endpoint http://geth-metrics.planisys.net:8086 --metrics.influxdb.database alastria --metrics.influxdb.username alastriausr --metrics.influxdb.password ala0str1AX1 --metrics.influxdb.tags host=BOT_DigitelTS_T_2_8_00 --verbosity 5 --cache 8192 --nousb --maxpeers 256
I have created a new log file with the last lines: they are repeated forever. In order to make progress on this problem, we could whitelist an enode address for developers to do their own testing. The Alastria ecosystem, with more than 120 nodes, is waiting on this issue to proceed with the version migration. Last few lines from the linux console: sync-fails.txt |
Full trace from start of the synchronization:
https://drive.google.com/file/d/1rx7bzJdygwomRBMfRn3Bftczf6nwuAeJ/view?usp=sharing |
Hi, you stated "We are using a new database, starting the sync from scratch.". As mentioned earlier in the thread, the |
Hi @SatpalSandhu61, thanks for the feedback, I promise the directory was empty. However, I have repeated the process on a newly created path, and the problem repeats: full sync mode hangs at block 8597100. I have considered the linked issues, and it seems that it's related to a problem in some geth versions. One last consideration: this is a permanent error, and always reproducible. Alastria has more than 100 active nodes, and the migration process to GoQuorum 20.xx / 21.xx is pending the results of these tests: any help will be appreciated. {
admin: {
datadir: "/home/alastria/data-full",
nodeInfo: {
enode: "enode://beabec74344fc143c9585017c940a94f0b7915024de2d632222e0ef58a1e6c9b3520d2d3e1ada304ef5b1652ba679f2f9686190f83d89d5f81410d0a9680881e@46.27.166.130:21000?discport=0",
enr: "enr:-JC4QHN8R874S81ttpNdPBLM72SF4M0vgyBnSmyhfB9fBcKKXVH9EEfCYGD8-HFY1HTuy0QLzSNL2c7rzCq-a4PHKvgGg2V0aMfGhEXl0IiAgmlkgnY0gmlwhC4bpoKJc2VjcDI1NmsxoQK-q-x0NE_BQ8lYUBfJQKlPC3kVAk3i1jIiLg71ih5sm4N0Y3CCUgg",
id: "3713f5a6c14042c2483ede889f88e36ce70b870ada6087f45b41976527128e62",
ip: "46.X.Y.Z",
listenAddr: "[::]:21000",
name: "Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5",
plugins: {},
ports: {
discovery: 0,
listener: 21000
},
protocols: {
istanbul: {...}
}
},
peers: [],
[...]
eth: {
accounts: [],
blockNumber: 8597100,
coinbase: "0x9f88e36ce70b870ada6087f45b41976527128e62",
compile: {
lll: function(),
serpent: function(),
solidity: function()
},
defaultAccount: undefined,
defaultBlock: "latest",
gasPrice: 0,
hashrate: 0,
mining: false,
pendingTransactions: [],
protocolVersion: "0x63",
syncing: {
currentBlock: 8597100,
highestBlock: 61898986,
knownStates: 0,
pulledStates: 0,
startingBlock: 8597102
},
call: function(),
[...]
version: {
api: "0.20.1",
ethereum: "0x63",
network: "83584648538",
node: "Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5",
whisper: undefined,
getEthereum: function(callback),
getNetwork: function(callback),
getNode: function(callback),
getWhisper: function(callback)
}, Is there any way to get more information about the faulty node via the |
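Not part of the original exchange, but a sketch of how similar diagnostics can be pulled non-interactively from a stuck node over IPC; the geth.ipc path inside the datadir is an assumption, and debug.traceBlockByNumber (standard debug API) can be slow and verbose on large blocks:

```bash
#!/bin/sh
# Assumed IPC path: <datadir>/geth.ipc
IPC=/home/alastria/data-full/geth.ipc

# Where the node thinks it is versus the network head
geth attach --exec 'eth.syncing' "$IPC"
geth attach --exec 'eth.blockNumber' "$IPC"

# Header of the last block that was accepted (the parent of the bad block)
geth attach --exec 'eth.getBlock(eth.blockNumber)' "$IPC"

# Re-execute the last accepted block to inspect its state transitions
geth attach --exec 'debug.traceBlockByNumber(eth.blockNumber, {})' "$IPC"
```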
Thanks a lot. |
Thanks @nmvalera , we will start to share what you say next week. |
Hi @nmvalera , thanks for the feedback.
Geth/v1.8.18-stable(quorum-v2.2.3-0.Alastria_EthNetstats_IBFT)/linux-amd64/go1.9.5 So far, there have been no updates since that version: only a few nodes are still on version 1.8.2. In fact, we are working on a renewal of the network, a fundamental part of which is taking advantage of the improvements in the new versions of GoQuorum: bug fixes, monitoring, ...
> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13"
> (finish ok) > admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-20c95e5d(quorum-v2.4.0)/linux-amd64/go1.11.13"
> (finish ok) > admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-685f59fb(quorum-v2.5.0)/linux-amd64/go1.11.13"
> (finish ok) > admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-9339be03(quorum-v2.6.0)/linux-amd64/go1.13.10"
> (STOP SYNCING)
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL) v20.10.0 ADDED > admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-af752518(quorum-v20.10.0)/linux-amd64/go1.13.15"
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL) v21.1.0 ADDED > admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5"
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL) All the tests were made in this environment: root@alastria-01:~# ldd /usr/local/bin/geth
linux-vdso.so.1 (0x00007ffeb65e7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb6c3f64000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb6c3f59000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb6c3e0a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb6c3c18000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb6c3f8f000)
root@alastria-01:~# uname -a
Linux alastria-01 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@alastria-01:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS" This is a permanent error, always reproducible for every new node of the Alastria GoQuorum network. Thanks to the entire GoQuorum team at ConsenSys for supporting the Alastria-T network |
Thanks, this helps a lot. Can you do the following:
In the meantime, we will dig into the error. |
Thanks @nmvalera! Yes! We have already tested the proposed workaround: once the database is fully synchronized (with a version prior to v2.6.0), the binary can be upgraded without problems (just minor changes to the metrics arguments). It also works for a direct upgrade from v2.5.0 to v21.1.0. Please keep up the effort in searching for a solution: we would like new Alastria partners to be able to perform a direct synchronization of their nodes in full mode, using the latest versions of GoQuorum, to maintain the "trust" of the network. |
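For readers looking for the workaround itself, a rough sketch of the sequence being described; the binary names are assumptions, and the flags shown are a subset of the ones used elsewhere in this thread:

```bash
#!/bin/sh
DATADIR=/home/alastria/data-full

# 1. Full-sync with a pre-v2.6.0 release until the head is past block 8597101
/usr/local/bin/geth-quorum-v2.5.0 --datadir "$DATADIR" --networkid 83584648538 \
    --syncmode full --permissioned --port 21000

# 2. Stop the node and swap in the newer release
cp /usr/local/bin/geth-quorum-v21.1.0 /usr/local/bin/geth

# 3. Restart on the same datadir with the new binary
#    (only the metrics-related flags needed adjusting, per the comment above)
/usr/local/bin/geth --datadir "$DATADIR" --networkid 83584648538 \
    --syncmode full --permissioned --port 21000
```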
Thanks, we are discussing this internally and we will keep you updated (we may require some more information from you at some point, we'll let you know). |
@alejandroalffer please also review the migration docs for upgrading from earlier versions of Quorum to 2.6.0 and above. A bad block can sometimes be caused by not setting (EDIT) |
@alejandroalffer Could you please confirm that the version you are looking to migrate from, Geth/v1.8.18-stable(quorum-v2.2.3-0.Alastria_EthNetstats_IBFT)/linux-amd64/go1.9.5, is not an official GoQuorum version but, I imagine, Alastria's own custom fork? Thanks. |
@alejandroalffer @cmoralesdiego Any news on the 2 topics above? Thanks a lot. |
Hi @nmvalera , we are going to give you feedback early next week. Thanks in advance |
Hi @nmvalera , @cmoralesdiego Sorry for the delay. I've tried to restart the synchronization in full mode using different values for the istanbulBlock and petersburgBlock parameters: root@alastria-01:/home/iadmin# diff /root/genesis.json-original /root/genesis.json
18c18,20
< "policy": 0
---
> "policy": 0,
> "petersburgBlock": 10000000,
> "istanbulBlock": 10000000 I've tryed some values... from setting it to The logs shows hundred of messages like
On the other hand, there was a fork for the Alastria network, with minor updates to improve reporting in EthNetStats, but later versions, based on the same version of geth and new releases of GoQuorum, finish the synchronization in full mode without problems: Geth v1.8.18 · GoQuorum v2.2.3 - Alastria version, finishes
Geth v1.8.18 · GoQuorum v2.4.0 - Official version, finishes
Geth v1.8.18 · GoQuorum v2.5.0 - Official version, finishes
Geth v1.9.7 · GoQuorum v2.6.0 - Official version, fails
Geth v1.9.7 · GoQuorum v20.10.0 - Official version, fails
Geth v1.9.7 · GoQuorum v21.1.0 - Official version, fails IMHO, the problem appears in the upgrade from Geth 1.8.18 to Geth 1.9.7. Best regards! |
@alejandroalffer from your log I see
As a comparison, I see the following in my logs when starting a node with these values set in my genesis.json:
(EDIT: 24 Feb) |
Thanks for the feedback, @chris-j-h: You were right: I've been using Some logs... INFO [02-24|23:24:05.143] Initialised chain configuration config="{ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 100000000 TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: 100000000 Istanbul: 100000000 PrivacyEnhancements: <nil> Engine: istanbul}" The Alastria network is at block ~
root@alastria-01:~# cat genesis.json
{
"alloc": {
"0x58b8527743f89389b754c63489262fdfc9ba9db6": {
"balance": "1000000000000000000000000000"
}
},
"coinbase": "0x0000000000000000000000000000000000000000",
"config": {
"chainId": 83584648538,
"byzantiumBlock": 0,
"homesteadBlock": 0,
"eip150Block": 0,
"eip150Hash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"eip155Block": 0,
"eip158Block": 0,
"istanbulBlock": 100000000 ,
"petersburgBlock": 100000000,
"constantinopleBlock": 100000000,
"istanbul": {
"epoch": 30000,
"policy": 0,
"petersburgBlock": 0,
"istanbulBlock": 0
},
"isQuorum": true
},
"extraData": "0x0000000000000000000000000000000000000000000000000000000000000000f85ad594b87dc349944cc47474775dde627a8a171fc94532b8410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c0",
"gasLimit": "0x2FEFD800",
"difficulty": "0x1",
"mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
"nonce": "0x0",
"parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"timestamp": "0x00"
} The start script: VER="v21.1.0"
export PRIVATE_CONFIG="ignore"
/usr/local/bin/geth --datadir /home/alastria/data-${VER} --networkid 83584648538 --identity REG_DigitelTS-labs_2_2_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --ethstats REG_DigitelTS-labs_2_2_00:[email protected]:80 --verbosity 3 --vmdebug --emitcheckpoints --targetgaslimit 8000000 --syncmode full --gcmode full --vmodule consensus/istanbul/core/core.go=5 --nodiscover --cache 4096 2> /tmp/log.${VER} And the result: pi@deckard:~ $ md5sum log.v21.1.0.gz
8a5d2b1355b3e0c0690e2aafa263781f log.v21.1.0.gz
[log.v21.1.0.gz](https://github.com/ConsenSys/quorum/files/6040662/log.v21.1.0.gz) There's another point, and maybe it's not relevant: using values
Thanks again for not giving up! Best regards! |
Making the log from @alejandroalffer's above comment #1107 (comment) clickable: log.v21.1.0.gz |
@alejandroalffer as you noted earlier your logs have a large number of There are also quite a few other Do you see any of these when doing a full sync from block 0 with your current Alastria version or pre-v2.6.0 Quorum?
|
Hi! I've used GoQuorum v2.5.0: the last version in which full synchronization finishes correctly. As you know, it's based on Geth v1.8.18. The "bad block" is reached and passed, and the logs show: root@alastria-01:/tmp# zcat log.v2.5.0.gz |grep "VM returned with error"|cut -f3- -d" "|sort|uniq -c
20 VM returned with error err="contract creation code storage out of gas"
22267 VM returned with error err="evm: execution reverted"
21 VM returned with error err="evm: max code size exceeded"
106 VM returned with error err="invalid opcode 0x1b"
27 VM returned with error err="invalid opcode 0x1c"
21 VM returned with error err="invalid opcode 0x23"
7 VM returned with error err="invalid opcode 0x27"
17 VM returned with error err="invalid opcode 0x4f"
4 VM returned with error err="invalid opcode 0xa9"
3 VM returned with error err="invalid opcode 0xd2"
2 VM returned with error err="invalid opcode 0xda"
7 VM returned with error err="invalid opcode 0xef"
17 VM returned with error err="invalid opcode 0xfe"
3823 VM returned with error err="out of gas"
130 VM returned with error err="stack underflow (0 <=> 1)"
6 VM returned with error err="stack underflow (0 <=> 13)"
2 VM returned with error err="stack underflow (0 <=> 3)" The full log here, log.v2.5.0.gz root@alastria-01:/tmp# md5sum log.v2.5.0.gz
505f207b66846dc4e20170cd70bd7561 log.v2.5.0.gz BTW... the process hangs near block 10.000.000, because [...]
"istanbulBlock": 10000000,
"petersburgBlock": 10000000,
"constantinopleBlock": 10000000,
[...] Thanks again! |
@alejandroalffer said:
These values should be a future block that hasn't been seen yet. In an earlier comment you said you set the values to |
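As a concrete illustration of that point (a sketch, not taken from the thread): the fork activation blocks belong at the top level of "config" in genesis.json and should name a block the network has not reached yet; re-running geth init on the existing datadir then applies the updated chain config. The value 90000000 is only a placeholder.

```bash
#!/bin/sh
DATADIR=/home/alastria/data-full
GENESIS=/root/genesis.json
FUTURE_BLOCK=90000000   # placeholder: pick a block well ahead of the current head

# Set the fork blocks at the top level of "config" (not inside "istanbul")
jq --argjson b "$FUTURE_BLOCK" \
   '.config.constantinopleBlock = $b
    | .config.petersburgBlock = $b
    | .config.istanbulBlock = $b' \
   "$GENESIS" > /tmp/genesis-updated.json

# Config-only edits do not change the genesis block itself, so re-running
# init on the existing datadir updates the stored chain configuration.
geth --datadir "$DATADIR" init /tmp/genesis-updated.json
```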
Let’s try and track down exactly where the state is deviating from what is expected:
If you can share each of these outputs we can do some comparisons and see where the state is deviating. If you run into any problems it may be easier to discuss on the Quorum Slack. Feel free to msg me if needed. |
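If it helps anyone reproduce the comparison, a sketch of collecting the dumps non-interactively so the outputs from a working (pre-v2.6.0) node and a stuck node can be diffed; the address, block numbers, and debug.dumpAddress call are the ones that appear later in this thread, and the IPC path is an assumption:

```bash
#!/bin/sh
IPC=/home/alastria/data-full/geth.ipc
ADDR=0x4F541bab8aD09638D28dAB3b25dafb64830cE96C

# 0x832e6c = 8597100 (last good block), 0x832e6d = 8597101 (the bad block)
for BLOCK in 0x832e6c 0x832e6d; do
  geth attach --exec "debug.dumpAddress('$ADDR', '$BLOCK')" "$IPC" \
      > "dump-$BLOCK.json"
done

# Run the same loop on the old node and on the new node, then e.g.:
#   diff old/dump-0x832e6c.json new/dump-0x832e6c.json
```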
Hi @chris-j-h , and the rest of the GoQuorum team... Answering the questions, I'll make a summary:
The full log:
Keep in touch! Thanks again! |
Hi! I'm updating the status of the issue with this week's news, to share it with the Alastria and ConsenSys teams. We did some searching in the chaindb (thanks for the snippet, @chris-j-h), in order to find out whether the method used in the bad block appears elsewhere in the chain: for (i = 1; i < 9999999; i++) {
hexBlockNumber = "0x" + i.toString(16)
txs = eth.getBlockByNumber(hexBlockNumber, true).transactions
for (j = 0; j < txs.length; j++) {
if (txs[j].input.slice(0, 10) == "0xd30528f2") {
console.log("tx calling method 0xd30528f2 found in block " + i)
}
}
The result is that this transaction appears in several blocks, 657 in total: tx calling method 0xd30528f2 found in block 7809408
[...]
tx calling method 0xd30528f2 found in block 9231310 It seems that this transaction is not related to the synchronization problem :-( To summarize, per @chris-j-h:
We'll keep investigating the out-of-memory crashes in order to get the results from the RPC API: debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6c') and debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6d') And these references:
Any other suggestions will also be appreciated. Thanks again, @chris-j-h! |
I'm assuming this has been fixed now, feel free to re-open if that is not the case |
It is not the case, I'm facing the same issue in a different quorum network. |
Can you raise a fresh ticket with genesis and param details, along with the exact version it stopped working at? |
Root cause identified (at least one of the consensus issues that cause the invalid merkle [state] root error): Quorum version 2.7+ and at least up to version 21.10.2 marks an account as dirty only if it was NOT deleted in the same block. In our case the bug manifested due to multiple ecrecover calls in the same block. After the first call, the 0x1 account is added to the state, then removed after the transaction as empty (same behavior for both nodes). After the second call, the 0x1 account is added to the state again; the old node marks it as dirty and removes it after the tx, while the newer node does not mark it as dirty and leaves it in the state, which results in different final states. And if someone stumbles here looking for a fix, here it is: master...Ambisafe:quorum:21.10.2-fix |
@lastperson good spot - do you want to submit a pull request? |
@antonydenyer I'm now checking whether the latest node will sync with the fixed version, or if it has the same issue. If it has the same issue, then I guess this fix could only be introduced as a fork configuration. |
I'm testing upgrading the Alastria Quorum (v1.8.18) to a newer version (v2.7.0 and v20.10.0).
But the chain synchronization fails in full mode. Fast mode finishes correctly.
We use the well-known genesis file for the Alastria node:
https://github.com/alastria/alastria-node/blob/testnet2/data/genesis.json
And the command line looks like:
But it can't get past block 8597100. It happens in both of the version upgrades we are testing:
The log is almost the same in both versions:
This problem does not happen with the current stable Alastria version: the full synchronization finishes correctly:
It is necessary to be able to recreate the chain in full mode before upgrading the network clients.
Full log of the failed synchronization, up to the "BAD BLOCK" message:
FULL LOG
log.err.txt.gz
Related links: