DB corrupted: Corruption: block checksum mismatch #10915

zet-tech · 2019-07-24T13:00:26Z

Parity Ethereum version: 2.5.5 (and occurred in 2.4.x as well)
Operating system: Linux (Ubuntu 18.04)
Installation: binaries from GitHub
Fully synchronized: yes
Network: ethereum
Restarted: yes

We encounter the following issue:
2019-07-23 14:46:06 DB corrupted: Corruption: block checksum mismatch: expected 3848331410, got 863857200 in /mnt/HC_Volume_2853858/eth/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/073385.sst offset 50078209 size 8611. Repair will be triggered on next restart

stack backtrace:
0: 0x555d9984087d -
1: 0x555d9983d94e -
2: 0x555d99977617 -
3: 0x555d99882d74 -
4: 0x555d99882aee -
5: 0x555d9989f805 -
6: 0x555d998a574b -
7: 0x555d98bc0106 -
8: 0x555d99322f5d -
9: 0x555d98bd2b01 -
10: 0x555d991022a9 -
11: 0x555d990d935b -
12: 0x555d99105a9d -
13: 0x555d9918ce74 -
14: 0x555d9989e54e -
15: 0x555d998a10db -
16: 0x7fa367bb56da -
17: 0x7fa3676c688e -
18: 0x0 -
Thread 'Verifier #0' panicked at 'DB flush failed.: Custom { kind: Other, error: StringError("Corruption: block checksum mismatch: expected 3848331410, got 863857200 in /mnt/HC_Volume_2853858/eth/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/073385.sst offset 50078209 size 8611") }', src/libcore/result.rs:999
This is a bug. Please report it at:
https://github.com/paritytech/parity-ethereum/issues/new
2019-07-23 14:46:06 Finishing work, please wait...
2019-07-23 14:46:06 DB corrupted: Corruption: block checksum mismatch: expected 3848331410, got 863857200 in /mnt/HC_Volume_2853858/eth/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/073385.sst offset 50078209 size 8611. Repair will be triggered on next restart
2019-07-23 14:46:06 unable to get mut ref for engine for shutdown.
2019-07-23 14:46:06 DB corrupted: Corruption: block checksum mismatch: expected 3848331410, got 863857200 in /mnt/HC_Volume_2853858/eth/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/073385.sst offset 50078209 size 8611. Repair will be triggered on next restart
eth.service: Main process exited, code=exited, status=1/FAILURE

The last line comes from systemd and indicates that the process exited.
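When comparing many such crashes, it can help to pull the SST file, offset, and size out of each "block checksum mismatch" line, to see whether the same file and block are reported every time or whether the damage moves around. The following is a minimal sketch, assuming the message format is exactly as in the log above; the names `parse_corruption` and `summarize` are made up for illustration and are not part of Parity or RocksDB.

```python
import re
from collections import Counter

# Matches the "block checksum mismatch" message as it appears in the log above.
# The field layout is an assumption based on that single sample.
CORRUPTION_RE = re.compile(
    r"block checksum mismatch: expected (?P<expected>\d+), got (?P<got>\d+)\s+"
    r"in (?P<path>\S+) offset (?P<offset>\d+) size (?P<size>\d+)"
)


def parse_corruption(line):
    """Return the fields of a corruption message as a dict, or None if absent."""
    match = CORRUPTION_RE.search(line)
    if match is None:
        return None
    return {
        "expected": int(match["expected"]),
        "got": int(match["got"]),
        "path": match["path"],
        "offset": int(match["offset"]),
        "size": int(match["size"]),
    }


def summarize(lines):
    """Count how often each (file, offset) pair is reported as corrupted."""
    counts = Counter()
    for line in lines:
        info = parse_corruption(line)
        if info is not None:
            counts[(info["path"], info["offset"])] += 1
    return counts
```

Calling `summarize(open("parity.log"))` on logs from several corrupted nodes would show whether the reports cluster on one file or are spread across different files and offsets.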

After restart:

Loading config file from /etc/parity/eth.conf
2019-07-23 14:46:17 Starting Parity-Ethereum/v2.5.5-stable-3ebc769-20190708/x86_64-linux-gnu/rustc1.36.0
2019-07-23 14:46:17 Keys path /mnt/HC_Volume_2853858/eth/keys/ethereum
2019-07-23 14:46:17 DB path /mnt/HC_Volume_2853858/eth/chains/ethereum/db/906a34e69aec8c0d
2019-07-23 14:46:17 State DB configuration: fast
2019-07-23 14:46:17 Operating mode: active
2019-07-23 14:46:17 DB has been previously marked as corrupted, attempting repair
2019-07-23 14:59:44 DB corrupted: Invalid argument: You have to open all column families. Column families not opened: col6, col2, col0, col3, col1, col5, col4, attempting repair
Failed to open database Custom { kind: Other, error: StringError("Received null column family handle from DB.") }
eth.service: Main process exited, code=exited, status=1/FAILURE

This goes into an infinite loop of restarts. The only solution is to delete the DB.

And now the most interesting part. We first encountered this problem half a year ago, but at the beginning we thought it was some hardware read/write error. Since then we have been trying to find the reason behind the problem. We were able to reproduce it on several different virtual machines from Hetzner (Hetzner Cloud) with different sets of resources (from 2GB to 8GB RAM, 1-4 cores, always RAID SSD disks and ECC memory) in different data centers. Last month, with v2.4.8, the problem was so frequent that it took longer to synchronize the node than for the DB to become corrupted (12-18h), but this may be a coincidence. What is certain is that the issue is getting more frequent over time.

We are also running Ethereum Classic nodes, sometimes alongside an ETH node, sometimes on a different machine, always with the same cloud provider. The issue has never (over at least one year, probably longer) happened on the ETC chain. During the same timespan the ETH DB was corrupted at least 30 times.

One idea was that the DB is being corrupted during restarts. This is not the case: this time the machine had not been restarted for 2 weeks, since the initial sync. It seems that when running with --no-ancient-blocks it takes longer to corrupt the DB. We also trimmed our config file to remove additional noise.

We are aware of #7748, #8766, #9867, #9019 and #8583. All of them were identified as hardware related. This is still a possible answer, but we put a lot of effort into minimizing hardware impact (we ran all tests, asked the data center operator to run tests on their infrastructure, and switched machines, data centers and resources). Of course we also have Parity nodes that have been running for many months without problems; the problem mainly concerns low-resource environments (<16GB RAM, <4 cores).
If this is a hardware problem, why does DB repair always fail? In such a case it is often a one-bit parity error in memory (ECC should fix that, but maybe the cloud provider is doing something wrong). Maybe Parity should do checksums on memory and the DB to avoid, or be able to recover from, such errors. This issue is getting more and more problematic, as it is getting harder and harder to do a full sync from scratch, and cloud hosting of Ethereum nodes is very popular.
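To illustrate the one-bit error scenario, here is a small sketch of how a per-block checksum catches a single flipped bit. It uses Python's standard `zlib.crc32` purely for demonstration; this is an assumption for the example and not the checksum routine that RocksDB or Parity actually use, and the helper names (`checksum`, `flip_bit`, `verify`) are hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC-32 of the data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def verify(data: bytes, expected: int) -> bool:
    """True if the data still matches the checksum stored for it."""
    return checksum(data) == expected


# Stand-in for a database block; the real block contents are not known here.
block = bytes(range(256)) * 4
stored = checksum(block)

# Simulate a single-bit memory or disk error somewhere in the block.
damaged = flip_bit(block, 1234)

print(verify(block, stored))    # intact block passes
print(verify(damaged, stored))  # a single flipped bit is detected
```

A checksum like this only detects the damage. Recovering from it would additionally require redundant data, such as a second copy of the block or an error-correcting code.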

I attach the logs (from the beginning, including the initial sync) and the configuration file used. If it would help, I can also provide SSH access to this machine (it was created only for testing purposes). This one was a low-RAM machine with swap and only 1250MB cache, but we were able to reproduce the issue on 8GB RAM without any swap.

Logs:
parity.log
Config:
parity.txt

jam10o-new · 2019-07-24T13:02:54Z

Thanks for the extremely detailed report!

gituser · 2019-07-25T22:19:25Z

This might be related to this issue: #10893

jam10o-new added labels F2-bug 🐞 (The client fails to follow expected behavior), M4-core ⛓ (Core client code / Rust), M4-io 💾 (Interaction with filesystem/databases) Jul 24, 2019

jam10o-new added this to the 2.7 milestone Jul 24, 2019

vorot93 closed this as completed Apr 17, 2020
