This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Ancient block sync stalls #7008

Closed
ethernian opened this issue Nov 9, 2017 · 19 comments · Fixed by #9531
Labels
F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust. P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible.
Comments

@ethernian

I'm running:

  • Parity version: Parity/v1.8.2-beta-1b6588c-20171025/x86_64-linux-gnu/rustc1.21.0
  • Operating system: Linux
  • And installed: sudo apt-get install ./parity_1.8.2_amd64.deb

Configuration detail: the chains directory is symlinked to another drive.

human@EtherBox:~$ ls -lisa ./.local/share/io.parity.ethereum/chains
524382 0 lrwxrwxrwx 1 human human 28 Sep  3 21:25 ./.local/share/io.parity.ethereum/chains -> /media/human/CHAIN_DB/chains

After warp sync, I'm unable to get all missing blocks fetched. After each new start, Parity forgets the already fetched blocks and starts downloading them again and again.

Here is a sample of 3 Parity runs:

[screenshot: block numbers downloaded in three consecutive Parity runs]

Here is the log file from the runs above (as a text file):
parity-3x-sync.txt


@Office-Julia Office-Julia added the Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. label Nov 9, 2017
@roninkaizen

roninkaizen commented Nov 10, 2017

Normal behaviour / a misunderstanding.
Looking at the screenshot you posted, it becomes obvious:

confirmed = yes
unwanted, unintended = no
irritating = yes
an annoyance = no

This is normal behavior; others count on the transfers too and are fully synced on their nodes.

Would somebody from Parity please document (declare) this behavior, so that it is known to be normal?
thx

@5chdn 5chdn added M4-core ⛓ Core client code / Rust. Z1-question 🙋‍♀️ Issue is a question. Closer should answer. and removed Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. labels Nov 10, 2017
@5chdn
Contributor

5chdn commented Nov 10, 2017

That's the ancient block download.

Warp-sync downloads the latest snapshot and the last 30k blocks. After that, it starts downloading the full blockchain (yellow numbers).
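
For anyone watching the progress: the ancient download can be checked over RPC. A minimal sketch, assuming the default HTTP-RPC port 8545 (recent builds also report the remaining blockGap in the reply, depending on version):

  curl -X POST -H "Content-Type: application/json" \
       --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
       http://localhost:8545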

@5chdn 5chdn closed this as completed Nov 10, 2017
@ethernian
Author

ethernian commented Nov 10, 2017

@5chdn @roninkaizen
What is the reason to download ancient blocks again and again? This is the issue reported.

Please check the block numbers:
1st run: downloaded ancient blocks from #3574411 to #3614671
2nd run: downloaded ancient blocks from #3573015 to #3576698
3rd run: downloaded ancient blocks from #3573269 to #3575935

Ancient blocks are being downloaded in the same range again and again.
What is the reason to do so, if it is not a bug?
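
For reference, the ranges above can be pulled out of the attached logs roughly like this. A sketch only: the ancient:#<n> token is my assumption about the informant's output format, so adjust the pattern to the actual log lines:

  # lowest and highest ancient block number seen in one run's log
  grep -oE 'ancient:#[0-9]+' parity-3x-sync.txt | cut -d'#' -f2 | sort -n | sed -n '1p;$p'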

@arkpar arkpar reopened this Nov 11, 2017
@arkpar
Collaborator

arkpar commented Nov 11, 2017

This is really weird indeed. Never seen this before. Could you restart with -l sync=trace and post logs?
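
For example, something like:

  # restart with sync tracing written to a file (log path is just a suggestion)
  parity --log-file ~/parity-trace.log -l sync=trace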

@ethernian
Author

Yes, please:
Here are 4 subsequent runs; the last two are with -l sync=trace.

[screenshot: block numbers from four consecutive runs]

Trace logs:
parity-trace.log.zip

@5chdn 5chdn added this to the 1.9 milestone Nov 13, 2017
@5chdn 5chdn added F2-bug 🐞 The client fails to follow expected behavior. P5-sometimesoon 🌲 Issue is worth doing soon. P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible. and removed Z1-question 🙋‍♀️ Issue is a question. Closer should answer. P5-sometimesoon 🌲 Issue is worth doing soon. labels Nov 13, 2017
@5chdn 5chdn changed the title Warp sync: after restart parity forgets already fetched blocks Ancient block sync stalls Nov 14, 2017
@5chdn
Contributor

5chdn commented Nov 14, 2017

Having similar issues with 1.8.2

@5chdn 5chdn mentioned this issue Nov 16, 2017
@5chdn 5chdn modified the milestone: 1.9 Dec 6, 2017
@5chdn 5chdn added P0-dropeverything 🌋 Everyone should address the issue now. and removed P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible. labels Jan 3, 2018
@5chdn
Contributor

5chdn commented Jan 3, 2018

Can we also make sure we do not purge ancient blocks whenever a warpsync kicks in?
Edit: #6350

@tomusdrw
Collaborator

tomusdrw commented Jan 5, 2018

As a workaround:

  1. Remove database
  2. Warp again to latest snapshot (hopefully warp health improves soon with new bootnodes being rolled out)
  3. Download ancient blocks again.
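
In shell terms, roughly (a sketch, assuming a default data directory; parity db kill should be equivalent to removing the chain database by hand):

  # 1. stop the node, then drop the chain database
  parity db kill
  # 2. restart; warp sync is on by default, fetches the latest snapshot,
  #    and ancient blocks are then downloaded again in the background
  parity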

Can we also make sure we do not purge ancient blocks whenever a warpsync kicks in?

@5chdn That's a separate issue, can you log it?

@tomusdrw tomusdrw added P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible. and removed P0-dropeverything 🌋 Everyone should address the issue now. labels Jan 5, 2018
@5chdn
Contributor

5chdn commented Jan 5, 2018

@tomusdrw I think it is this one: #6350

@lght lght self-assigned this Jan 9, 2018
@lght

lght commented Jan 11, 2018

As a workaround:

Remove database
Warp again to latest snapshot (hopefully warp health improves soon with new bootnodes being rolled out)
Download ancient blocks again.

👍 for this workaround

With a 1.10.0 nightly build, warp took only ~40min from a fresh db!

Peer connections are still really volatile; they just dropped to ~10 peers. I got to a max of 25 peers, but spend most of the time with 1-5 peers.

Will report back with sync status; currently at block 4880220, warp synced to block 4880000.

Update: confirmed, fully synced from scratch using warp and fast compaction!! Took ~30 hrs from start to fully synced.

@lght

lght commented Jan 13, 2018

@arkpar is it possible that part of @dip239's issue is that a corrupt block in the import round he was working on caused all blocks in that round to be rolled back on restart?

@5chdn
Contributor

5chdn commented Jan 16, 2018

@lght, this issue is about the ancient block sync that happens after the warp sync.

@GoodMirek

GoodMirek commented Jan 18, 2018

The issue "Ancient block sync stalls" has happened to me twice so far. The Parity and Linux version strings follow:

version Parity/v1.10.0-unstable-25b19835e-20180117/x86_64-linux-gnu/rustc1.22.1

Linux wedos1 4.14.13-300.fc27.x86_64 #1 SMP Thu Jan 11 04:00:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Parity has been started with command:

parity daemon /home/ether/parity.pid --log-file /home/ether/parity.log -l info --cache-size 4096 --max-peers 10 --min-peers 5

It seems to happen once per ~1.5M blocks (1st occurrence at block 1639023, 2nd occurrence at block 2996831). After the first occurrence, following a graceful restart (SIGHUP), the parity process continued to download ancient blocks from where it left off. There were parity process restarts in between the occurrences.

The second occurrence of the issue just happened, and I am keeping the process running in the stalled state. The blockchain head keeps syncing. If you need any further info from the stalled process or need access to the system, let me know (there is no private info on the system). If I do not hear back within 3 days, I plan to restart the process with TRACE log level. I expect the issue could happen again before all blocks are downloaded.

While it was in the stalled state, I took a core dump of the process using gcore. Attached is the gdb backtrace gdb_core_9187.txt. The log is at info level, so probably not useful. If you need the coredump file, I will share it (it contains no private info).
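
The dump was taken roughly like this (a sketch; the binary path and pid file are assumptions based on my setup above, and gcore names the core file parity_core.<pid>):

  # dump a core of the running process (pid from parity.pid)
  gcore -o parity_core "$(cat /home/ether/parity.pid)"
  # write backtraces of all threads to a text file
  gdb -batch -ex 'thread apply all bt' ~/parity/target/release/parity parity_core.<pid> > gdb_core_9187.txt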

If that helps, I can rebuild parity in a different way, e.g. with symbols and without optimizations. I would appreciate a hint on how to do that, as I have zero Rust knowledge.
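
If I understand the Cargo docs correctly, the default (dev) profile might already give exactly that. A sketch, assuming the source tree sits in ~/parity:

  # unoptimized build with debug symbols (cargo's default dev profile)
  cd ~/parity && cargo build
  # the binary then lands in target/debug/parity instead of target/release/parity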

@5chdn 5chdn modified the milestones: 1.9, 1.10 Jan 23, 2018
@5chdn 5chdn mentioned this issue Jan 26, 2018
@5chdn 5chdn modified the milestones: 1.10, 1.11 Mar 1, 2018
@5chdn 5chdn unassigned lght Mar 1, 2018
@5chdn
Contributor

5chdn commented Apr 5, 2018

This just happened to one of my fresh 1.10.0 nodes, if anyone wants to debug this.

@5chdn 5chdn modified the milestones: 1.11, 1.12 Apr 24, 2018
@folsen
Contributor

folsen commented May 20, 2018

@ngotchac can you double-check that #8642 addresses this and if so close the issue please.

@GoodMirek

GoodMirek commented May 20, 2018

I have tried twice with yesterday's commit 6552256.
In one case (an OpenVZ container) it never finished syncing the snapshot; in the other (a KVM VM) it got stalled after syncing the snapshot successfully, just before getting in sync with the network, while processing the last blocks in the queue.

It might be attributable to compilation with rustc 1.26. It also seems there is different memory allocation behavior when compiled with rustc 1.26 instead of 1.25: either there is a memory leak, or it cannot handle running in an OpenVZ container compared to running in a KVM VM.

Though running in an OpenVZ container is probably an uninteresting use case (I am using it because it comes at a good price), maybe it helps to know that the parity process on the containerized node always dies after some time, which is not the case when compiled with 1.25. The OpenVZ container correctly reports the total amount of memory available, but I am not sure whether it is able to indicate memory allocation failure via ENOMEM under memory pressure or just invokes the OOM killer.

I am running parity with this command:

~/parity/target/release/parity daemon ~/parity.pid --log-file ~/parity.log -l info --cache-size 2048 --cache-size-state 1024 --max-peers 200 --min-peers 25

in a container with 4 GB RAM. Setting cache-size-state to at least 512 makes a huge difference in performance in my case.
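
One way to check for a leak would be to log the resident size over time, e.g. (a sketch; pid file path as in my command above):

  # append parity's resident memory (kB) to a log once a minute
  while sleep 60; do ps -o rss= -p "$(cat ~/parity.pid)" >> parity_rss.log; done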

I am sorry I could not spend more time, as this is just a hobby. Currently, I am trying to run both my nodes with the vanilla 1.10.4 stable release binaries compiled with 1.25, downloaded from GitHub.

@folsen
Contributor

folsen commented May 20, 2018

@GoodMirek Thanks so much for this, it's really helpful; however, I don't think it's related to this issue. I'd also say that 4 GB RAM is probably not enough to run Parity with those caches; RAM usage can also depend a bit on peers, although not a ton. My own node consumes about 10 GB RAM with similar cache sizes, though that is still a bit above expectation for various reasons. Investigating running Parity under OpenVZ vs. a KVM VM would be a separate issue; it should work, but I personally don't know the requirements on the VM side.

@GoodMirek

@folsen The point is that on the same container, with the same command line options, a previous version built with rustc 1.25 ran just fine for more than a week. I do not remember which commit hash it was, but it was not more than a month old.
I know the previous build I ran suffered from this particular issue, "Ancient block sync stalls", as I had to restart the process several times to complete the download of old blocks.

@folsen
Contributor

folsen commented May 21, 2018

@GoodMirek Interesting, please open a separate issue for the difference between 1.25 and 1.26; we definitely don't want regressions between Rust versions.

@5chdn 5chdn modified the milestones: 2.0, 2.1 Jul 17, 2018
@5chdn 5chdn modified the milestones: 2.1, 2.2 Sep 11, 2018