This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

stack overflow error when syncing a new node #4527

Closed
mikiquantum opened this issue Jan 3, 2020 · 6 comments

Comments

@mikiquantum
Contributor

We are facing this error when trying to fully sync up a new node with the network:

thread 'import-queue-worker-0' has overflowed its stack
fatal runtime error: stack overflow

Context:

  • We have 7 nodes validating blocks (fully synced and healthy)
  • We have 3 nodes that serve as sync nodes (full nodes, all of them lagging behind the last finalized and best block)

Steps:

  1. Start another node from scratch. We specify some bootnodes; in our test scenario we use one of the validator nodes to bootstrap from.
  2. Syncing starts. After a while the stack overflow error is triggered, each time around a different block number.

Host resources (cpu, mem) look fine, way below any suspicious threshold.

The only suspicious behavior we saw in the logs was this INFO message:
Invalid justification provided by ...

Since the 3 full sync nodes were lagging behind the network, we decided to bring all of them down and then try syncing the new node again, and it successfully caught up with no stack overflow panic. Before that, though, we had to clear the new node's DB from the previous attempt; otherwise the panic is thrown as soon as the node process comes up.

Is it possible that the DBs of the 3 full nodes were corrupted, and that this triggered the stack overflow error in the new node when it tried to build the local chain, since it was using the state of those DBs?

Any insight (theoretical or practical) into scenarios where this error could be triggered would be highly appreciated.

Thanks in advance!
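As general background on the error quoted above: `thread '...' has overflowed its stack` is the message the Rust runtime prints when a thread runs out of stack space, and one common cause is recursion whose depth grows with the size of the input. The sketch below is not Substrate code. It uses a hypothetical parent-index table as a stand-in for a chain of blocks and contrasts a recursive ancestor walk (one stack frame per ancestor) with an iterative one (constant stack usage). All function names are illustrative assumptions.

```rust
use std::thread;

/// Recursive walk: uses one stack frame per ancestor, so a long enough
/// chain can exhaust the stack of the thread it runs on.
fn depth_recursive(parents: &[Option<usize>], node: usize) -> usize {
    match parents[node] {
        None => 0,
        Some(parent) => 1 + depth_recursive(parents, parent),
    }
}

/// Iterative walk: same result, but stack usage does not grow with the chain.
fn depth_iterative(parents: &[Option<usize>], mut node: usize) -> usize {
    let mut depth = 0;
    while let Some(parent) = parents[node] {
        depth += 1;
        node = parent;
    }
    depth
}

/// Builds a hypothetical linear chain: entry 0 has no parent,
/// entry i has parent i - 1.
fn linear_chain(len: usize) -> Vec<Option<usize>> {
    (0..len).map(|i| i.checked_sub(1)).collect()
}

/// Runs the recursive walk on a dedicated thread with an explicit stack size.
/// This only raises the limit; it does not remove the depth dependence.
fn depth_on_big_stack(parents: Vec<Option<usize>>, node: usize, stack_bytes: usize) -> usize {
    thread::Builder::new()
        .name("deep-walk".into())
        .stack_size(stack_bytes)
        .spawn(move || depth_recursive(&parents, node))
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}
```

Calling `depth_iterative(&linear_chain(n), n - 1)` is safe for any `n` that fits in memory, while the recursive variant is only safe as long as the thread's stack can hold `n` frames. Whether anything like this was involved in the overflow reported here is not established in this thread.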

@NikVolf
Contributor

NikVolf commented Jan 3, 2020

Thanks for the report!

What network are you trying to fully sync?

Can you reproduce "After a while the stack overflow error is triggered." now?

@mikiquantum
Contributor Author

Hi @NikVolf

Sorry, I should have clarified that: we are using our own network: https://telemetry.polkadot.io/#list/Flint%20Testnet%20CC1

I was able to reproduce it fairly consistently until I restarted/refreshed those 3 full nodes that were behind. At the moment I cannot reproduce it, since it looks like the network is in a healthy state now.
I might be able to reproduce it in the coming days, since I took a snapshot of one of those DBs.

Any hints on how this could have been triggered?

@NikVolf
Contributor

NikVolf commented Jan 3, 2020

@mikiquantum If it gets into a state where it can be reproduced consistently again, please try running the node with the -l sync=trace argument until the error is reproduced.

Meanwhile, we will try to figure it out.

If you have any custom code for your network and can share it, that will help too.

@mikiquantum
Contributor Author

Sure thing, I will enable that flag as soon as I can reproduce it again.

Here is the custom code for chainSpec and runtime lib:

Thank you!

@bkchr
Member

bkchr commented May 12, 2020

Any update on this?

@mikiquantum
Contributor Author

@bkchr I was not able to reproduce it again. I think we can close this for now and reopen it if we manage to trigger it again in the future.
Thanks!
