--light sync panics with thread 'tokio-runtime-worker' has overflowed its stack
#5998
Can you try to run with gdb (or similar) to get a proper stack trace? I tried it on my PC and could not yet reproduce it.
macOS has a default stack size of 512 KiB for non-main threads, so this should be reproducible on Linux with something like …
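For illustration only (not from the thread): a minimal Rust sketch of how deep recursion overflows a worker thread whose stack is capped at roughly the macOS default of 512 KiB. The `recurse` function is a hypothetical stand-in for the deeply recursive decode/import work.

```rust
use std::thread;

// Hypothetical stand-in for the deeply recursive work; each frame keeps a
// small buffer so the stack fills up after fewer calls.
fn recurse(depth: u64) -> u64 {
    let buf = [depth as u8; 1024];
    if depth == 0 {
        buf[0] as u64
    } else {
        buf[0] as u64 + recurse(depth - 1)
    }
}

fn main() {
    let handle = thread::Builder::new()
        // Roughly the macOS default stack size for spawned threads.
        .stack_size(512 * 1024)
        .spawn(|| recurse(1_000_000))
        .expect("failed to spawn thread");
    // The spawned thread aborts with "thread '<unnamed>' has overflowed its stack".
    let _ = handle.join();
}
```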
Yep, getting it ;)
(it is related to decoding the …)
I'm getting this on a full node when trying to sync for #7225. The command I'm running is simply …
So if the recursive nature of the …
Maybe there is a bug and we should already have pruned some of the state. Not sure. @andresilva probably knows more.
We prune BABE's epoch tree as we finalize new blocks; in this case we are not catching up to the latest finalized block, and as we sync BABE blocks the epoch tree gets deeper and deeper. Since all of the operations on the fork tree are currently implemented using recursion, we eventually reach the limits of the call stack. The solution is to rewrite the operations on the fork tree to not use recursion and instead use some auxiliary data structure as a stack (my brain understands recursion better, so that's why I wrote it like that the first time around).
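As a rough illustration of that rewrite (a sketch of the general technique, not the actual fork-tree code; the `Node` shape and function names are assumptions), a recursive traversal can be driven by an explicit `Vec` used as a stack instead of the call stack:

```rust
// Sketch only: a generic tree node, loosely shaped like a fork-tree node.
struct Node<T> {
    data: T,
    children: Vec<Node<T>>,
}

// Recursive version: each tree level consumes a call-stack frame, so a very
// deep (unpruned) tree can overflow the thread's stack.
fn count_recursive<T>(node: &Node<T>) -> usize {
    1 + node.children.iter().map(count_recursive).sum::<usize>()
}

// Iterative version: the pending work lives in a heap-allocated Vec instead
// of the call stack, so the depth is no longer bounded by the stack size.
fn count_iterative<T>(root: &Node<T>) -> usize {
    let mut stack = vec![root];
    let mut count = 0;
    while let Some(node) = stack.pop() {
        count += 1;
        stack.extend(node.children.iter());
    }
    count
}

fn main() {
    // Tiny example tree: a root with two children, one of which has a child.
    let tree = Node {
        data: 0u32,
        children: vec![
            Node { data: 1, children: vec![Node { data: 3, children: vec![] }] },
            Node { data: 2, children: vec![] },
        ],
    };
    assert_eq!(count_recursive(&tree), count_iterative(&tree));
}
```

Operations that need post-order results (e.g. pruning) follow the same idea, but the explicit stack also has to record how much of each node has already been processed.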
It should also be possible to write …
I just tried this and it had no effect, so this doesn't seem to be a problem with decoding. Here's the gdb backtrace:
@expenses you have rewritten the decode as an iterative function? Perfect! Decode and import suffer from the same problem, so we need a solution for both of them ;)
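For context, this is roughly what an iterative decode can look like; it is a sketch under assumptions, not the actual Substrate change. The `Node` layout, its assumed SCALE encoding (the data field, then a compact length prefix followed by the children), and the `decode_iterative` name are all made up for this example.

```rust
use parity_scale_codec::{Compact, Decode, Error, Input};

// Hypothetical node layout, assumed to be encoded as: data, then a compact
// length prefix, then each child (the layout `#[derive(Encode)]` would give).
struct Node<T> {
    data: T,
    children: Vec<Node<T>>,
}

impl<T: Decode> Node<T> {
    /// Iterative decode: an explicit Vec of partially built nodes replaces the
    /// call stack, so arbitrarily deep trees cannot overflow it.
    fn decode_iterative<I: Input>(input: &mut I) -> Result<Self, Error> {
        // Each entry is a node under construction plus the number of its
        // children that still have to be decoded.
        let mut stack: Vec<(Node<T>, u32)> = Vec::new();

        // Decode the root's header first.
        let data = T::decode(input)?;
        let len = <Compact<u32>>::decode(input)?.0;
        stack.push((Node { data, children: Vec::new() }, len));

        loop {
            // The stack is never empty here: we return as soon as the root is popped.
            let remaining = stack.last().expect("stack is non-empty").1;
            if remaining > 0 {
                // Descend: decode the next child's header and start building it.
                stack.last_mut().expect("stack is non-empty").1 -= 1;
                let data = T::decode(input)?;
                let len = <Compact<u32>>::decode(input)?.0;
                stack.push((Node { data, children: Vec::new() }, len));
            } else {
                // This node is complete: attach it to its parent, or return it
                // if it is the root.
                let (node, _) = stack.pop().expect("stack is non-empty");
                match stack.last_mut() {
                    Some((parent, _)) => parent.children.push(node),
                    None => return Ok(node),
                }
            }
        }
    }
}
```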
Unfortunately, …
I have written an iterative version of … The node still panics though, due to …
Thanks! Yeah, I think we can keep the …
I've seen this a few times on … I'm not running with … My stack looks something like this:
@notlesh I assume that your chain is not finalizing blocks; that is the root cause of the issue you're seeing.
This is running on Moonbeam's alphanet, which is using its own relay chain and running … By the way, I saw the same thing again over the weekend (where I needed to purge my chaindata and resync).
So, a little more info about when we hit this in the Moonbeam context. Moonbeam is a parachain built with Cumulus. As such, it runs a Polkadot full service to follow relay chain state. It is that Polkadot service within the Moonbeam process that is panicking in @notlesh's trace above (correct me if I'm wrong, but that's where GRANDPA is running).
Can you confirm that it is following finality from the relay chain? Please just post some of the node output from before the crash; I assume it isn't finalizing, which is why the GRANDPA pending changes tree is so deep (we only clean it up on finality).
Here's some recent output from the original crash:
So, like I said, the node is not finalizing: the relay chain best finalized block is stuck at #114261. I don't know if the network as a whole isn't finalizing or if it's just a local issue on that node. On top of that, your epochs seem to be 1 minute long, which just makes the problem worse. We should focus on why the node is not finalizing. We can make this case better by not using recursion in the fork-tree code (so that we don't overflow the stack if the tree is too deep), but that's just treating the symptoms and not the root cause.
Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.
On a recent Kusama, I'm running:
At some point, the sync panics with "thread 'tokio-runtime-worker' has overflowed its stack".
Gist: https://gist.github.com/amaurymartiny/b5b85f88b3bf2e590409ac371acba288
The "at some point" depends, I tried running twice:
I'm not sure if it's related to #4527; if it is, we can close this issue and continue there.
Specs: