Khala Node get stuck at a block height and Pherry fails to connect to node #1070
Comments
UPDATE: New behaviour: NODE1's Phala processes stop all operations and freeze, all at the same time. There are no visible errors in Node, PRuntime or Headers Cache; only Pherry says "exited with error: Rpc error: Request timeout" and "Restarting..." but does not restart. All containers stay frozen, and the Prometheus services (9615 and 9616) also stop responding. Attached are the logs from all containers for this session (I had to remove some intermediate lines from the PRuntime log to fit GitHub's maximum file size of 25MB). Docker status for all containers once they freeze:
root@khalanuc ~ # docker inspect -f '{{json .State}}' "phala-node"
{"Status":"running","Running":true,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":47521,"ExitCode":0,"Error":"","StartedAt":"2022-12-12T10:31:20.754943891Z","FinishedAt":"0001-01-01T00:00:00Z"}
root@khalanuc ~ # docker inspect -f '{{json .State}}' "phala-pherry"
{"Status":"running","Running":true,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":48161,"ExitCode":0,"Error":"","StartedAt":"2022-12-12T10:32:08.527511555Z","FinishedAt":"2022-12-12T10:32:07.550963243Z"}
root@khalanuc ~ # docker inspect -f '{{json .State}}' "phala-pruntime"
{"Status":"running","Running":true,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":47353,"ExitCode":0,"Error":"","StartedAt":"2022-12-12T10:31:20.481902779Z","FinishedAt":"0001-01-01T00:00:00Z"}
root@khalanuc ~ # docker inspect -f '{{json .State}}' "phala-headers-cache"
{"Status":"running","Running":true,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":47295,"ExitCode":0,"Error":"","StartedAt":"2022-12-12T10:31:20.613639649Z","FinishedAt":"0001-01-01T00:00:00Z"}
After restarting all processes, I saw this error repeated 64 times in the Node logs, along with these "warnings". The KSM chain works normally, but the Node cannot import Khala blocks, and Pherry gives network connection errors again:
And Pherry reports:
I needed to restart a few times (5 times) to get the Node working right again, importing both KSM blocks and Khala blocks. Everything works fine for 3 or 4 minutes, and then all containers freeze again. No matter how many times I restart the processes, the node never manages to import blocks from the Khala network continuously. |
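The `docker inspect` output above shows why this freeze is hard to spot from Docker alone: every container still reports `running` with no failure flags set. A minimal Python sketch that makes this explicit, assuming the State JSON has been captured as a string (the helper name `summarize_state` is hypothetical, not part of any Phala tooling):

```python
import json

def summarize_state(name, state_json):
    """Return a one-line summary of a container's `docker inspect` State JSON."""
    state = json.loads(state_json)
    # Collect only the failure-related flags that are actually set.
    flags = [k for k in ("Paused", "Restarting", "OOMKilled", "Dead") if state.get(k)]
    flag_text = ", ".join(flags) or "none"
    return f"{name}: status={state['Status']} exit={state['ExitCode']} flags={flag_text}"

# State copied verbatim from the phala-pherry output in this comment.
pherry_state = (
    '{"Status":"running","Running":true,"Paused":false,"Restarting":false,'
    '"OOMKilled":false,"Dead":false,"Pid":48161,"ExitCode":0,"Error":"",'
    '"StartedAt":"2022-12-12T10:32:08.527511555Z",'
    '"FinishedAt":"2022-12-12T10:32:07.550963243Z"}'
)

print(summarize_state("phala-pherry", pherry_state))
# prints: phala-pherry: status=running exit=0 flags=none
```

Since `OOMKilled` is false and the exit code is 0, the container state by itself gives no hint that the processes have stopped making progress, so any health check has to look at logs or endpoints instead.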
Sorry for the late response. Are you using a pruned node? |
Hi. Yes, I'm using PRUNE mode in all my nodes. |
Unfortunately, this is a Substrate limitation when you're using PRUNE mode with So you have to have 32G memory or switch to |
Excuse me, but how is this possible? I have had this miner running for around 6 months without any issue. I have another Intel NUC (same model) with 8GB RAM (+8GB swap), one Dell Optiplex with 16GB RAM (+8GB swap) and a custom miner with 12GB RAM (+8GB swap), all of them running smoothly with solo mining and Prune mode, and now this needs 32GB RAM? I have met a lot of people running solo mining on this Intel NUC model with 8GB RAM. Sorry, maybe I misunderstood your answer. |
Sorry, I may not have explained that well. Here's the upstream issue paritytech/substrate#11911 and the PR paritytech/substrate#11980 |
OK. If I understand correctly, this is the cause of the "freezing", right? Because even when starting from scratch, without databases, the Khala network blocks do not advance and the finalized block stays at zero, while the KSM database synchronizes correctly, and I don't understand why this happens. Honestly, I want to know the root cause of the node's problems syncing blocks (to fix it and get the node running), which is the main issue. That is why I mentioned when opening the issue that I have tried everything: from scratch, with a snapshot, copying the databases of a node that works... but the finalized blocks of the Khala network never increase, and Pherry eventually gives a connection error to the node. I have had 5 mining-only nodes in my network for almost a year with no problems (beyond one that corrupted the KSM DB, which I ended up removing), and I have never had this problem where one of the nodes is not able to finalize Khala blocks. I've checked everything I can (network elements, SSD disk, BIOS and OS configuration) and I can't find the reason for this to happen on a node that had been running correctly in prune mode for months until 3 weeks ago. If I reconfigure the node to run in full mode, should it work correctly? |
This will cause unexpected memory usage; when total memory is exhausted, the process will OOM and be killed by the OS. The memory usage depends on how many blocks are kept; for solo-script, we keep |
Hi jasl. I have tried with --pruning 1000, but it does not solve the issue (the node is still stuck at a block height in Khala, it does not finalize Khala blocks and Pherry keeps restarting because it cannot connect to the node). |
I have tried from scratch with FULL mode, and the same issue remains: after 30 minutes of syncing, the finalized block on the Khala network is still #0, but KSM finalizes blocks fine. |
You can use this snapshot to quickly restore the Kusama part: https://ksm.polkashots.io/ |
OK. I started from scratch again: checked the BIOS settings again, did a new OS install (Ubuntu 20.04 with 5.4.0-135-generic), installed the Phala solo mining tools, and copied the databases from a running node.
I have another miner syncing Pherry, installed at the same time as this one (with the same OS version, kernel version, Phala solo mining tools version, etc.). Both nodes received a database copy from the same running node. But this one fails to import Khala blocks and the other is running fine (importing KSM and Khala blocks normally). Both are behind NAT on different ports (the one running fine on 30333/30334 and this one on 31333/31334), and I have another two miners that have been running fine for weeks behind NAT (32333/32334 & 33333/33334) as well. I'm trying to understand what is happening to this miner, which is not able to import Khala blocks, stops at a Khala block height, and has Pherry disconnect. I hope I have explained this clearly. |
hmmm that's weird... |
Yep, it looks good except that there is no line like this:
Same time period log from the other node with same hardware and installation:
I mean, it's not able to import khala blocks. At first I thought it might be a network problem, but both nodes (the two that are the same) are directly connected to the fiber router, and I even changed the network cable and also changed the port on the router just in case, but to no avail. |
Do you see any error? I found an issue where the node won't import blocks again. Recently, I built several Khala nodes using an internal beta version of Khala-node, and I haven't seen the problem again (I don't know whether an upstream change fixed this or it's just my luck), |
Yes mate, for sure, I'll try whatever you think is convenient, no problem. The only error I have seen on this node is the one I submitted in my first post:
|
OK, I'll make a new version for you. It will take a few hours; I'll reply when it's ready. Those panics are OK: they're runtime errors and don't affect sync. |
Look at this: If I restart all Phala processes, it starts running fine again:
But I guess, as usual, it will stop working properly in a little less than 1 hour. |
This is weird to me. I can't imagine what is happening or which part may be having trouble, sorry. |
No worries mate, I can imagine it's a weird issue, cos I have installed dozens of nodes and I have never seen behaviour like this. Well, the node has failed again to import Khala blocks and Pherry was disconnected again. I have restarted the processes and all is working fine again. So at least I've managed to keep the containers from freezing XD |
Try
|
OK, I'll try when I get home. |
Well, it is running right now with your compiled image. It's running fine at the moment (it has reported some errors, you can see the log), but I'm going to wait a couple of hours and will let you know, cos it usually works fine for the first 30 or 40 minutes after the restart.
|
Bad luck mate, it failed a few minutes after the start 😓. Here are the session logs. |
After it failed, I can't start the node anymore; every time I try to start it, it panics.
|
hmmmm.... I think I need to make a patch tomorrow. So your node can't work even if you restart it a few times? |
Yes, after 6 or 7 restarts it works again. I have set up a crontab job that runs every 15 minutes, checks whether Pherry is connected to the node, and restarts the processes if not. It also writes a log, so I will know how many times the processes needed to be restarted tonight. |
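The actual cron script is not shown in the thread, so the following is only a sketch of how such a watchdog could look in Python. Assumptions: the container names are the ones from the `docker inspect` commands earlier in this issue, "Pherry is disconnected" is detected by looking for the `Rpc error: Request timeout` message quoted in the first update, and the 10-minute log window and the log file path are made up for illustration.

```python
import subprocess
from datetime import datetime

# Container names as they appear in the `docker inspect` commands in this issue.
CONTAINERS = ["phala-headers-cache", "phala-pruntime", "phala-node", "phala-pherry"]
# Error text quoted from the Pherry log earlier in this issue.
TIMEOUT_MARKER = "Rpc error: Request timeout"

def pherry_looks_disconnected(log_text, marker=TIMEOUT_MARKER):
    """Return True if the given Pherry log text contains the RPC timeout error."""
    return marker in log_text

def recent_logs(container, since="10m"):
    """Fetch recent log output of a container via `docker logs --since`."""
    result = subprocess.run(
        ["docker", "logs", "--since", since, container],
        capture_output=True, text=True, check=False,
    )
    # Containers may write to either stream, so look at both.
    return result.stdout + result.stderr

def check_and_restart(log_path="watchdog.log"):
    """Restart all containers if Pherry looks disconnected; return True if restarted."""
    if not pherry_looks_disconnected(recent_logs("phala-pherry")):
        return False
    subprocess.run(["docker", "restart", *CONTAINERS], check=False)
    # Record the restart so the number of restarts can be counted later.
    with open(log_path, "a") as log:
        log.write(f"{datetime.now().isoformat()} restarted\n")
    return True
```

A script that calls `check_and_restart()` could then be scheduled with a `*/15 * * * *` crontab entry. The log window is deliberately shorter than the cron interval so that an error logged just before the previous restart is less likely to trigger a second one.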
Hi. 3 restarts during the night, fairly close together:
|
|
Hi jasl. In case it helps you find the cause of this strange behaviour: I see that the more blocks Pherry has synchronized, the less often the connection to the node fails. In addition, the restarts triggered in the last few hours have all fallen within a short period of time: the node has been able to work for 7 or 8 hours straight, then had to be restarted 3 or 4 times in one hour (with the cron task every 15 minutes), and then spent another 5 or 6 hours working correctly. However, at the beginning of the Pherry sync (the first 200k or 300k blocks) it needed to be restarted at least once every hour. |
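The "bursts of restarts separated by long quiet periods" pattern described above can be checked against a restart log with a few lines of Python. This is a sketch only: the function name `group_bursts`, the one-hour gap threshold, and the sample timestamps are all hypothetical and are not taken from the real logs.

```python
from datetime import datetime, timedelta

def group_bursts(timestamps, max_gap=timedelta(hours=1)):
    """Group restart times into bursts; a new burst starts after a gap > max_gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)   # close enough to the previous restart
        else:
            bursts.append([ts])     # long quiet period: start a new burst
    return bursts

# Hypothetical restart times, for illustration only.
restarts = [
    datetime(2022, 12, 16, 1, 0),
    datetime(2022, 12, 16, 1, 15),
    datetime(2022, 12, 16, 1, 45),
    datetime(2022, 12, 16, 9, 30),
]

print([len(burst) for burst in group_bursts(restarts)])
# prints: [3, 1]
```

Comparing burst sizes and the gaps between bursts over several days would show whether the failures really become rarer as the Pherry sync progresses.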
Can you try
this one should fix |
OK mate, the node is running with this image. I will let you know about any change in the node's behaviour. The first thing I can observe is that the number of peers on the KSM network is very unstable, constantly changing between 10 and 40, but not stabilizing at 40 (as would be normal after several minutes of operation). I guess it is something specific to this build you gave me, isn't it? |
Sorry for this off-topic question: have you planned an SGX SDK upgrade in the next PRuntime image release to address INTEL-SA-00657? |
Unfortunately no. We actually have a new SDK version, but it will break seal decryption, which means you would lose your workers' info and have to rebuild them, so we decided to stick with the old SGX SDK for PRuntime v0. We have considered this situation before, so our next major PRuntime (to be released early next year) has a new handover process, but this feature is hard to backport to V0. |
I'm sorry, I can't answer this... it mixes many possible issues. The peer count is a black box for me, and even for Parity's developers. |
OK, I can understand this decision. BTW, the Pherry service keeps disconnecting and still needs a restart of the processes to work. It occurs approximately every 3 or 4 hours. |
then I have no idea now... |
Hi jasl. I have a new error on Pherry process:
Now the Pherry sync process is stuck at this point and cannot continue even if I restart it. |
Hi. Yesterday I had the above error on one node, and the other finished the Pherry sync fine, but now the same error appears on both nodes. |
For this one, I'll prepare a new Pherry to fix it (today). I know what this is. |
OK mate. Let me know when it's ready to test on my nodes. |
Try this phalanetwork/phala-dev-pherry:22122001 |
Update: |
Hi jasl. Well, one of my nodes is now synced and Pherry is importing new blocks without problems. The other node has this error, and I wonder if it's a database issue or if it's related to the images I'm using.
These are the images the node is using:
|
I haven't seen |
Yes, that's the first thing I tried before writing here, but all the headers were updated and the error persists. |
Any idea how I can fix this? |
Sorry, I really have no idea... I'll forward the issue to our team and hope my colleagues know... |
OK, thanks. I have tried copying the databases from another node that is running, but the issue is still there. |
I still suspect the headers-cache was not refreshed... but I really haven't seen this error before (if you met Perhaps you can use another node directly? |
When you say "use other node", do you mean copying the headers-cache data from any other running node? |
If you have many workers, your Pherry can connect to a remote khala-node and headers-cache... but I guess solo-mining does not support this. Here's a sample which I'm using:
|
Good point. I have always wondered whether several solo miners could all connect to a single remote "phala-node" or whether it is necessary to use PRB for that. OK, I will try connecting this Pherry to a remote (and properly running) node+headers-cache and will let you know. Thanks! |
Oh man! I can't believe it, at last the two nodes are synchronized and ready to be registered on the network! Your idea worked like a charm, thank you so much for all your time and effort, jasl! I'm going to close the issue and hope I don't have to bother you again for a long time! |
I have two nodes, both with the same hardware (Intel NUC7PJYH2), and both of them show the same behaviour.
Both nodes get stuck at a block height (on the Khala network; KSM syncs well) and Pherry restarts because it fails to connect to the node. It's very random, cos sometimes I restart the processes and everything works fine for several hours (12h or 14h), but sometimes I need to restart the processes 2, 3 or 4 times until I get everything working fine. I have tried reinstalling both nodes from scratch, using databases synced from zero, with and without the KSM snapshot, and even copying databases from other nodes that are running fine.
NODE1: It had been running smoothly for months, but after a cooling-down period it began to show the issue.
NODE2: It has never run before; it's a brand new machine installed for the first time.
I have attached a file with logs (from process start until the failure) and system info from both nodes. Please don't hesitate to ask for any further info you need.
nodes_info_and_logs.tar.gz