geth in light mode hanging with 100% CPU #20464
Comments
CPU still at 100% but geth synced some blocks:
Still running, only ~5% CPU.
Sent some more SIGKILL until geth panicked and was restarted by systemd:
can you do debug.stacks?
@ligi Please elaborate. Ok, found it:
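For anyone else wondering how to produce that dump: debug.stacks() prints the goroutine stacks of the running node and can be called from an attached console. A minimal sketch, assuming geth is reachable over its default IPC endpoint:

# attach to the local node over the default IPC endpoint
geth attach
# then, at the console prompt:
> debug.stacks()

The resulting dump is what the maintainers are asking for above.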
awesome - let us know when you have the logs
Chiming in just to say that I have the same problem after upgrading to 1.9.9.
Same here.. roughly 10% of our nodes have CPU throttling. Also we have no choice because it's the only client with Istanbul activated at the right height. We are running it in the context of another decentralized network, so hopefully we can resolve this quickly. I can get more data as well to feed the analysis.
I've noticed the same problem when there was a strange peer in the peers list: name: "Multigeth". When I got this peer in the peers list, my geth in light mode became very slow in performing requests. It uses 100% CPU and all that you described. Maybe it was another peer, but it certainly had this difficulty. Hope this information will help fix it.
@degger80 interesting, seems to make sense... perhaps also related to clients that did not upgrade yet, because they would have forked by now. So if there are any peers that are not on the latest version, they are likely on different chains.
Same here! Removing the "Multigeth" node helps:
Can confirm. Removing the node with the wrong difficulty results in geth syncing the missing blocks and then dropping CPU @ligi
Had the same issue on 2 nodes and removing the MultiGeth nodes resolved the CPU issue:
34.216.196.4:30303 MultiGeth/v1.9.5-stable-ad5e13d5/linux-amd64/go1.13.4
34.216.42.219:30303 MultiGeth/v1.9.5-stable-ad5e13d5/linux-amd64/go1.13.4
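If it helps anyone check for the same thing, the connected peers and their client names can be listed from an attached console before removing anything. A rough sketch, assuming a local node with the default IPC endpoint (field names as reported by admin.peers on 1.9.x):

# list client name and remote address for every connected peer
geth --exec 'admin.peers.map(function (p) { return p.name + "  " + p.network.remoteAddress; })' attach

Any peer reporting itself as MultiGeth (or, as noted further down, with an implausibly low difficulty) can then be dropped with admin.removePeer(enode).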
Are developers able to look at what those peers are doing by adding them and analyzing the code paths to shut the attack down?
Thanks to everyone providing helpful information! Especially the workaround of removing MultiGeth can really help to investigate the issue (cc @zsfelfoldi ) |
Because anyone exploiting this bug can make all light clients unusable.
sure - but this only makes it a potential attack - my question was if he has information that there is an active attack
Is it by design that geth in light mode connects to unsynced nodes and keeps them in the peer list?
Perhaps someone is using MultiGeth to try to DoS light nodes.. because they are all on the same subnet somewhere on AWS, so it's likely a single provider or service or person. Anyways, MultiGeth hasn't been updated in a while and is out of date, but perhaps someone is screwing around.. What I'd like to see is whether we are connecting to other MultiGeth nodes successfully without issue, as that would isolate it from an attack to a bug causing extraneous resource consumption, which is still pretty bad itself but not an attack per se.
How would it end up syncing peers if it cannot connect to non-synced peers? In a mesh network every node needs to be able to provide for every other node.
Another one of my nodes is now showing this issue.
same here. had to admin.removePeer this one:
My workaround so far (via cronjob every 5 minutes):

#!/bin/sh
exec geth --exec 'admin.peers' attach \
  | grep -P 'enode|MultiGeth' \
  | grep -B1 -F MultiGeth \
  | grep enode \
  | cut -d '"' -f 2 \
  | while read enode
    do
      geth --exec "admin.removePeer('$enode')" attach
    done

This might be done better via javascript but I don't know javascript enough. Interestingly,
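The JavaScript variant hinted at above can be sketched as a single --exec call, with the name matching done in console JavaScript instead of grep. Same assumptions as the shell script: a local node, the default IPC endpoint, and "MultiGeth" as the prefix to match:

# drop every connected peer whose client name starts with "MultiGeth"
geth --exec 'admin.peers
  .filter(function (p) { return p.name.indexOf("MultiGeth") === 0; })
  .forEach(function (p) { admin.removePeer(p.enode); })' attach

As with the grep version, this is only a periodic cleanup and not a fix; the node will keep accepting new connections from such peers.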
Is there any progress on fixing this issue? Every day I am having to remove peers for some of my nodes.
I've been trying to repro this, but whenever I 'meet' a MultiGeth node, the peer is dropped almost instantly.
maybe they have to connect to you, as inbound instead of outbound |
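If someone wants to check that theory, the inbound flag is part of the peer info, so counting inbound MultiGeth connections is a one-liner (same assumption of a local node on the default IPC endpoint):

# count how many MultiGeth peers connected to us, rather than the other way around
geth --exec 'admin.peers.filter(function (p) { return p.network.inbound && p.name.indexOf("MultiGeth") === 0; }).length' attach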
Same problem still... These are the issue counts for each IP: the remove-peer script is doing the job so far...
Our current hunch is that the light client fetcher is a bit broken when a peer announces a valid very very long sidechain. |
We should work toward #19710 to fix this issue. |
Another idea is adding the checkpoint challenge from #20125. |
Any update on this ongoing issue? |
@johnp1954 We suspect there are some issues in the light client fetcher. Check the PR here: #20692
Thank you
I just noticed that there is a new type of node that is causing high CPU usage when connected. The name is "CoreGeth". It also has the low difficulty, similar to the MultiGeth client. Seems to be an ETC node client as well. So I changed my javascript code to only check for low difficulty and not even bother with the name of the client (see my old code above).
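Roughly what such a difficulty-based check could look like from an attached console. The field layout is an assumption (on a light client the handshake difficulty is reported under protocols.les), and MIN_TD is a purely illustrative threshold: anything below mainnet's total difficulty but above what the misbehaving peers announce.

geth --exec '
  var MIN_TD = 1e21; /* hypothetical cut-off: below mainnet total difficulty, above what the bad peers announce */
  admin.peers
    .filter(function (p) { return p.protocols.les && p.protocols.les.difficulty < MIN_TD; })
    .forEach(function (p) { admin.removePeer(p.enode); });
' attach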
@turbo-boost any chance you could try out the fix in #20692? The PR is waiting for review from @zsfelfoldi. I have a strong feeling that the issue is caused by old
@turbo-boost Ah, if you are not familiar with Go, it will probably take some time.
I have seen this too; if you look, it appears to be the same node as MultiGeth, just with a different name.
The enode has all the same values and the same IP.
@rjl493456442 I managed to get your PR compiled, and have been running it for about 36 hours. During about 4-5 of those hours, I had CoreGeth (enode://7adc5369b5b40...) connected to my node, and I observed no ill effects. CPU usage was normal. I'll keep running it and let you know if I see any issues.
@turbo-boost Cool, thanks for trying it. |
Hello everyone, I am using a light client and still seeing 100% CPU utilization by Geth. How can I solve this? Thanks
@princesinha19 Which version are you using? I had a patch for it (unfortunately it's not merged yet); can you please try it?
@rjl493456442 Thanks, Gary. I am using Geth v1.9.14, which is the latest I think. I will use the solution you suggested and will update. Thanks.
Hi, we just merged the light fetcher rewrite PR. We'll close this for now, but feel free to reopen or open another issue if it happens again.
System information
OS: Ubuntu Xenial 16.04 LTS
System: AWS EC2 t3.small (2GB RAM, 2 Cores)
Expected behaviour
Geth using little CPU and syncing blocks
Actual behaviour
Geth using 100% CPU and not syncing blocks, geth being stuck at block 9120454.
Steps to reproduce the behaviour
Run geth for 11 days:
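The actual command line did not make it into the report; for context, a light-mode invocation of a 1.9.x node typically looks something like the following (all flags here are an assumption, not taken from this setup):

geth --syncmode light --cache 256 --rpc --rpcaddr 127.0.0.1 --rpcport 8545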
RPC
geth attach is not working, RPC is. This is a script I run to see block height between different hosts:
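The script itself was not copied into the issue; a rough stand-in that does the same thing over JSON-RPC (hypothetical host names, standard port 8545 assumed) would be:

#!/bin/sh
# query eth_blockNumber on each host and print the hex-encoded block height
for host in host-a host-b host-c; do          # hypothetical host names
  height=$(curl -s -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    "http://$host:8545" | grep -o '"result":"[^"]*"' | cut -d '"' -f 4)
  printf '%s\t%s\n' "$host" "${height:-unreachable}"
done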
Logs
journalctl -u geth
geth.log
Cacti
AWS Metrics
Moment CPU went up
Logfile from around that time: