Large Memory Consumption Tracking Issue (OOM) #4918
Comments
Following - this may be related to an issue we're also seeing - #4953
To keep people updated on the progress: we need to update Lighthouse to the latest libp2p, we have message priority sorted, and we have a form of time-bound message dropping. We are going to combine these into a rust-libp2p fork and start testing on live networks.
Further updates. We have a rust-libp2p fork which we are now testing. The new features we have added are:
We are adding these to Lighthouse and thoroughly testing ahead of a new Lighthouse release.
Great! Looking forward to seeing these in
Are you happy to close this @AgeManning? If not, perhaps we can at least drop the
This has been resolved in
@AgeManning could you please share the test results?
Sure. Sorry for the late reply. @diegomrsantos I think I've lost all our pretty graphs during all the analysis, but perhaps one of the others can chime in if they saved them or want to look back to previous data.

Fundamentally, Lighthouse beacon nodes (depending on their peers) would OOM. The memory profile grew to the order of 16GB before "randomly" freeing it back to 2-4GB. The graphs we were looking at would show memory slowly growing up to these numbers then occasionally dropping back down. The drops were due to peers being disconnected and the massive send queues being freed. It occurred because slow peers would accumulate a huge queue of messages to be sent to them and bloat Lighthouse's memory footprint.

After our changes, the memory profile stays steady at 2-4GB on all the nodes that ran the patch. We implemented a fancier queuing system inside gossipsub that prioritises important messages to be sent, and if others have been waiting too long to be sent, we simply drop them and don't bother sending them. So if a peer has a backlog, we remove older messages to make way for newer messages.

Here is what the queues look like on a normal node with poor bandwidth, currently running on mainnet: The queues are now bounded and as you can see they never really get populated up to their bounds. I could also show you a memory profile of a current Lighthouse node, but it's fairly bland (which is a good thing) with no wild memory spikes.
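To make the queuing idea above concrete, here is a minimal sketch of a bounded per-peer send queue with a priority lane and time-bound dropping. It is not the rust-libp2p or Lighthouse implementation: `PeerQueue`, `Kind`, `Queued`, the two-lane layout, and the evict-oldest policy are all assumptions made for illustration.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Message classes mentioned in the issue; this enum is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Control,
    Publish,
    Forward,
}

/// A message waiting to be sent, stamped with its enqueue time.
struct Queued {
    kind: Kind,
    payload: Vec<u8>,
    enqueued_at: Instant,
}

/// Hypothetical per-peer send queue with two bounded lanes.
struct PeerQueue {
    high: VecDeque<Queued>, // control and published messages (assumed split)
    low: VecDeque<Queued>,  // forwarded messages
    capacity: usize,        // bound applied to each lane
    max_age: Duration,      // low-lane messages older than this are dropped
    dropped: usize,         // running count of dropped messages
}

impl PeerQueue {
    fn new(capacity: usize, max_age: Duration) -> Self {
        PeerQueue {
            high: VecDeque::new(),
            low: VecDeque::new(),
            capacity: capacity.max(1), // a lane always holds at least one message
            max_age,
            dropped: 0,
        }
    }

    /// Enqueue a message. If its lane is full, the oldest entry in that lane
    /// is evicted to make room, so memory per peer stays bounded.
    fn push(&mut self, kind: Kind, payload: Vec<u8>, now: Instant) {
        let lane = match kind {
            Kind::Control | Kind::Publish => &mut self.high,
            Kind::Forward => &mut self.low,
        };
        if lane.len() >= self.capacity {
            lane.pop_front();
            self.dropped += 1;
        }
        lane.push_back(Queued { kind, payload, enqueued_at: now });
    }

    /// Next message to send: the high lane is drained first, then the low
    /// lane. Low-lane messages that waited longer than `max_age` are dropped
    /// instead of being sent.
    fn pop(&mut self, now: Instant) -> Option<Queued> {
        if let Some(msg) = self.high.pop_front() {
            return Some(msg);
        }
        while let Some(msg) = self.low.pop_front() {
            if now.saturating_duration_since(msg.enqueued_at) <= self.max_age {
                return Some(msg);
            }
            self.dropped += 1;
        }
        None
    }
}
```

With a slow peer, `push` keeps being called while `pop` is called rarely. In this sketch the backlog can never exceed `2 * capacity` messages, and stale forwarded messages are discarded when the connection finally has room to send.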
Description
We are aware of an issue on the mainnet network which is causing Lighthouse to consume more memory than it should. This is leading to Out of Memory (OOM) process terminations on some machines.
The root cause of the issue (we believe) is messages being queued on gossipsub to be sent out. This is a combination of messages being published, messages being forwarded and gossipsub control messages. The queues are filling up and the memory is not being freed. This appears to be occurring only on mainnet, which we assume is due in part to the size of the network and the number of messages being transmitted.
There are a number of solutions being put in place and being tested. This issue is mainly a tracking issue, so users can follow along with development updates as we correct this issue.
Primarily, the end solution will consist of three parts: more efficient memory management (avoiding duplication of message memory when sending), which should reduce allocations; a prioritisation of messages, so that published, forwarded and control messages can each be prioritised individually; and a dropping mechanism that allows us to drop messages when the queues grow too large.
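As an illustration of the first point (not duplicating message memory when sending to many peers), the sketch below stores the payload once behind a reference-counted pointer and queues only pointers per peer. The `Outbound` type, the `PeerId` alias and the method names are hypothetical, and this is one common way to share bytes in Rust rather than a description of the actual change.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical peer identifier, used only for this sketch.
type PeerId = u32;

/// Hypothetical set of per-peer outbound queues.
struct Outbound {
    queues: HashMap<PeerId, Vec<Arc<[u8]>>>,
}

impl Outbound {
    fn new(peers: &[PeerId]) -> Self {
        Outbound {
            queues: peers.iter().map(|p| (*p, Vec::new())).collect(),
        }
    }

    /// Queue one message for every peer. The bytes are moved into a single
    /// shared allocation; each queue stores a cheap pointer clone, not a copy
    /// of the payload.
    fn broadcast(&mut self, bytes: Vec<u8>) -> Arc<[u8]> {
        let shared: Arc<[u8]> = Arc::from(bytes);
        for queue in self.queues.values_mut() {
            queue.push(Arc::clone(&shared));
        }
        shared
    }

    /// Removing a peer drops its queue. A payload is freed once no queue
    /// (and no other holder) still references it.
    fn disconnect(&mut self, peer: PeerId) {
        self.queues.remove(&peer);
    }
}
```

Under this layout, the memory cost of a message queued for N peers is one payload plus N pointers, instead of N payload copies.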
Memory Allocations:
Message Prioritisation:
ConnectionHandler
libp2p/rust-libp2p#4811