Memory used increasing slowly #17450

Closed · marcosmartinez7 opened this issue Aug 20, 2018 · 32 comments

@marcosmartinez7 (Author)

System information

Geth version: 1.8.12 stable
OS & Version: Linux 16.04

Expected behaviour

Memory usage stays constant

Actual behaviour

Hi @karalabe,

I'm running the node without any --rpcapi modules enabled. The node started 3 days ago using 1.9% of my RAM (8 GB); it is now consuming 2.3% and keeps increasing slowly (roughly 10 MB/h).

I ran the node without specifying the --cache flag, so I assume it is using the default 1 GB.

Is this something I should worry about, or could it be related to garbage collection?

Steps to reproduce the behaviour

I ran the node with this command:

geth --datadir e1/ --syncmode 'full' --port 30357 --rpc --rpcport 8545 --rpccorsdomain '*' --rpcaddr 'server_ip' --ws --wsaddr "server_ip" --wsorigins "some_ip" --wsport 9583 --wsapi 'db,eth,net,web3,txpool,miner' --networkid 21 --gasprice '1'

@karalabe (Member)

Opening up the HTTP/WebSocket interfaces to outside traffic is dangerous because people are actively trying to break into nodes. Are you sure you need remote access to your node via HTTP? That should only be used if you're behind a firewall and can control access. Couldn't you use SSH + IPC to attach to a remote node?

With regard to memory use and RPC, what requests are you making? I can imagine that there might be some leak in our code, but providing some details about your usage could be invaluable to track it down.
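
For reference, a minimal sketch of the SSH + IPC approach suggested above, assuming the remote geth.ipc socket has first been forwarded to the local machine over SSH (the socket paths and host are placeholders, not taken from this issue):

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/ethereum/go-ethereum/ethclient"
    )

    func main() {
        // Assumes the remote IPC socket was forwarded locally first, e.g.:
        //   ssh -nNT -L /tmp/geth.ipc:/path/to/datadir/geth.ipc user@server
        // (host and paths are placeholders)
        client, err := ethclient.Dial("/tmp/geth.ipc")
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer client.Close()

        head, err := client.HeaderByNumber(context.Background(), nil)
        if err != nil {
            log.Fatalf("header: %v", err)
        }
        fmt.Println("latest block:", head.Number)
    }

This keeps the HTTP and WS interfaces closed while still allowing remote scripting against the node.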

@marcosmartinez7 (Author) commented Aug 20, 2018

Hi, yes, but I need to hit the smart contract from any origin. Is there any way to accomplish that without opening up the HTTP interface? It is a private network that must be accessible from any origin (MetaMask, MyEtherWallet, any Ethereum wallet...).

Regarding memory, usage increases with every transaction submitted. I thought it might be related to the garbage collector, maybe it isn't executing... I sent 100 transactions and that increased my RAM usage by about 50 MB.

Is there any information I can provide to help identify the possible leak?

I think the problem might be related to the fact that the RPC port is open and anyone could be doing something that allocates memory... the node was started with:

geth --datadir e1/ --syncmode 'full' --port 30357 --rpc --rpcport 8545 --rpccorsdomain '*' --rpcaddr 'server_ip' --ws --wsaddr "server_ip" --wsorigins "some_ip" --wsport 9583 --wsapi 'db,eth,net,web3,txpool,miner' --networkid 21 --gasprice '1'

so there is no exposed --rpcapi, only WS, and only from one specific IP.

Any ideas on how I can troubleshoot this? Memory only increases when I send transactions to the blockchain using MetaMask.

@marcosmartinez7 (Author) commented Aug 20, 2018

Recently I attached a console and sent 400 transactions (hitting a smart contract) in a for loop.

The memory used increased by 400 MB.

Any idea what I can look into to check what is causing this?

After 20-30 minutes it goes back to the previous RAM level, or a few MB more. Is this normal? The tendency is that, as time passes, the memory used increases whenever transactions are submitted.

@marcosmartinez7 (Author) commented Aug 21, 2018

I have tried this on another chain that doesn't expose an RPC endpoint, sending 5000 transactions from the geth console.

Memory started at 1140 MB and after 5000 transactions grew to 1550 MB, so geth takes about 400 MB to process those transactions.

Since the block time is 15 seconds, it will take a while to confirm those transactions, so is it normal to stay at 1550 MB for a while? Also, the cache memory is still increasing.

Is there anything I can share to check whether this behavior is OK? It seems that after 10k transactions the cache used by geth grew by 400 MB and the used memory also increased. The values do not settle back after sending a batch of transactions; maybe it is standard for geth to consume more memory the heavier the chain gets.

Also, the cache usage increases even without RPC interactions.

@cdljsj commented Jan 7, 2019

The problem gets worse in 1.8.20; I have to restart the geth node every half day due to high memory usage.

@hapsody commented Jan 8, 2019

Same problem here. I used the --cache flag (--cache "64") but the problem still occurs.

Version: 1.8.16-stable
Git Commit: 477eb09
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.11
Operating System: linux
GOPATH=
GOROOT=/home/travis/.gimme/versions/go1.11.linux.amd64

top - 00:58:06 up 21:28, 3 users, load average: 2.39, 2.18, 1.28
Tasks: 129 total, 1 running, 128 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 19.6 id, 79.8 wa, 0.0 hi, 0.0 si, 0.5 st
KiB Mem : 3985340 total, 116412 free, 3826964 used, 41964 buff/cache
KiB Swap: 7812496 total, 5197612 free, 2614884 used. 7448 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3811 ubuntu 20 0 7055996 3.570g 4120 D 2.0 93.9 2:07.48 geth

@hadv (Contributor) commented Jan 26, 2019

I have the same problem. It seems there is a memory leak somewhere. It also seems that a node that is mining with an open RPC endpoint has no problem.

If the node is not mining, then memory increases steadily, as @marcosmartinez7 reported, when sending thousands of transactions continuously via the RPC endpoint.

@marcosmartinez7 (Author) commented Jan 27, 2019

On version 1.8.18 I haven't experienced this problem anymore; the memory still increases, but it reaches a stable maximum value.

Take into account that geth keeps some information in memory that is only written to disk on each epoch, and an epoch is about 20-30k blocks.

So if you're using less than 4 GB of RAM I think this can happen.

@hadv (Contributor) commented Jan 28, 2019

Thank you for the information, @marcosmartinez7.

By the way, I can still reproduce the issue on 1.8.20 and 1.8.21 with 16 GB of RAM by sending 10,000 transactions (2000 txs per sealed block, with the txpool always filled up with ~10,000 txs).

One very strange thing is that a mining node with a public RPC endpoint does not have the issue on the same hardware configuration. So I think there must be a memory leak somewhere.

Mining: [memory usage screenshot]

Non-mining (memory leak): [memory usage screenshot]

@hadv (Contributor) commented Jan 29, 2019

cc: @karalabe @fjl

Goroutine leak on a non-mining RPC node: [screenshot]

Mining node (no leak): [screenshot]

@hadv (Contributor) commented Jan 29, 2019

Almost all of the leaked goroutines are calling feed.Send(), as shown below:

goroutine profile: total 49352
48057 @ 0x44384b 0x4438f3 0x41b15e 0x41ae4b 0x6edc6f 0x472051
#	0x6edc6e	github.com/ethereum/go-ethereum/event.(*Feed).Send+0x12e	/home/admin/gonex/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133

@hadv (Contributor) commented Jan 29, 2019

The goroutine leak might be the feed.Send() blocking issue reported in #18021.

@fjl fjl self-assigned this Jan 29, 2019
@fjl (Contributor) commented Jan 29, 2019

The issue is not that Feed.Send() is blocking, it's that the send to the feed happens in a background goroutine. Please provide a longer stack trace so we can see which part of the system is trying to send on the feed.
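
To make the failure mode concrete, here is a small self-contained sketch (a simplification, not the actual tx pool code) of what happens when every Feed.Send is fired from its own background goroutine and a subscriber stops draining its channel:

    package main

    import (
        "fmt"
        "runtime"
        "time"

        "github.com/ethereum/go-ethereum/event"
    )

    func main() {
        var feed event.Feed
        ch := make(chan int) // subscriber channel that is never drained
        sub := feed.Subscribe(ch)
        defer sub.Unsubscribe()

        // One sender goroutine per event, mirroring the pattern under discussion.
        for i := 0; i < 1000; i++ {
            go feed.Send(i) // each Send blocks until the subscriber reads, i.e. forever here
        }

        time.Sleep(time.Second)
        // Roughly a thousand goroutines are now parked inside Feed.Send.
        fmt.Println("goroutines:", runtime.NumGoroutine())
    }

A single blocking Send would surface the stuck receiver immediately; spawning a goroutine per Send just hides it behind an ever-growing goroutine count, which matches the profiles above.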

@hadv (Contributor) commented Jan 30, 2019

The issue is not that Feed.Send() is blocking, it's that the send to the feed happens in a background goroutine.

I think the issue happens as @liuzhijun23 figured out in #18021: when the for loop inside feed.Send() is blocked, any other call gets stuck at line 133, <-f.sendLock, because f.sendLock is empty at that point.

Please provide a longer stack trace so we can see which part of the system is trying to send on the feed.

It is almost all from the txpool:

goroutine 1267548 [chan receive, 1 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc0082560b0, 0xe93bc0, 0xc015639a20, 0xc0706cf0c8)
	/home/admin/gonex/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0x12f
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteExecutables
	/home/admin/gonex/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:1098 +0x17f8

@hadv (Contributor) commented Jan 30, 2019

@fjl a clue: on a non-mining node, TrySend() returns false very frequently, which leaks more and more goroutines over time. On a mining node, TrySend() always succeeds.

			if cases[i].Chan.TrySend(rvalue) {
				nsent++
				cases = cases.deactivate(i)
				i--
			} 

@holiman (Contributor) commented Jan 30, 2019

@hadv can you produce a full trace and upload it somewhere? There's a debug_stacks/debug.stacks() method to dump out the trace. It outputs to stdout, so if you run geth .... console > stacks.txt and then later execute debug.stacks(), the dump should be captured in the file.

That should show which particular receiver is bottlenecking the events.

Note: if you redirect with > stacks.txt you won't see the actual console, but it will still work if you type in the command.
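
For anyone who prefers to grab the dump programmatically instead of redirecting the console, a sketch using the Go RPC client, assuming the debug namespace is reachable on the node's HTTP endpoint (the URL is a placeholder):

    package main

    import (
        "context"
        "log"
        "os"

        "github.com/ethereum/go-ethereum/rpc"
    )

    func main() {
        // Placeholder endpoint; the debug API must be exposed there,
        // e.g. by adding debug to --rpcapi on the geth versions discussed here.
        client, err := rpc.Dial("http://127.0.0.1:8545")
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer client.Close()

        var stacks string // debug_stacks returns the goroutine dump as a string
        if err := client.CallContext(context.Background(), &stacks, "debug_stacks"); err != nil {
            log.Fatalf("debug_stacks: %v", err)
        }
        if err := os.WriteFile("stacks.txt", []byte(stacks), 0o644); err != nil {
            log.Fatal(err)
        }
    }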

@hadv (Contributor) commented Jan 30, 2019

@holiman (Contributor) commented Jan 30, 2019

You've got 203 threads stuck on

1: semacquire [Created by http.(*Server).Serve @ server.go:2851]
    sync       sema.go:71                      runtime_SemacquireMutex(*uint32(#1648), bool(#6018))
    sync       rwmutex.go:50                   (*RWMutex).RLock(*RWMutex(#1647))
    miner      worker.go:252                   (*worker).pending(#1646, 0, 0)
    miner      miner.go:155                    (*Miner).Pending(#1676, #28503, #129)
    eth        api_backend.go:92               (*EthAPIBackend).StateAndHeaderByNumber(#1920, #27, #16012, 0xfffffffffffffffe, #45, 0, 0, 0x5208)
    ethapi     api.go:700                      (*PublicBlockChainAPI).doCall(#1121, #27, #16012, #28616, #28747, #244, #22014, 0x551ae, 0, 0, ...)
    ethapi     api.go:791                      (*PublicBlockChainAPI).EstimateGas.func1(0x551ae, #26)
    ethapi     api.go:800                      (*PublicBlockChainAPI).EstimateGas(#1121, #27, #16012, #28616, #28747, #244, #22014, 0x551ae, 0, 0, ...)
    reflect    value.go:447                    Value.call(string(#2177, len=824635791808), []Value(0x13 len=16418845 cap=4), #2846, 0x3, 0x4, 0x0, #2846, ...)
    reflect    value.go:308                    Value.Call([]Value(#2177 len=824635791808 cap=19), #2846, 0x3, 0x4, 0x1, 0x1, 0x0)
    rpc        server.go:309                   (*Server).handle(#3676, #27, #16012, #37, #23615, #2845, #23616, 0, 0xe630e0)
    rpc        server.go:330                   (*Server).exec(#3676, #27, #16012, #37, #23615, #2845)
    rpc        server.go:192                   (*Server).serveRequest(#3676, #29, #27953, #37, #23615, 0xfb1801, 0x1, 0, 0)
    rpc        server.go:223                   (*Server).ServeSingleRequest(#3676, #29, #27953, #37, #23615, 0x1)
    rpc        http.go:257                     (*Server).ServeHTTP(#3676, #24, #12949, #14736)
    cors       cors.go:190                     (*Cors).Handler.func1(#24, #12949, #14736)
    http       server.go:1964                  HandlerFunc.ServeHTTP(ResponseWriter(#3660), *Request(#12949), #14736)
    rpc        http.go:324                     (*virtualHostHandler).ServeHTTP(#3661, #24, #12949, #14736)
    http       server.go:2741                  serverHandler.ServeHTTP(ResponseWriter(#3680), *Request(#12949), #14736)

If you are batch-adding thousands of transactions and doing an estimateGas on each and every one, it will be quite resource-intensive for the node. They will each be competing for the lock to obtain a particular state.
I guess a more batch-friendly method could be used, one that takes the pending state and reuses it for every tx in the batch. But however we do it, it'll be messy -- e.g. do we want to apply transactions on top of each other, or reset the state again after each?
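
As a purely client-side mitigation (a sketch of a workaround, not the batch-friendly node API described above), a sender submitting many identical transactions can estimate gas once and reuse that limit for the whole batch, so only one eth_estimateGas competes for the pending-state lock:

    package main

    import (
        "context"
        "fmt"
        "log"
        "math/big"

        ethereum "github.com/ethereum/go-ethereum"
        "github.com/ethereum/go-ethereum/common"
        "github.com/ethereum/go-ethereum/ethclient"
    )

    func main() {
        // Endpoint and addresses are placeholders.
        client, err := ethclient.Dial("http://127.0.0.1:8545")
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        from := common.HexToAddress("0x0000000000000000000000000000000000000001")
        contract := common.HexToAddress("0x0000000000000000000000000000000000000002")
        calldata := []byte{} // placeholder call data, identical for every tx in the batch

        // Estimate once for a representative call...
        gas, err := client.EstimateGas(context.Background(), ethereum.CallMsg{
            From: from, To: &contract, Value: big.NewInt(0), Data: calldata,
        })
        if err != nil {
            log.Fatal(err)
        }

        // ...and reuse the same limit (plus some headroom) for every
        // transaction in the batch instead of estimating each one.
        gasLimit := gas + gas/10
        fmt.Println("gas limit for the whole batch:", gasLimit)
    }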

@hadv (Contributor) commented Jan 30, 2019

@holiman Can you please explain why only a non-mining node needs to run the code below? That might be the reason why only non-mining nodes face the goroutine leak issue, right? Thank you!

		case ev := <-w.txsCh:
			// Apply transactions to the pending state if we're not mining.
			//
			// Note all transactions received may not be continuous with transactions
			// already included in the current mining block. These transactions will
			// be automatically eliminated.
			if !w.isRunning() && w.current != nil {
				w.mu.RLock()
				coinbase := w.coinbase
				w.mu.RUnlock()

				txs := make(map[common.Address]types.Transactions)
				for _, tx := range ev.Txs {
					acc, _ := types.Sender(w.current.signer, tx)
					txs[acc] = append(txs[acc], tx)
				}
				txset := types.NewTransactionsByPriceAndNonce(w.current.signer, txs)
				w.commitTransactions(txset, coinbase, nil)
				w.updateSnapshot()
			} else {

@holiman (Contributor) commented Jan 30, 2019

I don't know yet. However, there appear to be ~10K goroutines spawned by promoteExecutables that are waiting on the lock in feed.go. One is busy in the loop.

I think an underlying problem is that a better model for the transaction handling would be to use an active object (one thread/goroutine) which receives the data, instead of each sender spawning its own goroutine (https://github.com/ethereum/go-ethereum/blob/master/core/tx_pool.go#L1000). Btw, @hadv, I don't know what code you're running, but the line numbers in your stack do not match up with what's on master now.

I'm not sure if there are any simple solutions to this ticket, since IMO it would probably require a non-trivial rewrite of tx pool internals.
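
For illustration, a rough sketch of the active-object shape suggested above (a simplification, not a proposed patch): a single long-lived goroutine owns the feed sends, and callers such as promoteExecutables would only enqueue batches on a channel instead of spawning their own sender goroutines.

    package txdispatch

    import (
        "github.com/ethereum/go-ethereum/core"
        "github.com/ethereum/go-ethereum/core/types"
        "github.com/ethereum/go-ethereum/event"
    )

    // txEventDispatcher is the "active object": only its loop goroutine
    // ever calls feed.Send, so a slow subscriber blocks at most one goroutine.
    type txEventDispatcher struct {
        feed  event.Feed
        queue chan []*types.Transaction
    }

    func newTxEventDispatcher() *txEventDispatcher {
        d := &txEventDispatcher{queue: make(chan []*types.Transaction, 256)}
        go d.loop()
        return d
    }

    func (d *txEventDispatcher) loop() {
        for txs := range d.queue {
            d.feed.Send(core.NewTxsEvent{Txs: txs})
        }
    }

    // enqueue is what a caller would use instead of spawning `go feed.Send(...)`.
    func (d *txEventDispatcher) enqueue(txs []*types.Transaction) {
        d.queue <- txs
    }

The tradeoff is the one discussed below: if the single sender is blocked, the bounded queue eventually backs up to the callers instead of leaking goroutines.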

@hadv (Contributor) commented Jan 30, 2019

Okay, thank you for the information. As for the code, I'm adding some logging to figure out the issue, so the line numbers may differ from master, but the logic is the same.

@bishaoqing

@holiman Can you please explain why only a non-mining node needs to run the code below? That might be the reason why only non-mining nodes face the goroutine leak issue, right? Thank you!

[quoted: the worker.go snippet from the comment above]

It might want to refresh the pending state so that RPC clients can get the latest pending information, for example the nonce of an account, and users don't have to maintain that information themselves.
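
For context, this is the kind of client call that relies on that refreshed pending state; a sketch using the public ethclient API (endpoint and address are placeholders):

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/ethereum/go-ethereum/common"
        "github.com/ethereum/go-ethereum/ethclient"
    )

    func main() {
        client, err := ethclient.Dial("http://127.0.0.1:8545") // placeholder endpoint
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        account := common.HexToAddress("0x0000000000000000000000000000000000000001") // placeholder
        // eth_getTransactionCount with the "pending" tag: served from the node's
        // pending state, so callers don't have to track nonces themselves.
        nonce, err := client.PendingNonceAt(context.Background(), account)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("next usable nonce:", nonce)
    }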

@hadv (Contributor) commented Jan 30, 2019

It might want to refresh the pending state so that RPC clients can get the latest pending information, for example the nonce of an account, and users don't have to maintain that information themselves.

That much we understand, of course, but the mining and non-mining nodes apply pending transactions in different ways, and the non-mining node's way leaks goroutines.

@fjl (Contributor) commented Jan 30, 2019

I think a simple fix to try would be removing the go keyword on this line: https://github.com/ethereum/go-ethereum/blob/master/core/tx_pool.go#L1000. There should be no downside to this (if I'm reading it right), because promoteExecutables is called from two places: addTx and reset. Removing the goroutine would just mean that the callers of those methods need to wait until the events have been delivered.
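
Roughly what that suggestion would look like at the event-notification site in promoteExecutables (paraphrased from memory of the 1.8.x code, not an exact diff):

    // Before: the event is delivered from a fresh background goroutine,
    // which piles up whenever a subscriber is slow to receive:
    //
    //     go pool.txFeed.Send(NewTxsEvent{promoted})
    //
    // After: deliver synchronously, so callers of addTx/reset simply wait
    // until the event has been handed to all subscribers.
    if len(promoted) > 0 {
        pool.txFeed.Send(NewTxsEvent{promoted})
    }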

@hadv (Contributor) commented Jan 31, 2019

Removing the goroutine would just mean that the callers of those methods need to wait until the events have been delivered.

Yeah, I'm afraid that if Feed.Send() is blocked, then the whole txpool is blocked as well.

@hadv (Contributor) commented Feb 1, 2019

By the way, I think we have enough information on this issue now, so could you remove the need:more-information label? @fjl Thanks!

@0x234 commented Feb 8, 2019

Here's some data from a geth running 1.8.22 over the past week: https://imgur.com/a/2LJpY8U

The first drop was when the node was upgraded from 1.8.21 to 1.8.22. The second was when the node was restarted with mining enabled.

@hadv (Contributor) commented Feb 8, 2019

By the way, I think we have enough information on this issue now, so could you remove the need:more-information label? @fjl Thanks!

@karalabe @holiman @fjl Will there be any update on this issue in the short term?

@hadv (Contributor) commented Feb 8, 2019

@fjl is there any more information you need for this issue?

@no-response bot commented Feb 28, 2019

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have more relevant information or answers to our questions so that we can investigate further.

no-response bot closed this as completed Feb 28, 2019
@hadv (Contributor) commented Mar 1, 2019

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have more relevant information or answers to our questions so that we can investigate further.

@karalabe @fjl @holiman We have already provided detailed information. Please remove the inappropriate label and re-open this issue. Thank you!

@hadv (Contributor) commented Mar 1, 2019

Opened new issue #19192.
