
I find a bug in event/feed.go #18021

Closed
liuzhijun23 opened this issue Nov 2, 2018 · 23 comments
@liuzhijun23

I read the source code of event/feed.go in go-ethereum version 1.8.17 and I think there is a bug in the Send function. Look at the code around sendLock:
```go
func (f *Feed) Send(value interface{}) (nsent int) {
	// ...
	f.once.Do(f.init)
	// 1. acquire the lock: read from the sendLock channel
	<-f.sendLock

	// ...
	if !f.typecheck(rvalue.Type()) {
		// 2. on panic, release the lock: write to the sendLock channel
		f.sendLock <- struct{}{}
		panic(feedTypeError{op: "Send", got: rvalue.Type(), want: f.etype})
	}

	cases := f.sendCases
	// do the send operation in the for loop
	for {
		// ...
	}
	// 3. release the lock: write to the sendLock channel at the end
	f.sendLock <- struct{}{}
	return nsent
}
```
The bug: if the for loop blocks, step 3 never executes, so the sendLock is never released and the next Send blocks. Maybe there is no need to use the sendLock channel in this function at all.

@fjl
Contributor

fjl commented Nov 5, 2018

I think the send lock is correct because only one send should be running at any time. Send blocks until all channels have received the value.

@liuzhijun23
Author

> I think the send lock is correct because only one send should be running at any time. Send blocks until all channels have received the value.

But if one Send is blocked, all subsequent Sends are blocked too. And I do not understand why the sends have to run sequentially rather than in parallel.

@liuzhijun23
Author

liuzhijun23 commented Nov 8, 2018

I have modified the Send function as follows; it solves the blocking problem:

```go
func (f *Feed) Send(value interface{}) (nsent int) {
	rvalue := reflect.ValueOf(value)

	f.once.Do(f.init)
	//<-f.sendLock

	sendCases := caseList{{Chan: reflect.ValueOf(f.removeSub), Dir: reflect.SelectRecv}}
	sendCases = append(sendCases, f.inbox...)

	// Set the sent value on all channels.
	for i := firstSubSendCase; i < len(sendCases); i++ {
		sendCases[i].Send = rvalue
	}

	// Send until all channels except removeSub have been chosen. 'cases' tracks a prefix
	// of sendCases. When a send succeeds, the corresponding case moves to the end of
	// 'cases' and it shrinks by one element.
	cases := sendCases
	for {
		// Fast path: try sending without blocking before adding to the select set.
		// This should usually succeed if subscribers are fast enough and have free
		// buffer space.
		for i := firstSubSendCase; i < len(cases); i++ {
			if cases[i].Chan.TrySend(rvalue) {
				nsent++
				cases = cases.deactivate(i)
				i--
			}
		}
		if len(cases) == firstSubSendCase {
			break
		}
		// Select on all the receivers, waiting for them to unblock.
		chosen, recv, _ := reflect.Select(cases)
		if chosen == 0 {
			index := f.sendCases.find(recv.Interface())
			f.sendCases = f.sendCases.delete(index)
			if index >= 0 && index < len(cases) {
				// Shrink 'cases' too because the removed case was still active.
				cases = f.sendCases[:len(cases)-1]
			}
		} else {
			cases = cases.deactivate(chosen)
			nsent++
		}
	}

	// Forget about the sent value and hand off the send lock.
	for i := firstSubSendCase; i < len(f.sendCases); i++ {
		f.sendCases[i].Send = reflect.Value{}
	}
	//f.sendLock <- struct{}{}
	return nsent
}
```

The previous Send may still block, but it can no longer block the next Send. I also kept access to shared state to a minimum.

@liuzhijun23
Author

liuzhijun23 commented Nov 8, 2018

The way Send is called also needs care. Look at this function in blockchain.go:

```go
func (bc *BlockChain) PostChainEvents(events []interface{}, logs []*types.Log) {
	log.Info("lzj-log PostChainEvents", "events len", len(events))
	// post event logs for further processing
	if logs != nil {
		bc.logsFeed.Send(logs)
	}
	for _, event := range events {
		switch ev := event.(type) {
		case ChainEvent:
			log.Info("lzj-log send ChainEvent")
			bc.chainFeed.Send(ev)

		case ChainHeadEvent:
			log.Info("lzj-log send ChainHeadEvent")
			bc.chainHeadFeed.Send(ev)

		case ChainSideEvent:
			log.Info("lzj-log send ChainSideEvent")
			bc.chainSideFeed.Send(ev)
		}
	}
}
```

This function sends three kinds of events in the for loop. If an earlier Send blocks, the whole function blocks and the later events are never sent. We can modify the function as follows:

```go
func (bc *BlockChain) PostChainEvents(events []interface{}, logs []*types.Log) {
	log.Info("lzj-log PostChainEvents", "events len", len(events))
	// post event logs for further processing
	if logs != nil {
		bc.logsFeed.Send(logs)
	}
	for _, event := range events {
		switch ev := event.(type) {
		case ChainEvent:
			log.Info("lzj-log send ChainEvent")
			go bc.chainFeed.Send(ev) // lzj: add go

		case ChainHeadEvent:
			log.Info("lzj-log send ChainHeadEvent")
			go bc.chainHeadFeed.Send(ev) // lzj: add go

		case ChainSideEvent:
			log.Info("lzj-log send ChainSideEvent")
			go bc.chainSideFeed.Send(ev) // lzj: add go
		}
	}
}
```

I just run each Send in a separate goroutine, so all the events get sent.

@hadv
Contributor

hadv commented Nov 23, 2018

@liuzhijun23 Can you make a patch or PR so everyone can confirm your fix? Thank you!

@liuzhijun23
Author

liuzhijun23 commented Nov 23, 2018

> @liuzhijun23 Can you make a patch or PR so everyone can confirm your fix? Thank you!

@hadv
The bug can be reproduced as follows:

  1. Build a PoA private blockchain with two nodes: one miner and one normal node that is only responsible for submitting transactions.
  2. Write a script, in JavaScript or Go, that submits transactions in a loop.
  3. Run the script in several terminals to push a large number of transactions to the normal node, at 100+ TPS.
  4. Wait about half an hour and check txpool.status to see whether the pending count keeps growing. When there are several thousand pending transactions in the txpool and eth.getBlock(eth.blockNumber).transactions.length is zero, the transactions are blocked.
  5. Follow the steps above and you should be able to reproduce the problem. The key point is that the transaction rate must be 100+ TPS.

@hadv
Contributor

hadv commented Nov 23, 2018

Yes, I can reproduce it sometimes but still don't know the reason. I tried to fix the problem with your code but it did not work properly. Can you make a pull request to summarize your fix? @liuzhijun23

Thank you!

@liuzhijun23
Author

All my code changes are shown above. The normal node may show three kinds of errors: 1. it does not broadcast transactions to the miner; 2. it does not synchronize blocks from the miner; 3. it does not update the state of pending transactions. The root cause of all of them is that the Feed event is blocked. You can add logs at the points in the source code you are unsure about to trace execution and find the cause. @hadv

@liuzhijun23
Author

This is one of my problem-resolution records (translated from Chinese):

Symptom: during transaction stress testing, txpool.pending showed that the normal node had far too many pending transactions, and they were not cleared even after transaction submission stopped. Querying the chain showed that these transactions had already been packed into blocks successfully; only the transaction pool had not been updated.

Analysis: a pool.chainHeadCh message should fire whenever a new block arrives, but during the pending backlog this message never arrived. Tracing the source code showed that when the fault occurred, the block header message was received but the chainHeadCh event was never sent: the function that emits that event was blocked.

Fix: move each event send in PostChainEvents() into its own goroutine, so that one blocked event cannot prevent the other events from being sent.

@hadv

@hadv
Contributor

hadv commented Nov 24, 2018

Thank you @liuzhijun23

I will investigate more. Even when I use a separate goroutine for sending events in PostChainEvents(), I still face the problem where the whole network cannot mine any new block. The nodes just propagate the transactions between each other, and a new block is only mined when the txpool reaches its maximum size.

@liuzhijun23
Author

@hadv I use PoA consensus with a block period of 5s. Blocks are always mined on the 5s period.

@hadv
Contributor

hadv commented Nov 26, 2018

> @hadv I use PoA consensus with a block period of 5s. Blocks are always mined on the 5s period.

I'm also using PoA, but with a block period of 2s. Normally the network mints a new block every ~2s, but under load there is the case I mentioned above where the whole network gets stuck at one block number and only continues minting once the total pending transactions reach the txpool size, or once I stop sending transactions. It seems all the nodes are busy with transaction execution and postpone block mining. I think there are bugs somewhere.

@hadv
Contributor

hadv commented Nov 29, 2018

Tracing the geth log while the network is stuck, I found that all the miners re-submit mining work continuously at the same block number, which means no new block can be created.

The work is recommitted every 3s by default due to PR #17413:

```
INFO [11-29|06:33:24.128] Commit new mining work                   number=1047 sealhash=6bb9f4…4d42d1 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.603ms
INFO [11-29|06:33:27.129] Commit new mining work                   number=1047 sealhash=e7593a…3a4932 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.724ms
INFO [11-29|06:33:30.129] Commit new mining work                   number=1047 sealhash=e37ff7…4bfdd8 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.626ms
INFO [11-29|06:33:33.129] Commit new mining work                   number=1047 sealhash=74986f…b86977 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.260ms
INFO [11-29|06:33:36.156] Commit new mining work                   number=1047 sealhash=1002f4…a1efc4 uncles=0 txs=508 gas=13984224 fees=0 elapsed=52.010ms
INFO [11-29|06:33:39.130] Commit new mining work                   number=1047 sealhash=e65bfd…ed2197 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.967ms
INFO [11-29|06:33:42.131] Commit new mining work                   number=1047 sealhash=53f369…b9065a uncles=0 txs=508 gas=13984224 fees=0 elapsed=26.823ms
INFO [11-29|06:33:45.130] Commit new mining work                   number=1047 sealhash=56f8ef…b3aa13 uncles=0 txs=508 gas=13984224 fees=0 elapsed=26.023ms
INFO [11-29|06:33:48.130] Commit new mining work                   number=1047 sealhash=6dc3fa…34cd6f uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.824ms
INFO [11-29|06:33:51.131] Commit new mining work                   number=1047 sealhash=ddbfa3…e564a0 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.852ms
INFO [11-29|06:33:54.131] Commit new mining work                   number=1047 sealhash=4e9892…2fe0c1 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.951ms
INFO [11-29|06:33:57.131] Commit new mining work                   number=1047 sealhash=c6daf3…3d3c7a uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.832ms
INFO [11-29|06:34:00.131] Commit new mining work                   number=1047 sealhash=3335db…b6ebd9 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.861ms
INFO [11-29|06:34:03.131] Commit new mining work                   number=1047 sealhash=e21273…679f02 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.691ms
INFO [11-29|06:34:06.132] Commit new mining work                   number=1047 sealhash=339171…683afb uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.728ms
```

@liuzhijun23
Author

Are there many miners in your blockchain? You can run just one miner node and one normal node to keep it simple.

@hadv
Contributor

hadv commented Nov 29, 2018

> Are there many miners in your blockchain? You can run just one miner node and one normal node to keep it simple.

Yes, I'm running 54 miners. There must be a problem making the miners get stuck in Commit new mining work.

@liuzhijun23
Author

You can simplify the setup first to locate the bug.

@hadv
Contributor

hadv commented Jan 29, 2019

@liuzhijun23 your fix still leaks goroutines on this line, as tested on my private chain:

```go
// Select on all the receivers, waiting for them to unblock.
chosen, recv, _ := reflect.Select(cases)
```

@fjl fjl self-assigned this Jan 29, 2019
@fjl
Contributor

fjl commented Jan 29, 2019

The original issue is about event.Feed. The design of event.Feed is to send values sequentially like a go channel, but to all subscribers. If mining gets stuck for concurrency reasons, please open a new issue.

@fjl fjl closed this as completed Jan 29, 2019
@fjl fjl removed the status:triage label Jan 29, 2019
@hadv
Contributor

hadv commented Jan 29, 2019

The non-mining RPC endpoint nodes have a goroutine leak due to event.Feed, as I mentioned in #17450 @fjl

@hadv
Contributor

hadv commented Jan 29, 2019

> The original issue is about event.Feed. The design of event.Feed is to send values sequentially like a go channel, but to all subscribers. If mining gets stuck for concurrency reasons, please open a new issue.

@fjl #18082 & #18062

@lutianzhou001

@liuzhijun23 I ran into the same problem. I applied your code change, but it still does not work.

@shikhars371

Transactions are getting stuck in the tx pool for more than half an hour. It's a private blockchain I am running, with a single node doing the mining. Please help me.

@hadv
Contributor

hadv commented Jul 12, 2019

> Transactions are getting stuck in the tx pool for more than half an hour. It's a private blockchain I am running, with a single node doing the mining. Please help me.

Try 1.9.0; hopefully it resolves the issue.
