
I find a bug in event/feed.go #18021

Closed
liuzhijun23 opened this issue Nov 2, 2018 · 23 comments
@liuzhijun23

I read the source code of event/feed.go in go-ethereum version 1.8.17 and I think there is a bug in the Send function. Look at the code around sendLock:
```go
func (f *Feed) Send(value interface{}) (nsent int) {
	// ...
	f.once.Do(f.init)
	// 1. acquire the lock: read from the sendLock channel
	<-f.sendLock

	// ...
	if !f.typecheck(rvalue.Type()) {
		// 2. on panic, release the lock: write to the sendLock channel
		f.sendLock <- struct{}{}
		panic(feedTypeError{op: "Send", got: rvalue.Type(), want: f.etype})
	}

	cases := f.sendCases
	// do the send operation in the for loop
	for {
		// ...
	}
	// 3. release the lock: write to the sendLock channel at the end
	f.sendLock <- struct{}{}
	return nsent
}
```
The bug: if the for loop blocks, step 3 never executes, so the sendLock is never released and the next Send blocks. Maybe there is no need to use the sendLock channel in this function at all.

@fjl
Contributor

fjl commented Nov 5, 2018

I think the send lock is correct because only one send should be running at any time. Send blocks until all channels have received the value.

@liuzhijun23
Author

> I think the send lock is correct because only one send should be running at any time. Send blocks until all channels have received the value.

But if one Send is blocked, all subsequent Sends are blocked too. And I do not understand why the sends have to run sequentially rather than in parallel.

@liuzhijun23
Author

liuzhijun23 commented Nov 8, 2018

I have modified the Send function as follows; it solves the blocking problem:

```go
func (f *Feed) Send(value interface{}) (nsent int) {
	rvalue := reflect.ValueOf(value)

	f.once.Do(f.init)
	//<-f.sendLock

	sendCases := caseList{{Chan: reflect.ValueOf(f.removeSub), Dir: reflect.SelectRecv}}
	sendCases = append(sendCases, f.inbox...)

	// Set the sent value on all channels.
	for i := firstSubSendCase; i < len(sendCases); i++ {
		sendCases[i].Send = rvalue
	}

	// Send until all channels except removeSub have been chosen. 'cases' tracks a prefix
	// of sendCases. When a send succeeds, the corresponding case moves to the end of
	// 'cases' and it shrinks by one element.
	cases := sendCases
	for {
		// Fast path: try sending without blocking before adding to the select set.
		// This should usually succeed if subscribers are fast enough and have free
		// buffer space.
		for i := firstSubSendCase; i < len(cases); i++ {
			if cases[i].Chan.TrySend(rvalue) {
				nsent++
				cases = cases.deactivate(i)
				i--
			}
		}
		if len(cases) == firstSubSendCase {
			break
		}
		// Select on all the receivers, waiting for them to unblock.
		chosen, recv, _ := reflect.Select(cases)
		if chosen == 0 {
			index := f.sendCases.find(recv.Interface())
			f.sendCases = f.sendCases.delete(index)
			if index >= 0 && index < len(cases) {
				// Shrink 'cases' too because the removed case was still active.
				cases = f.sendCases[:len(cases)-1]
			}
		} else {
			cases = cases.deactivate(chosen)
			nsent++
		}
	}

	// Forget about the sent value and hand off the send lock.
	for i := firstSubSendCase; i < len(f.sendCases); i++ {
		f.sendCases[i].Send = reflect.Value{}
	}
	//f.sendLock <- struct{}{}
	return nsent
}
```

The previous Send may still block, but it can no longer block the next Send. I also kept access to shared state to a minimum.

@liuzhijun23
Author

liuzhijun23 commented Nov 8, 2018

The way Send is called also needs care. Look at this function in blockchain.go:

```go
func (bc *BlockChain) PostChainEvents(events []interface{}, logs []*types.Log) {
	log.Info("lzj-log PostChainEvents", "events len", len(events))
	// post event logs for further processing
	if logs != nil {
		bc.logsFeed.Send(logs)
	}
	for _, event := range events {
		switch ev := event.(type) {
		case ChainEvent:
			log.Info("lzj-log send ChainEvent")
			bc.chainFeed.Send(ev)

		case ChainHeadEvent:
			log.Info("lzj-log send ChainHeadEvent")
			bc.chainHeadFeed.Send(ev)

		case ChainSideEvent:
			log.Info("lzj-log send ChainSideEvent")
			bc.chainSideFeed.Send(ev)
		}
	}
}
```

This function sends three kinds of events in the for loop. If an earlier Send blocks, the whole function blocks and the later events are never sent. We can modify the function as follows:

```go
func (bc *BlockChain) PostChainEvents(events []interface{}, logs []*types.Log) {
	log.Info("lzj-log PostChainEvents", "events len", len(events))
	// post event logs for further processing
	if logs != nil {
		bc.logsFeed.Send(logs)
	}
	for _, event := range events {
		switch ev := event.(type) {
		case ChainEvent:
			log.Info("lzj-log send ChainEvent")
			go bc.chainFeed.Send(ev) // lzj: add go

		case ChainHeadEvent:
			log.Info("lzj-log send ChainHeadEvent")
			go bc.chainHeadFeed.Send(ev) // lzj: add go

		case ChainSideEvent:
			log.Info("lzj-log send ChainSideEvent")
			go bc.chainSideFeed.Send(ev) // lzj: add go
		}
	}
}
```

I just run each Send in a separate goroutine, so all the events get sent.

@hadv
Contributor

hadv commented Nov 23, 2018

@liuzhijun23 Can you make a patch or PR so everyone can confirm your fix? Thank you!

@liuzhijun23
Author

liuzhijun23 commented Nov 23, 2018

> @liuzhijun23 Can you make a patch or PR so everyone can confirm your fix? Thank you!

@hadv
The bug can be reproduced as follows:

  1. Build a PoA private blockchain with two nodes: one miner and one normal node that is only responsible for submitting transactions.
  2. Write a script, in JavaScript or Go, that submits transactions in a loop.
  3. Run the script in several terminals to push a large number of transactions to the normal node, at 100+ TPS.
  4. Wait about half an hour and check txpool.status to see whether the pending count keeps growing. When there are several thousand pending transactions in the txpool and eth.getBlock(eth.blockNumber).transactions.length is zero, the transactions are blocked.
  5. Follow the steps above and you should be able to reproduce the problem. The key point is that the transaction rate must be 100+ TPS.

@hadv
Contributor

hadv commented Nov 23, 2018

Yes, I can reproduce it sometimes but still don't know the reason. I tried to fix the problem with your code but it did not work properly. Can you make a pull request to summarize your fix? @liuzhijun23

Thank you!

@liuzhijun23
Author

All my code changes are shown above. The normal node may show three kinds of errors: 1. it does not broadcast transactions to the miner; 2. it does not synchronize blocks from the miner; 3. it does not update the state of pending transactions. The root cause of all of them is that the Feed event is blocked. You can add logs at the points in the source code you are unsure about to trace execution and find the cause. @hadv

@liuzhijun23
Author

This is one of my problem-resolution records (translated from Chinese):

Symptom: during transaction stress testing, txpool.pending showed that the normal node had far too many pending transactions, and they were not cleared even after transaction submission stopped. Querying the chain showed that these transactions had already been packed into blocks successfully; only the transaction pool had not been updated.

Analysis: a pool.chainHeadCh message should fire whenever a new block arrives, but during the pending backlog this message never arrived. Tracing the source code showed that when the fault occurred, the block header message was received but the chainHeadCh event was never sent: the function that emits that event was blocked.

Fix: move each event send in PostChainEvents() into its own goroutine, so that one blocked event cannot prevent the other events from being sent.

@hadv

@hadv
Contributor

hadv commented Nov 24, 2018

Thank you @liuzhijun23

I will investigate more. Even when I use a separate goroutine for sending events in PostChainEvents(), I still face the problem where the whole network cannot mine any new block. The nodes just propagate the transactions between each other, and a new block is only mined when the txpool reaches its maximum size.

@liuzhijun23
Author

@hadv I use PoA consensus with a block period of 5s. Blocks are always mined on the 5s period.

@hadv
Contributor

hadv commented Nov 26, 2018

> @hadv I use PoA consensus with a block period of 5s. Blocks are always mined on the 5s period.

I'm also using PoA, but with a block period of 2s. Normally the network mints a new block every ~2s, but under load there is the case I mentioned above where the whole network gets stuck at one block number and only continues minting once the total pending transactions reach the txpool size, or once I stop sending transactions. It seems all the nodes are busy with transaction execution and postpone block mining. I think there are bugs somewhere.

@hadv
Contributor

hadv commented Nov 29, 2018

Tracing the geth log while the network is stuck, I found that all the miners re-submit mining work continuously at the same block number, which means no new block can be created.

The work is recommitted every 3s by default due to PR #17413:

```
INFO [11-29|06:33:24.128] Commit new mining work                   number=1047 sealhash=6bb9f4…4d42d1 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.603ms
INFO [11-29|06:33:27.129] Commit new mining work                   number=1047 sealhash=e7593a…3a4932 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.724ms
INFO [11-29|06:33:30.129] Commit new mining work                   number=1047 sealhash=e37ff7…4bfdd8 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.626ms
INFO [11-29|06:33:33.129] Commit new mining work                   number=1047 sealhash=74986f…b86977 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.260ms
INFO [11-29|06:33:36.156] Commit new mining work                   number=1047 sealhash=1002f4…a1efc4 uncles=0 txs=508 gas=13984224 fees=0 elapsed=52.010ms
INFO [11-29|06:33:39.130] Commit new mining work                   number=1047 sealhash=e65bfd…ed2197 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.967ms
INFO [11-29|06:33:42.131] Commit new mining work                   number=1047 sealhash=53f369…b9065a uncles=0 txs=508 gas=13984224 fees=0 elapsed=26.823ms
INFO [11-29|06:33:45.130] Commit new mining work                   number=1047 sealhash=56f8ef…b3aa13 uncles=0 txs=508 gas=13984224 fees=0 elapsed=26.023ms
INFO [11-29|06:33:48.130] Commit new mining work                   number=1047 sealhash=6dc3fa…34cd6f uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.824ms
INFO [11-29|06:33:51.131] Commit new mining work                   number=1047 sealhash=ddbfa3…e564a0 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.852ms
INFO [11-29|06:33:54.131] Commit new mining work                   number=1047 sealhash=4e9892…2fe0c1 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.951ms
INFO [11-29|06:33:57.131] Commit new mining work                   number=1047 sealhash=c6daf3…3d3c7a uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.832ms
INFO [11-29|06:34:00.131] Commit new mining work                   number=1047 sealhash=3335db…b6ebd9 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.861ms
INFO [11-29|06:34:03.131] Commit new mining work                   number=1047 sealhash=e21273…679f02 uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.691ms
INFO [11-29|06:34:06.132] Commit new mining work                   number=1047 sealhash=339171…683afb uncles=0 txs=508 gas=13984224 fees=0 elapsed=25.728ms
```

@liuzhijun23
Author

Are there many miners in your blockchain? You can run just one miner node and one normal node to keep it simple.

@hadv
Contributor

hadv commented Nov 29, 2018

> Are there many miners in your blockchain? You can run just one miner node and one normal node to keep it simple.

Yes, I'm running 54 miners. There must be a problem making the miners get stuck in Commit new mining work.

@liuzhijun23
Author

You can simplify the setup first to locate the bug.

@hadv
Contributor

hadv commented Jan 29, 2019

@liuzhijun23 your fix still leaks goroutines on this line, as tested on my private chain:

```go
// Select on all the receivers, waiting for them to unblock.
chosen, recv, _ := reflect.Select(cases)
```

@fjl fjl self-assigned this Jan 29, 2019
@fjl
Contributor

fjl commented Jan 29, 2019

The original issue is about event.Feed. The design of event.Feed is to send values sequentially like a go channel, but to all subscribers. If mining gets stuck for concurrency reasons, please open a new issue.

@fjl fjl closed this as completed Jan 29, 2019
@fjl fjl removed the status:triage label Jan 29, 2019
@hadv
Contributor

hadv commented Jan 29, 2019

The non-mining RPC endpoint nodes have a goroutine leak due to event.Feed, as I mentioned in #17450 @fjl

@hadv
Contributor

hadv commented Jan 29, 2019

> The original issue is about event.Feed. The design of event.Feed is to send values sequentially like a go channel, but to all subscribers. If mining gets stuck for concurrency reasons, please open a new issue.

@fjl #18082 & #18062

@lutianzhou001

@liuzhijun23 I ran into the same problem. I applied your code change, but it still does not work.

@shikhars371

Transactions are getting stuck in the tx pool for more than half an hour. It's a private blockchain I am running, with a single node doing the mining. Please help me.

@hadv
Contributor

hadv commented Jul 12, 2019

> Transactions are getting stuck in the tx pool for more than half an hour. It's a private blockchain I am running, with a single node doing the mining. Please help me.

Try 1.9.0; hopefully it resolves the issue.
