TCP CLOSE_WAIT counts up to 500 and rpc error #8443

Closed
varnson opened this issue Aug 6, 2020 · 4 comments
Labels
type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp

Comments


varnson commented Aug 6, 2020

Overview of the Issue

The TCP CLOSE_WAIT count increases from 0 to 500 within 2 hours, and the Java services report agent read-timeout errors (for service registration and KV gets) in their logs.

Netstat shows the CLOSE_WAIT connections are on the server RPC port (not the LAN, WAN, or HTTP ports).

We have 5 servers, and the error occurred on the leader. A new leader is elected when the error begins, but the CLOSE_WAIT count continues to increase. After restarting the Consul server process, the error disappeared.

After the restart, the goroutine count was only 30% of the goroutine count before the restart.

Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a cluster with 1000 client nodes and 5 server nodes.
  2. Run 1000 Java services, each with its own client node.
  3. After 3 months, the error occurred.
  4. A network router may have been restarted (because of hardware problems) around the time the error began.

Consul info for both Client and Server

Consul version: 1.5.3
OS: Red Hat Linux 7.4

Possible causes:
Consul RPC uses TCP without keepalive, and each connection is handled by one goroutine.
After some network errors, the client has to reconnect to the server, but the old connection (and its goroutine) is not released by the server, so the server ends up handling more and more goroutines.
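For illustration only (this is not Consul's code; the port and the handler are hypothetical), a minimal Go sketch of the one-goroutine-per-connection pattern described above: if the handler blocks on downstream work that never completes, conn.Close() is never reached, so a peer that has already closed its side leaves the socket stuck in CLOSE_WAIT on the server, and the goroutine is never freed.

// Sketch of a one-goroutine-per-connection server that leaks goroutines and
// CLOSE_WAIT sockets when handlers block on work that never finishes.
package main

import (
    "log"
    "net"
)

func main() {
    ln, err := net.Listen("tcp", ":9999") // hypothetical port
    if err != nil {
        log.Fatal(err)
    }
    blocked := make(chan struct{}) // stand-in for a channel that is never drained
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        go func(c net.Conn) {
            defer c.Close() // never reached while the send below blocks
            buf := make([]byte, 4096)
            for {
                if _, err := c.Read(buf); err != nil {
                    return // EOF or error: peer closed, we close too
                }
                blocked <- struct{}{} // blocks forever; goroutine and socket leak
            }
        }(conn)
    }
}

Note that TCP keepalive or read deadlines only help if the goroutine actually returns to the read loop; a goroutine parked elsewhere keeps both itself and the socket alive.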

jsosulska added the type/question label Aug 7, 2020
jsosulska (Contributor) commented Aug 7, 2020

Hello @varnson, thank you for posting!

Can you please take a look at #8435 and see if this is a similar issue?

jsosulska added the waiting-reply label Aug 7, 2020

varnson commented Aug 7, 2020

@jsosulska I don't think it is the same issue as #8435.
First, the goroutine count grows every day, just like a memory leak.
Second, the RPC errors include KV read timeouts and service-register read timeouts, and the KV gets (http://localhost:8500/v1/kv/xxxx) use the TCP protocol, not Serf.

ghost removed the waiting-reply label Aug 7, 2020

varnson commented Aug 17, 2020

@jsosulska I've got the goroutine profile.
Total 21086, raft had 159868 goroutines in:
github.com/hashicorp/raft.(*Raft).GetConfiguration
raft/api.go:736

func (r *Raft) GetConfiguration() ConfigurationFuture {
    configReq := &configurationsFuture{}
    configReq.init()
    select {
    case <-r.shutdownCh:
        configReq.respond(ErrRaftShutdown)
        return configReq
    case r.configurationsCh <- configReq:
        return configReq
    }
}
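Reading the quoted function: every call parks in the select until either r.shutdownCh is closed or the main raft loop receives from r.configurationsCh, so if that loop stalls, each RPC that ends up calling GetConfiguration leaves one goroutine blocked here. A minimal sketch of that shape (illustration only, not the raft code), assuming an unbuffered channel whose receiver has stopped draining it:

// Demonstrates goroutines piling up on a channel send inside a select
// when the receiving loop never drains the channel.
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    requests := make(chan struct{}) // unbuffered, like configurationsCh
    shutdown := make(chan struct{}) // never closed in this sketch

    // Simulate a stalled main loop: nothing ever receives from requests.
    for i := 0; i < 100; i++ {
        go func() {
            select {
            case <-shutdown:
            case requests <- struct{}{}: // blocks forever; one goroutine parked per call
            }
        }()
    }

    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // roughly 100 parked senders plus main
}

Running it, the goroutine count stays at roughly the number of parked senders, which is the same signature as the profile above.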

ausmartway changed the title from "TCP closewait counts up to 500 and rpc error" to "TCP CLOSE_WAIT counts up to 500 and rpc error" Aug 18, 2020

varnson commented Jan 9, 2021

It was fixed by upgrading to raft 1.1.2.

varnson closed this as completed Jan 9, 2021