Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: fix ephemeral channel sub live lock #1314

Merged
merged 8 commits into from
Apr 4, 2021

Conversation

mreiferson
Copy link
Member

Fixes #1251

I noticed a few things while debugging this...

The reproducible steps from #1251 caused file descriptor exhaustion in nsqd on my macOS laptop (default configuration) The stacktrace from nsqd confirmed what had previously been reported, a ton of contention on nsqd.RWMutex, but I noticed this in particular:

syscall.syscall(0x10b9ea0, 0xb, 0x33, 0x0, 0x0, 0x0, 0x0)
	/usr/local/Cellar/go/1.15.2/libexec/src/runtime/sys_darwin.go:63 +0x2e
syscall.fcntl(0xb, 0x33, 0x0, 0xc001697c38, 0x10d7df7, 0xc00017b620)
	/usr/local/Cellar/go/1.15.2/libexec/src/syscall/zsyscall_darwin_amd64.go:338 +0x58
internal/poll.(*FD).Fsync(0xc00017b620, 0x0, 0x0)
	/usr/local/Cellar/go/1.15.2/libexec/src/internal/poll/fd_fsync_darwin.go:18 +0xa8
os.(*File).Sync(0xc000318570, 0xc00060fb00, 0x25)
	/usr/local/Cellar/go/1.15.2/libexec/src/os/file_posix.go:158 +0x4c
github.com/nsqio/nsq/nsqd.writeSyncFile(0xc001281060, 0x20, 0xc00060fb00, 0x25, 0x30, 0xc001281060, 0x20)
	/Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/nsqd.go:322 +0xff
github.com/nsqio/nsq/nsqd.(*NSQD).PersistMetadata(0xc0001b6000, 0xc001b43f18, 0x2)
	/Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/nsqd.go:415 +0xb2b
github.com/nsqio/nsq/nsqd.(*NSQD).Notify.func1()
	/Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/nsqd.go:575 +0x133
github.com/nsqio/nsq/internal/util.(*WaitGroupWrapper).Wrap.func1(0xc000f843c0, 0xc0001b6100)
	/Users/mreiferson/dev/src/github.com/nsqio/nsq/internal/util/wait_group_wrapper.go:14 +0x27
created by github.com/nsqio/nsq/internal/util.(*WaitGroupWrapper).Wrap
	/Users/mreiferson/dev/src/github.com/nsqio/nsq/internal/util/wait_group_wrapper.go:13 +0x65

This was the goroutine ultimately holding the lock. We don't need to be persisting metadata for ephemeral topics/channels, so after changing that it "fixed" the issue in my testing. It would still be possible to construct a pathological test to cause this, so I additionally limited the retry loop and increased the timeout (which, IMO, is the "correct" behavior here anyway).

I fixed a few other things while debugging...

nsqd/topic.go Outdated Show resolved Hide resolved
nsqd/topic.go Show resolved Hide resolved
@mreiferson mreiferson force-pushed the ephemeral-live-lock-1251 branch 3 times, most recently from 1a3a6f6 to adc3e0d Compare March 6, 2021 18:38
@mreiferson mreiferson force-pushed the ephemeral-live-lock-1251 branch from 539d64b to ec2c44f Compare March 21, 2021 23:35
@mreiferson mreiferson requested a review from ploxiln March 22, 2021 00:44
@mreiferson
Copy link
Member Author

@ploxiln review this if you're reviewing things 😏

}

if (channel.ephemeral && channel.Exiting()) || (topic.ephemeral && topic.Exiting()) {
channel.RemoveClient(client.ID)
time.Sleep(1 * time.Millisecond)
continue
if i < 2 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only two attempts ... could do a couple more to be "safe" ... but ephemeral channel clients can retry too, and already give up guarantees. ok, makes sense

return nil, protocol.NewFatalClientErr(nil, "E_TOO_MANY_CHANNEL_CONSUMERS",
fmt.Sprintf("channel consumers for %s:%s exceeds limit of %d",
topicName, channelName, p.nsqd.getOpts().MaxChannelConsumers))
return nil, protocol.NewFatalClientErr(err, "E_SUB_FAILED", "SUB failed "+err.Error())
Copy link
Member

@ploxiln ploxiln Mar 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, now this is where the exiting will be checked, and we won't retry for ephemeral channels exiting as intended

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine, if we looked up a channel that is indeed exiting we want this to fail here.

@mreiferson mreiferson force-pushed the ephemeral-live-lock-1251 branch from ec2c44f to 62d202f Compare March 29, 2021 02:07
@mreiferson
Copy link
Member Author

@jehiah last chance before I land this 😏

Copy link
Member

@jehiah jehiah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I've not reviewed exhaustively, but this LGTM

@mreiferson mreiferson merged commit 619ac10 into nsqio:master Apr 4, 2021
@mreiferson mreiferson deleted the ephemeral-live-lock-1251 branch April 4, 2021 02:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nsqd: ephemeral topic/channel churn can live-lock
3 participants