Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: reduce client/channel lock contention #701

Merged
merged 4 commits into from
Jan 7, 2016

Conversation

mreiferson
Copy link
Member

While evaluating a production performance issue I uncovered some
unnecessary lock contention inside the channel and client code
paths.

The ultimate cause of the issue was that we held client-wide,
channel-wide (and ultimately NSQ-wide, see #700) locks during network
operations, and if those operations block it would prevent other
simultaneous, unrelated, operations from making progress.

doRequeue is always called inside exitLock now
@mreiferson mreiferson added the perf label Jan 6, 2016
@mreiferson
Copy link
Member Author

seeing these race detector failures:

==================
WARNING: DATA RACE
Read by goroutine 34:
  github.com/nsqio/nsq/nsqd.TestHTTPpubDefer()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/http_test.go:180 +0x966
  testing.tRunner()
      /usr/local/Cellar/go/1.6beta1/libexec/src/testing/testing.go:466 +0xdc

Previous write by goroutine 46:
  runtime.mapassign1()
      /usr/local/Cellar/go/1.6beta1/libexec/src/runtime/hashmap.go:429 +0x0
  github.com/nsqio/nsq/nsqd.(*Channel).pushDeferredMessage()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/channel.go:515 +0x2f5
  github.com/nsqio/nsq/nsqd.(*Channel).StartDeferredTimeout()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/channel.go:439 +0x202
  github.com/nsqio/nsq/nsqd.(*Topic).messagePump()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/topic.go:278 +0x62f
  github.com/nsqio/nsq/nsqd.NewTopic.func1()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/topic.go:66 +0x2d
  github.com/nsqio/nsq/internal/util.(*WaitGroupWrapper).Wrap.func1()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/internal/util/wait_group_wrapper.go:14 +0x2a

Goroutine 34 (running) created at:
  testing.RunTests()
      /usr/local/Cellar/go/1.6beta1/libexec/src/testing/testing.go:575 +0xaaf
  testing.(*M).Run()
      /usr/local/Cellar/go/1.6beta1/libexec/src/testing/testing.go:508 +0x11d
  main.main()
      github.com/nsqio/nsq/nsqd/_test/_testmain.go:356 +0x210

Goroutine 46 (running) created at:
  github.com/nsqio/nsq/internal/util.(*WaitGroupWrapper).Wrap()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/internal/util/wait_group_wrapper.go:16 +0x6a
  github.com/nsqio/nsq/nsqd.NewTopic()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/topic.go:66 +0x5df
  github.com/nsqio/nsq/nsqd.(*NSQD).GetTopic()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/nsqd.go:443 +0x243
  github.com/nsqio/nsq/nsqd.TestHTTPpubDefer()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/http_test.go:166 +0x2e2
  testing.tRunner()
      /usr/local/Cellar/go/1.6beta1/libexec/src/testing/testing.go:466 +0xdc
==================
==================
WARNING: DATA RACE
Read by goroutine 43:
  github.com/nsqio/nsq/nsqd.TestDPUB()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/protocol_v2_test.go:584 +0xbab
  testing.tRunner()
      /usr/local/Cellar/go/1.6beta1/libexec/src/testing/testing.go:466 +0xdc

Previous write by goroutine 33:
  runtime.mapassign1()
      /usr/local/Cellar/go/1.6beta1/libexec/src/runtime/hashmap.go:429 +0x0
  github.com/nsqio/nsq/nsqd.(*Channel).pushDeferredMessage()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/channel.go:515 +0x2f5
  github.com/nsqio/nsq/nsqd.(*Channel).StartDeferredTimeout()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/channel.go:439 +0x202
  github.com/nsqio/nsq/nsqd.(*Topic).messagePump()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/topic.go:278 +0x62f
  github.com/nsqio/nsq/nsqd.NewTopic.func1()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/topic.go:66 +0x2d
  github.com/nsqio/nsq/internal/util.(*WaitGroupWrapper).Wrap.func1()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/internal/util/wait_group_wrapper.go:14 +0x2a

Goroutine 43 (running) created at:
  testing.RunTests()
      /usr/local/Cellar/go/1.6beta1/libexec/src/testing/testing.go:575 +0xaaf
  testing.(*M).Run()
      /usr/local/Cellar/go/1.6beta1/libexec/src/testing/testing.go:508 +0x11d
  main.main()
      github.com/nsqio/nsq/nsqd/_test/_testmain.go:356 +0x210

Goroutine 33 (running) created at:
  github.com/nsqio/nsq/internal/util.(*WaitGroupWrapper).Wrap()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/internal/util/wait_group_wrapper.go:16 +0x6a
  github.com/nsqio/nsq/nsqd.NewTopic()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/topic.go:66 +0x5df
  github.com/nsqio/nsq/nsqd.(*NSQD).GetTopic()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/nsqd.go:443 +0x243
  github.com/nsqio/nsq/nsqd.(*protocolV2).SUB()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/protocol_v2.go:598 +0xeaa
  github.com/nsqio/nsq/nsqd.(*protocolV2).Exec()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/protocol_v2.go:197 +0x1040
  github.com/nsqio/nsq/nsqd.(*protocolV2).IOLoop()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/protocol_v2.go:86 +0xb45
  github.com/nsqio/nsq/nsqd.(*tcpServer).Handle()
      /Users/mreiferson/dev/src/github.com/nsqio/nsq/nsqd/tcp.go:43 +0x700
==================

@mreiferson
Copy link
Member Author

this is exciting!

just as we don't need to lock the whole instance for the lifetime
of the stats gathering, we don't need to lock the topic for the
lifetime of the channel stats gathering
@mreiferson mreiferson force-pushed the lock_contention_701 branch from d887f43 to 48b2872 Compare January 7, 2016 01:03
@@ -214,7 +215,7 @@ func (c *clientV2) Identify(data identifyDataV2) error {
}

func (c *clientV2) Stats() ClientStats {
c.RLock()
c.metaLock.RLock()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this was ultimately the most problematic case.

Previously, this was competing with network writes to this client here, so when that blocked, this blocked. It's called as part of the underlying code path for /stats here. Prior to #700, the body of nsqd.GetStats() would hold an nsqd-wide lock for the entire lifetime of the function call, meaning that a single blocking client could effectively block any operations that required a top-level nsqd-wide lock, or a topic/channel-level lock for whichever the misbehaving client was subscribed to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. this has the really nice side-effect of making /stats requests really fast, regardless of nsqd load, addressing long-standing issues with nsqadmin timing out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay!

@mreiferson
Copy link
Member Author

green light RFR @jehiah

jehiah added a commit that referenced this pull request Jan 7, 2016
nsqd: reduce client/channel lock contention
@jehiah jehiah merged commit bc0a6b9 into nsqio:master Jan 7, 2016
@mreiferson mreiferson deleted the lock_contention_701 branch January 7, 2016 04:09
@jehiah
Copy link
Member

jehiah commented Jan 7, 2016

@zachbadgett thanks for the follow up review on this!

@zachbadgett
Copy link
Contributor

Of course :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants