
nsqd: ephemeral channel race condition with reconnecting client #886

Closed
slayercat wants to merge 1 commit from the fix-issue-883 branch

Conversation

@slayercat commented Apr 17, 2017

It's possible for a client reconnecting quickly and subscribed to an ephemeral channel to race with nsqd's cleanup of said ephemeral channel, as documented in nsqio/go-nsq#206.

Fixes #883

@slayercat (Author)

The pem file used in the tests is outdated, so the check failed.

@mreiferson (Member)

I've got to think through this one, thanks for the fix though @slayercat!

I'll also fix the TLS certs...

@mreiferson mreiferson changed the title when topic destroying. make nsqd won't assign any new client into it nsqd: ephemeral channel race condition with reconnecting client Apr 17, 2017
@mreiferson mreiferson added the bug label Apr 17, 2017
@mreiferson (Member)

Fixed TLS certs in 932b3fa

@slayercat (Author)

I've just pushed a new commit to keep the scope of the global lock as small as possible.

@ploxiln (Member) commented Apr 18, 2017

The comment is now out-of-date:

// we do this before removing the topic from map below (with no lock)
// so that any incoming writes will error and not create a new topic

I haven't thought enough about whether we still need that ordering ... @mreiferson would know better than me anyway

@slayercat (Author)

Commit a0a6ce6 introduced that change.

I'm not sure what this commit was for, but it seems to make sense to prevent clients from connecting to a topic while it is being destroyed.

=====

commit a0a6ce6
Author: Matt Reiferson [email protected]
Date: Thu Apr 25 13:51:14 2013 -0400

guarantee ordering of deletion when topic is live

@slayercat (Author)

How about this?

@ploxiln @mreiferson

diff --git a/nsqd/protocol_v2.go b/nsqd/protocol_v2.go
index 02ba0da..d4536ba 100644
--- a/nsqd/protocol_v2.go
+++ b/nsqd/protocol_v2.go
@@ -616,6 +616,10 @@ func (p *protocolV2) SUB(client *clientV2, params [][]byte) ([]byte, error) {
        }
 
        topic := p.ctx.nsqd.GetTopic(topicName)
+       if topic.Exiting() {
+               return nil, protocol.NewFatalClientErr(nil, "E_BAD_TOPIC",
+                       fmt.Sprintf("SUB topic %q exiting", topicName))
+       }
        channel := topic.GetChannel(channelName)
        channel.AddClient(client.ID, client)


@mreiferson (Member)

OK, I poked around at this. I think nsqd.DeleteExistingTopic() is correct and should not be modified. I think the race is actually between the three separate calls in https://github.com/nsqio/nsq/blob/master/nsqd/protocol_v2.go#L618-L620.

Perhaps what we need is a function at the topic level that acquires a lock and performs these three actions?
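
Roughly, such a topic-level helper might look like this sketch. AddClientToChannel is a hypothetical name, not nsqd's API; the parameter types and the getOrCreateChannel internals are assumptions, and the message-pump notification that GetChannel performs for newly created channels is elided.

func (t *Topic) AddClientToChannel(channelName string, clientID int64, client Consumer) (*Channel, error) {
        t.Lock()
        defer t.Unlock()

        // refuse new subscribers once the topic has started exiting
        if t.Exiting() {
                return nil, errors.New("topic is exiting")
        }

        // same lookup GetChannel() performs, but under the lock we already hold
        channel, _ := t.getOrCreateChannel(channelName)
        channel.AddClient(clientID, client)
        return channel, nil
}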

@slayercat slayercat force-pushed the fix-issue-883 branch 2 times, most recently from 0eb302c to 5bcb654 on May 2, 2017 05:14
@slayercat (Author)

Pushed a new version for review.

@mreiferson

@ploxiln (Member) commented May 2, 2017

For a sleep-retry loop, one microsecond is too short. A context switch can take a couple of microseconds. I suggest a millisecond. (Maybe even 10ms).

I think this does not quite prevent the race, though it might narrow it down significantly. It's still theoretically possible for the channel to start exiting between the check Channel.Exiting() and Channel.AddClient(), or between Topic.Exiting() and Topic.GetChannel().

I have an idea which may help: check for Channel.Exiting() in Channel.AddClient(), after taking the channel lock, before adding the client. If exiting, return error, and do the sleep/retry loop. Why this would work is a bit tricky: if the exit flag is not yet set, Channel.exit() has not yet closed all the clients, which it does with the channel lock held. It cannot start closing them while AddClient() holds the channel lock. So this guarantees that if there is a race with exit(), it will close this client, which should reconnect.
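
A sketch of what that check might look like (illustrative only, not an actual nsqd change; the clients map and parameter types follow nsqd's channel.go but are used here as assumptions):

func (c *Channel) AddClient(clientID int64, client Consumer) error {
        c.Lock()
        defer c.Unlock()

        // per the reasoning above: exit() closes existing clients while holding
        // this same lock, so if the exit flag is not yet set here, any later
        // exit() will also close the client we add now
        if c.Exiting() {
                // the caller (SUB) would sleep briefly and retry
                return errors.New("channel is exiting")
        }
        c.clients[clientID] = client
        return nil
}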

The Topic.exit() race is trickier, I need to think about that more ...

@ploxiln (Member) commented May 2, 2017

I take back the suggestion I wrote in the second half of my comment, I don't think that sufficiently improves the current PR.

Maybe the best way, without changing too much of the locking and structure of Topic and Channel, is to (see the sketch after this list):

  1. GetTopic(), GetChannel(), AddClient()
  2. Check topic.Exiting() and channel.Exiting()
  3. only if exiting, try to remove client from channel
  4. only if exiting and ephemeral and not tombstoned, sleep and loop

EDIT: tombstone check might not be relevant here
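
A sketch of how that flow could look in protocolV2.SUB; the retry bound, sleep interval, and use of the unexported ephemeral fields are illustrative assumptions, not the actual fix:

var channel *Channel
for i := 1; ; i++ {
        // 1. the same three calls as today
        topic := p.ctx.nsqd.GetTopic(topicName)
        channel = topic.GetChannel(channelName)
        channel.AddClient(client.ID, client)

        // 2. check whether we raced with ephemeral cleanup
        if !topic.Exiting() && !channel.Exiting() {
                break // registered on a live channel
        }

        // 3. exiting: back out of the doomed channel
        channel.RemoveClient(client.ID)

        // 4. only ephemeral topics/channels are recreated on demand, so only they retry
        if !(topic.ephemeral || channel.ephemeral) || i >= 5 {
                return nil, protocol.NewFatalClientErr(nil, "E_SUB_FAILED",
                        fmt.Sprintf("SUB failed to subscribe to %s %s", topicName, channelName))
        }
        time.Sleep(time.Millisecond)
}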

@mreiferson (Member)

I think @ploxiln's last proposal sounds like a reasonable "lock-free" approach. My original proposal still stands, which is to promote these 3 independent operations to a single operation at either the topic or channel level.

@ploxiln (Member) commented May 4, 2017

I think the "correct locking" option, "promote to a single operation", would require a single whole-NSQD mutex to be held whenever a consumer connects to, disconnects from, or is cleaned up on any ephemeral channel.

If the lock is per-channel, in the Channel struct, the new consumer could get the channel and try to lock it just after the previous consumer locks it. The new consumer will finally get the lock after the Channel is already exited (just not yet garbage collected), and will have to loop and retry, hoping to get a brand new ephemeral Channel with the same name.

If the lock is per-topic, in the Topic struct, the same logic applies if the topic is also ephemeral - a new consumer might get the topic and try to lock it, but only actually gain ownership of the lock after the topic is already doomed, and will need to loop and retry.

Thus, a whole-NSQD "ephemeral lock" would need to be held, so that a consumer connection cannot even call GetTopic() for an ephemeral topic at the same time as another consumer connection calls RemoveClient() on such a topic.
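
Sketched roughly (ephemeralMtx is a hypothetical field on the NSQD struct, not actual code), both paths would hold that lock for ephemeral names only:

// consumer connect path (protocolV2.SUB), ephemeral topic/channel names only
p.ctx.nsqd.ephemeralMtx.Lock()
topic := p.ctx.nsqd.GetTopic(topicName)
channel := topic.GetChannel(channelName)
channel.AddClient(client.ID, client)
p.ctx.nsqd.ephemeralMtx.Unlock()

// consumer disconnect / cleanup path, ephemeral names only
p.ctx.nsqd.ephemeralMtx.Lock()
channel.RemoveClient(client.ID) // may trigger deletion of the ephemeral channel/topic
p.ctx.nsqd.ephemeralMtx.Unlock()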

@ploxiln (Member) commented May 4, 2017

I don't think my scheme would require consumers connecting to non-ephemeral topics and channels to take any such lock, because non-ephemeral topics and channels are only deliberately deleted by the operator, and immediate resurrection is not appropriate for them.

@slayercat (Author)

#891 fixes the issue.
