
nsqd: ephemeral channel race condition with reconnecting client #886

Closed
slayercat wants to merge 1 commit from the fix-issue-883 branch

Conversation

@slayercat commented Apr 17, 2017

It's possible for a client reconnecting quickly and subscribed to an ephemeral channel to race with nsqd's cleanup of said ephemeral channel, as documented in nsqio/go-nsq#206.

Fixes #883

@slayercat (Author)

The pem file used in the tests is outdated, so the check failed.

@mreiferson (Member)

I've got to think through this one, thanks for the fix though @slayercat!

I'll also fix the TLS certs...

@mreiferson mreiferson changed the title when topic destroying. make nsqd won't assign any new client into it nsqd: ephemeral channel race condition with reconnecting client Apr 17, 2017
@mreiferson mreiferson added the bug label Apr 17, 2017
@mreiferson (Member)

Fixed TLS certs in 932b3fa

@slayercat (Author)

I've just pushed a new commit to keep the scope of the global lock as small as possible.

@ploxiln (Member) commented Apr 18, 2017

The comment is now out-of-date:

// we do this before removing the topic from map below (with no lock)
// so that any incoming writes will error and not create a new topic

I haven't thought enough about whether we still need that ordering ... @mreiferson would know better than me anyway

@slayercat (Author)

Commit a0a6ce6 introduced that change.

I'm not sure what this commit was for, but it seems to make sense to prevent clients from connecting to a topic while it is being destroyed.

=====

commit a0a6ce6
Author: Matt Reiferson [email protected]
Date: Thu Apr 25 13:51:14 2013 -0400

guarantee ordering of deletion when topic is live

@slayercat (Author)

How about this?

@ploxiln @mreiferson

diff --git a/nsqd/protocol_v2.go b/nsqd/protocol_v2.go
index 02ba0da..d4536ba 100644
--- a/nsqd/protocol_v2.go
+++ b/nsqd/protocol_v2.go
@@ -616,6 +616,10 @@ func (p *protocolV2) SUB(client *clientV2, params [][]byte) ([]byte, error) {
        }
 
        topic := p.ctx.nsqd.GetTopic(topicName)
+       if topic.Exiting() {
+               return nil, protocol.NewFatalClientErr(nil, "E_BAD_TOPIC",
+                       fmt.Sprintf("SUB topic %q exiting", topicName))
+       }
        channel := topic.GetChannel(channelName)
        channel.AddClient(client.ID, client)


@mreiferson (Member)

OK, I poked around at this. I think nsqd.DeleteExistingTopic() is correct and should not be modified. I think the race is actually between the three separate calls in https://github.com/nsqio/nsq/blob/master/nsqd/protocol_v2.go#L618-L620.

Perhaps what we need is a function at the topic level that acquires a lock and performs these three actions?
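
Roughly, such a topic-level helper might look like this sketch. AddClientToChannel is a hypothetical name, not nsqd's API; the parameter types and the getOrCreateChannel internals are assumptions, and the message-pump notification that GetChannel performs for newly created channels is elided.

func (t *Topic) AddClientToChannel(channelName string, clientID int64, client Consumer) (*Channel, error) {
        t.Lock()
        defer t.Unlock()

        // refuse new subscribers once the topic has started exiting
        if t.Exiting() {
                return nil, errors.New("topic is exiting")
        }

        // same lookup GetChannel() performs, but under the lock we already hold
        channel, _ := t.getOrCreateChannel(channelName)
        channel.AddClient(clientID, client)
        return channel, nil
}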

@slayercat slayercat force-pushed the fix-issue-883 branch 2 times, most recently from 0eb302c to 5bcb654 on May 2, 2017 05:14
@slayercat (Author)

Pushed a new version for review.

@mreiferson

@ploxiln (Member) commented May 2, 2017

For a sleep-retry loop, one microsecond is too short. A context switch can take a couple of microseconds. I suggest a millisecond. (Maybe even 10ms).

I think this does not quite prevent the race, though it might narrow it down significantly. It's still theoretically possible for the channel to start exiting between the check Channel.Exiting() and Channel.AddClient(), or between Topic.Exiting() and Topic.GetChannel().

I have an idea which may help: check for Channel.Exiting() in Channel.AddClient(), after taking the channel lock, before adding the client. If exiting, return error, and do the sleep/retry loop. Why this would work is a bit tricky: if the exit flag is not yet set, Channel.exit() has not yet closed all the clients, which it does with the channel lock held. It cannot start closing them while AddClient() holds the channel lock. So this guarantees that if there is a race with exit(), it will close this client, which should reconnect.
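
A sketch of what that check might look like (illustrative only, not an actual nsqd change; the clients map and parameter types follow nsqd's channel.go but are used here as assumptions):

func (c *Channel) AddClient(clientID int64, client Consumer) error {
        c.Lock()
        defer c.Unlock()

        // per the reasoning above: exit() closes existing clients while holding
        // this same lock, so if the exit flag is not yet set here, any later
        // exit() will also close the client we add now
        if c.Exiting() {
                // the caller (SUB) would sleep briefly and retry
                return errors.New("channel is exiting")
        }
        c.clients[clientID] = client
        return nil
}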

The Topic.exit() race is trickier, I need to think about that more ...

@ploxiln (Member) commented May 2, 2017

I take back the suggestion I wrote in the second half of my comment, I don't think that sufficiently improves the current PR.

Maybe the best way, without changing too much of the locking and structure of Topic and Channel, is to (see the sketch after this list):

  1. GetTopic(), GetChannel(), AddClient()
  2. Check topic.Exiting() and channel.Exiting()
  3. only if exiting, try to remove client from channel
  4. only if exiting and ephemeral and not tombstoned, sleep and loop

EDIT: tombstone check might not be relevant here
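
A sketch of how that flow could look in protocolV2.SUB; the retry bound, sleep interval, and use of the unexported ephemeral fields are illustrative assumptions, not the actual fix:

var channel *Channel
for i := 1; ; i++ {
        // 1. the same three calls as today
        topic := p.ctx.nsqd.GetTopic(topicName)
        channel = topic.GetChannel(channelName)
        channel.AddClient(client.ID, client)

        // 2. check whether we raced with ephemeral cleanup
        if !topic.Exiting() && !channel.Exiting() {
                break // registered on a live channel
        }

        // 3. exiting: back out of the doomed channel
        channel.RemoveClient(client.ID)

        // 4. only ephemeral topics/channels are recreated on demand, so only they retry
        if !(topic.ephemeral || channel.ephemeral) || i >= 5 {
                return nil, protocol.NewFatalClientErr(nil, "E_SUB_FAILED",
                        fmt.Sprintf("SUB failed to subscribe to %s %s", topicName, channelName))
        }
        time.Sleep(time.Millisecond)
}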

@mreiferson (Member)

I think @ploxiln's last proposal sounds like a reasonable "lock-free" approach. My original proposal still stands, which is to promote these 3 independent operations to a single operation at either the topic or channel level.

@ploxiln (Member) commented May 4, 2017

I think the "correct locking" option, "promote to a single operation", would require a single whole-NSQD mutex to be held whenever a consumer connects to, disconnects from, or is cleaned up on any ephemeral channel.

If the lock is per-channel, in the Channel struct, the new consumer could get the channel and try to lock it just after the previous consumer locks it. The new consumer will finally get the lock after the Channel is already exited (just not yet garbage collected), and will have to loop and retry, hoping to get a brand new ephemeral Channel with the same name.

If the lock is per-topic, in the Topic struct, the same logic applies if the topic is also ephemeral - a new consumer might get the topic and try to lock it, but only actually gain ownership of the lock after the topic is already doomed, and will need to loop and retry.

Thus, a whole-NSQD "ephemeral lock" would need to be held, so that a consumer connection cannot even call GetTopic() for an ephemeral topic at the same time as another consumer connection calls RemoveClient() on such a topic.
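
Sketched roughly (ephemeralMtx is a hypothetical field on the NSQD struct, not actual code), both paths would hold that lock for ephemeral names only:

// consumer connect path (protocolV2.SUB), ephemeral topic/channel names only
p.ctx.nsqd.ephemeralMtx.Lock()
topic := p.ctx.nsqd.GetTopic(topicName)
channel := topic.GetChannel(channelName)
channel.AddClient(client.ID, client)
p.ctx.nsqd.ephemeralMtx.Unlock()

// consumer disconnect / cleanup path, ephemeral names only
p.ctx.nsqd.ephemeralMtx.Lock()
channel.RemoveClient(client.ID) // may trigger deletion of the ephemeral channel/topic
p.ctx.nsqd.ephemeralMtx.Unlock()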

@ploxiln (Member) commented May 4, 2017

I don't think my scheme would require consumers connecting to non-ephemeral topics and channels to take any such lock, because non-ephemeral topics and channels are only deliberately deleted by the operator, and immediate resurrection is not appropriate for them.

@slayercat (Author)

#891 fixes the issue.
