-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsqd: ephemeral channel race condition with reconnecting client #886
Conversation
The pem file in test is out-dated. so the check failed. |
I've got to think through this one, thanks for the fix though @slayercat! I'll also fix the TLS certs... |
Fixed TLS certs in 932b3fa |
4c42cab
to
b3d3769
Compare
I've just sent a new commit to make the global lock small enough. |
The comment is now out-of-date:
I haven't thought enough about whether we still need that ordering ... @mreiferson would know better than me anyway |
Commit a0a6ce6 imports the change. I'm not sure what's this commit for, but seems it make sense to disable client connect to topic if the topic is been destroyed ===== commit a0a6ce6
|
How about this?
|
OK, I poked around at this. I think Perhaps what we need is a function at the topic level that acquires a lock and performs these three actions? |
0eb302c
to
5bcb654
Compare
Commited a new PR for review. |
…ent. It's possible for a client reconnecting quickly and subscribed to an ephemeral channel to race with nsqd's cleanup of said ephemeral channel, as documented in nsqio/go-nsq#206. Fixes nsqio#883
For a sleep-retry loop, one microsecond is too short. A context switch can take a couple of microseconds. I suggest a millisecond. (Maybe even 10ms). I think this does not quite prevent the race, though it might narrow it down significantly. It's still theoretically possible for the channel to start exiting between the check I have an idea which may help: check for The |
I take back the suggestion I wrote in the second half of my comment, I don't think that sufficiently improves the current PR. Maybe the best way, without changing too much of the locking and structure of Topic and Channel, is to:
EDIT: tombstone check might not be relevant here |
I think @ploxiln's last proposal sounds like a reasonable "lock-free" approach. My original proposal still stands, which is to promote these 3 independent operations to a single operation at either the topic or channel level. |
I think that the "correct locking" option, "promote to a single operation", would require a single mutex for the whole NSQD to be locked when any consumer connects or disconnects / is cleaned-up on any channel which is ephemeral. If the lock is per-channel, in the Channel struct, the new consumer could get the channel and try to lock it just after the previous consumer locks it. The new consumer will finally get the lock after the Channel is already exited (just not yet garbage collected), and will have to loop and retry, hoping to get a brand new ephemeral Channel with the same name. If the lock is per-topic, in the Topic struct, the same logic applies if the topic is also ephemeral - a new consumer might get the topic and try to lock it, but only actually gain ownership of the lock after the topic is already doomed, and will need to loop and retry. Thus, a whole-NSQD "ephemeral lock" will need to be held, to avoid a consumer connection even calling |
I don't think my scheme would require consumers connecting to non-ephemeral topics and channels to take any such lock, because non-ephemeral topics and channels are only deliberately deleted by the operator, and immediate resurrection is not appropriate for them. |
#891 fixs the issue. |
It's possible for a client reconnecting quickly and subscribed to an ephemeral channel to race with nsqd's cleanup of said ephemeral channel, as documented in nsqio/go-nsq#206.
Fixes #883