-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix data race #5029
Fix data race #5029
Conversation
Thanks @dawxy adding to our milestone to be looked at before next release. A quick glance seems simple but as with all concurrency things will need a bit of careful though to reason about whether it could ever be called in a way that will deadlock etc. |
@banks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct to me and seems necessary.
makeQueryOptionsWithContext is used in several of the watchFuncFactory functions which look to only ever be called from here:
Lines 63 to 71 in 57b5a1d
for !p.shouldStop() { | |
// Invoke the handler | |
blockParamVal, result, err := p.Watcher(p) | |
// Check if we should terminate since the function | |
// could have blocked for a while | |
if p.shouldStop() { | |
break | |
} |
We can't hold the stop lock outside of the factory func that uses the makeQueryOptionsWithContext because the watcher may block for too long and we dont want the other routines attempting to stop the watch to block for that long. The only solution is then for locking where this PR put it or in all the functions that call it (the watchFuncFactory functions).
I see one other problem here which is that we could be setting a cancel function after the watch has already been stopped. Ideally while holding the lock we would check if the plan is stopped and if so then return nil without setting the cancel func. However all the other factory funcs would need to be altered to abort early when nil is returned. When this doesn't happen its possible for a watch to linger for many minutes while waiting for new data even after it was supposed to be stopped.
I will think on whether the second abort early bit is worthwhile. @banks any thoughts?
Is that a problem? Calling cancel on an already-cancelled Context is safe. |
@banks Its possible that comment was misleading. The actual problem is that we could have go routines sitting around for a long time after they should be stopped. If the watch gets stopped before setting the cancel func but after the first stop check here: Lines 63 to 71 in 57b5a1d
Then the watch still gets executed but nothing can ever attempt to cancel it. It isn't a huge problem but we could for example detect while holding the lock that the watch is stopped and execute the new cancel func. That would prevent the watch from hanging around too long. |
@mkeeler |
@mkeeler PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. This should fix the concurrency issues as well as ensure that we don't let the new watch hang around for a long time when it should be stopped.
Fix #4357