-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simtime alarm race condition #12585
Comments
cc @antoniovicente not sure if this was pre-existing or introduced as part of your recent changes. @jmarantz is going to take a look. |
Repro with the associated PR and:
|
Repro (I think) in a unit test:
Failed for we with debug+tsan 14/100 times. Works 100% of the time if I swap the last 2 lines of the test. |
After talking to jmarantz offline and looking through the code more closely, I think the issue is related to SetMonotonicTime doing activate calls on multiple dispatchers on objects that are owned by their respective dispatchers and may be on any stage of their lifetime. The call stack on this bug is a good example: A call to Alarm::activateLockHeld that races with the destructor for the same alarm. A possible solution to this is changing the way simulated time is advanced. A possible solution would be to keep a separate collection of alarms per dispatcher, and have SetMonotonicTime schedule an event that moves the time forward and activates any alarms that are past the expiration point from within each of the dispatchers. By having the activate calls in the dispatchers we avoid the race between activate and deletion. We should also be able to greatly simplify or even eliminate locking in simulated time which should greatly simplify the implementation. |
/assign @antoniovicente |
Thanks @antoniovicente |
Yeah this is an interesting option and I think makes sense. |
avd's draft impl: #12609 |
Fixed by #12609 |
This got exposed and/or became easier to repro via #12527. Basically the alarm is getting deleted in parallel to it being scheduled. We need to rethink the locking here to make sure the time system mutex is held across all operations and objects are kept alive if already scheduled and the callback should just not be called, etc.
The text was updated successfully, but these errors were encountered: