fix(subscriber): mitigate race in Callsites::contains
#474
The `ConsoleLayer` uses the `Callsites` struct to store and check for callsites for specific kinds of traces, for example spawn spans or waker events. `Callsites` stores a fixed-size array of pointers to the `Metadata` for each callsite and a length to indicate how many callsites it has registered. The length and each individual pointer are stored in atomics.

Since these values can change independently of one another, if a callsite lookup fails, we check whether the length of the array changed while we were checking the pointers; if it has, the lookup is started again.

However, there is still a possible race condition. If the length changes, but the lookup occurs before the callsite pointer is actually written, then we may miss a callsite that is in the process of being registered. In this case, the pointer which is loaded from the `Callsites` array will be null.

This change adds a check for this case (null ptr) and repeats the lookup if it occurs.

This race condition was found while chasing down the source of #473. It doesn't solve the flakiness, but it can reduce the likelihood of it occurring, so it is a mitigation only.

In reality, neither of these race condition checks should be needed, as we would expect that `tracing` guarantees that `ConsoleLayer` completes `register_callsite()` before `on_event()` or `new_span()` are called.
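To make the race window concrete, here is a minimal, dependency-free sketch of the structure described above. It is not the actual `console-subscriber` source: `Meta` is a hypothetical stand-in for `tracing`'s `Metadata`, the `MAX` parameter and the method bodies are assumptions reconstructed from the description, and the null check follows the idea in this PR rather than its exact diff.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Hypothetical stand-in for `tracing::Metadata`, so the sketch has no dependencies.
pub struct Meta {
    pub name: &'static str,
}

/// Fixed-size array of callsite pointers plus a length, all stored in atomics.
pub struct Callsites<const MAX: usize> {
    ptrs: [AtomicPtr<Meta>; MAX],
    len: AtomicUsize,
}

impl<const MAX: usize> Callsites<MAX> {
    pub fn new() -> Self {
        Self {
            ptrs: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())),
            len: AtomicUsize::new(0),
        }
    }

    pub fn insert(&self, callsite: &'static Meta) {
        // Step 1: reserve a slot by bumping the length.
        let idx = self.len.fetch_add(1, Ordering::AcqRel);
        assert!(idx < MAX, "callsite array is full");
        // Step 2: publish the pointer. A reader that runs between step 1
        // and step 2 sees the new length but still loads null from this slot.
        self.ptrs[idx].store(callsite as *const Meta as *mut Meta, Ordering::Release);
    }

    pub fn contains(&self, callsite: &'static Meta) -> bool {
        let mut start = 0;
        let mut len = self.len.load(Ordering::Acquire).min(MAX);
        'scan: loop {
            for slot in &self.ptrs[start..len] {
                let recorded = slot.load(Ordering::Acquire);
                if ptr::eq(recorded, callsite) {
                    return true;
                } else if recorded.is_null() {
                    // Slot reserved but not yet written: scan the range again.
                    std::hint::spin_loop();
                    continue 'scan;
                }
            }
            // Existing check: did the length grow while we were scanning?
            let new_len = self.len.load(Ordering::Acquire).min(MAX);
            if new_len > len {
                start = len;
                len = new_len;
                continue;
            }
            return false;
        }
    }
}
```

Even with both checks, a lookup that runs entirely before step 1 of a concurrent `insert` still returns `false`, which is why the description calls this a mitigation rather than a fix.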
return true;
} else if ptr::eq(recorded, ptr::null_mut()) {
style nit, take it or leave it: this could be
- } else if ptr::eq(recorded, ptr::null_mut()) {
+ } else if recorded.is_null() {
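As a quick illustration of why the two spellings are interchangeable, the sketch below wraps each in a small function. Both helper names are hypothetical and exist only for this comparison.

```rust
use std::ptr;

/// The spelling used in the diff: compare against an explicit null pointer.
pub fn is_unwritten_verbose<T>(recorded: *mut T) -> bool {
    ptr::eq(recorded, ptr::null_mut::<T>())
}

/// The suggested spelling: the built-in null check on raw pointers.
pub fn is_unwritten<T>(recorded: *mut T) -> bool {
    recorded.is_null()
}
```

Both return the same result for every thin pointer; `is_null()` simply states the intent more directly.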
} else if ptr::eq(recorded, ptr::null_mut()) {
    // We have read a recorded callsite before it has been
    // written. We need to check again.
    continue;
hmm, this restarts the whole loop over again. should we, instead, have an inner loop for loading the specific array index until it's no longer null?
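One way the inner-loop idea could look, as a rough sketch rather than a proposed patch: spin on a single slot until its pointer has been published, instead of restarting the whole scan. The helpers `load_published` and `slots_contain` are hypothetical names. The sketch assumes every slot passed in lies below the observed length, so a writer is guaranteed to fill it eventually; without that, the inner loop would never terminate.

```rust
use std::hint;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical helper: re-load one reserved slot until it is no longer null.
pub fn load_published<T>(slot: &AtomicPtr<T>) -> *mut T {
    loop {
        let recorded = slot.load(Ordering::Acquire);
        if !recorded.is_null() {
            return recorded;
        }
        // The slot was reserved by a length bump but not written yet.
        hint::spin_loop();
    }
}

/// Hypothetical helper: scan reserved slots, waiting on each unwritten one in place.
pub fn slots_contain<T>(slots: &[AtomicPtr<T>], target: *const T) -> bool {
    slots
        .iter()
        .any(|slot| ptr::eq(load_published(slot), target))
}
```

Compared with restarting the outer loop, this avoids re-checking slots that were already read, at the cost of blocking the lookup on one specific writer.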
@hawkw After some thinking (and then forgetting about this), I believe that this change doesn't make sense.
In fact, I'm not sure that the retry mechanism makes sense in general. If `tracing` gives the guarantee that `Subscriber::register_callsite` will be called (and finish!) before `Subscriber::event` gets called, then the retry shouldn't be necessary at all.
If it doesn't give that guarantee, then the retry is increasing our chances of finding the callsite we are interested in during a race condition, but I don't think that it actually solves the problem either.
For now I'm going to close this PR. After looking at tokio-rs/tracing#2743 a bit more I'll revisit.
it would be really nice to have a test that reproduces this raciness. maybe we should add
Closing this without merging as I don't think it actually makes sense. See #474 (comment) for more details.