-
Notifications
You must be signed in to change notification settings - Fork 742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow register_callsite()
can cause other subscribers to not receive register_callsite()
at all
#2743
Labels
Comments
hds
added a commit
that referenced
this issue
Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local subscribers interfer so that when the first one takes too long to complete `register_callsite()`, the second subscriber doesn't have `register_callsite()` called at all. It also contains an attempt at a fix by removing a race condition in `DefaultCallsite::register`. This fix may be necessary anyway, but it isn't sufficient. The issue for this problem is #2743.
hds
added a commit
that referenced
this issue
Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local subscribers interfer so that when the first one takes too long to complete `register_callsite()`, the second subscriber doesn't have `register_callsite()` called at all. It also contains an attempt at a fix by removing a race condition in `DefaultCallsite::register`. This fix may be necessary anyway, but it isn't sufficient. The issue for this problem is #2743.
hds
added a commit
that referenced
this issue
Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local subscribers interfer so that when the first one takes too long to complete `register_callsite()`, the second subscriber doesn't have `register_callsite()` called at all. It also contains an attempt at a fix by removing a race condition in `DefaultCallsite::register`. This fix may be necessary anyway, but it isn't sufficient. The issue for this problem is #2743.
hds
added a commit
that referenced
this issue
Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local subscribers interfer so that when the first one takes too long to complete `register_callsite()`, the second subscriber doesn't have `register_callsite()` called at all. It also contains an attempt at a fix by removing a race condition in `DefaultCallsite::register`. This fix may be necessary anyway, but it isn't sufficient. The issue for this problem is #2743.
hds
added a commit
that referenced
this issue
Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local subscribers interfer so that when the first one takes too long to complete `register_callsite()`, the second subscriber doesn't have `register_callsite()` called at all. It also contains an attempt at a fix by removing a race condition in `DefaultCallsite::register`. This fix may be necessary anyway, but it isn't sufficient. The issue for this problem is #2743.
Thanks for the repro, I can look into the potential fix and see what else is necessary to make it pass! |
hawkw
added
kind/bug
Something isn't working
crate/core
Related to the `tracing-core` crate
labels
Oct 10, 2023
This was referenced Nov 16, 2023
hds
added a commit
to tokio-rs/console
that referenced
this issue
Apr 10, 2024
We have identified that some of the `console-subscriber` integration tests are flaky (#473). This appears to be a result of the following issue in `tracing` tokio-rs/tracing#2743. Flaky tests are really worse than no tests. So these tests will be removed until the flakiness can be fixed.
hds
added a commit
to tokio-rs/console
that referenced
this issue
Apr 10, 2024
We have identified that some of the `console-subscriber` integration tests are flaky (#473). This appears to be a result of the following issue in `tracing` tokio-rs/tracing#2743. Flaky tests are really worse than no tests. So these tests will be removed until the flakiness can be fixed.
hds
added a commit
that referenced
this issue
Apr 16, 2024
There are 2 triggers which will cause a subscriber to receive a call to `Subscriber::register_callsite` for a specific callsite. 1. The first time the event or span at that callsite is executed. 2. When a new subscriber is added or removed (for example, calls to `set_default` or `with_default`) It is trigger (2) that will cause a new subscriber to receive `Subscriber::register_callsite` for all the callsites which had already been registered before it became active. When a callsite is registered for trigger (1), the callsite starts in state `UNREGISTERED`. The first thread to encounter the callsite will transition it to `REGISTERING` and determine the overall interest for the callsite by registering with all known dispatchers (which will call into `Subscriber::register_callsite`). Once that is complete, the callsite is added to the list of all known callsites and its state is transitioned to `REGISTERED`. is (re)built for all known dispatchers. The callsite starts in state `UNREGISTERED`. The This calls down into `Subscriber::register_callsite` for each subscriber. Once that is complete, the callsite is added to the global list of known callsites. While the callsite interest is being rebuilt, other threads that encounter the callsite will be given `Interest::sometimes()` until the registration is complete. However, if a new subscriber is added during this window, all the interest for all callsites will be rebuilt, but because the new callsite (in state `REGISTERING`) won't be included because it isn't yet in the global list of callsites. This can cause a case where that new subscriber being added won't receive `Subscriber::register_callsite` before it receives the subsequent call to `Subscriber::event` or `Subscriber::new_span`. The documentation on [Registering Callsites] is not very explicit on this point, but it does suggest that `Subscriber::register_callsite` will be called before the call to either `Subscriber::event` or `Subscriber::new_span`, and the current behavior can break this implicit contract. [Registering Callsites]: https://docs.rs/tracing-core/0.1.32/tracing_core/callsite/index.html#registering-callsites This change swaps the order of rebuilding the callsite interest and adding the callsite to the global list so that the callsite gets pushed first, avoiding this window in which a subscriber won't get a call to `register_callsite`. As such, a callsite may have its interest read before it is set. In this case, the existing implementation will return `Interest::sometimes()` for the `DefaultCallsite` implementation. Other implementations (outside of the `tracing` project) may perform this differently, but in this case, there is no documented guarantee regarding the ordering. A regression test is included which provokes the race condition 100% of the time before the changes in this fix. Fixes: #2743
hds
added a commit
that referenced
this issue
Nov 25, 2024
There are 2 triggers which will cause a subscriber to receive a call to `Subscriber::register_callsite` for a specific callsite. 1. The first time the event or span at that callsite is executed. 2. When a new subscriber is added or removed (for example, calls to `set_default` or `with_default`) It is trigger (2) that will cause a new subscriber to receive `Subscriber::register_callsite` for all the callsites which had already been registered before it became active. When a callsite is registered for trigger (1), the callsite starts in state `UNREGISTERED`. The first thread to encounter the callsite will transition it to `REGISTERING` and determine the overall interest for the callsite by registering with all known dispatchers (which will call into `Subscriber::register_callsite`). Once that is complete, the callsite is added to the list of all known callsites and its state is transitioned to `REGISTERED`. is (re)built for all known dispatchers. The callsite starts in state `UNREGISTERED`. The This calls down into `Subscriber::register_callsite` for each subscriber. Once that is complete, the callsite is added to the global list of known callsites. While the callsite interest is being rebuilt, other threads that encounter the callsite will be given `Interest::sometimes()` until the registration is complete. However, if a new subscriber is added during this window, all the interest for all callsites will be rebuilt, but because the new callsite (in state `REGISTERING`) won't be included because it isn't yet in the global list of callsites. This can cause a case where that new subscriber being added won't receive `Subscriber::register_callsite` before it receives the subsequent call to `Subscriber::event` or `Subscriber::new_span`. The documentation on [Registering Callsites] is not very explicit on this point, but it does suggest that `Subscriber::register_callsite` will be called before the call to either `Subscriber::event` or `Subscriber::new_span`, and the current behavior can break this implicit contract. [Registering Callsites]: https://docs.rs/tracing-core/0.1.32/tracing_core/callsite/index.html#registering-callsites This change swaps the order of rebuilding the callsite interest and adding the callsite to the global list so that the callsite gets pushed first, avoiding this window in which a subscriber won't get a call to `register_callsite`. As such, a callsite may have its interest read before it is set. In this case, the existing implementation will return `Interest::sometimes()` for the `DefaultCallsite` implementation. Other implementations (outside of the `tracing` project) may perform this differently, but in this case, there is no documented guarantee regarding the ordering. A regression test is included which provokes the race condition 100% of the time before the changes in this fix. Fixes: #2743 Co-authored-by: David Barsky <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Bug Report
There is code below that reproduces this issue consistently (increase the sleep time to make it 100%).
I've uploaded a branch containing the test that reproduces the issue and my first attempt at a fix (which doesn't appear to fix anything). There's a draft PR open against
v0.1.x
to show the different #2744.Version
tracing-core
: 0.1.30tracing
: 0.1.37 (not necessary, can also be reproduced from justtracing-core
)Also confirmed present on the
v0.1.x
branch.This error is not reproducible on
master
.Platform
Linux and macOS confirmed. Have not tested on Windows.
Linux <name> 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Darwin <name> 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:31 PST 2024; root:xnu-10063.101.15~2/RELEASE_X86_64 x86_64
Crates
tracing-core
Description
If a
Subscriber
'sregister_callsite()
function takes too long to complete, it can cause other subscribers to not haveregister_callsite
called at all. Even if the second subscriber is the one which is then recording the tracing event at that callsite.a. Registers a local
Subscriber
(viatracing::subscriber::set_default
)b. Records a single event.
When the first subscriber takes too long inside
register_callsite()
, the second subscriber doesn't receive a call toregister_callsite()
at all, instead it receives a call toevent()
directly.What is too long?
By ensuring the order in which the threads start executing, a
thread::sleep
of 50 microseconds willI can reliably reproduce this issue with the code below with a
thread::sleep
of 50 microseconds, although it will sometimes succeed. I have seen it reproduce with a sleep of only 20 microseconds, although that was less frequent (3 failures out of 10 tries just now). The machine is a 6 core 3.xGHz Dell desktop that's about 6 years old - so it's not top of the line, but also not underpowered. Keep in mind that the test is callingprintln!
, so that may be adding additional delay.Here is a full reproduction of the issue as an integration test is below. Note that the assert that would actually make this test fail has been commented out so that it can be observed that
subscriber-2
doesn't receive a call toregister_callsite()
at all (it is not that it receives it after the call toevent()
which I had originally suspected).If we add
println!
messages to theregister_callsite
andevent
methods, we get the following output:When the sleep in the first subscriber is set to 0, the expected behavior is observed:
My best guess is that the reregistration of the callsite (because there's a new subscriber
subscriber-2
in the new thread) causessubscriber-2
to skip registering the callsite of the event when it executes it.A description of the fix can be found in #2938.
The text was updated successfully, but these errors were encountered: