Console reporting excessive polling, that isn't there #378
Comments
Interesting, thanks for the report --- this definitely seems like a bug in the console. Looking at your reproduction (webrtc-rs/webrtc#284), I see you've included a screenshot of the task details view, where a very large number of polls is visible in the poll duration histogram. Is this same poll count shown in the task list view (the main view that's shown when launching the console)? It would be useful to determine whether this is caused by a bug in how the console tracks polls overall, or if it's specific to how we record the poll duration histogram. Thanks!
The poll count is the same in the task view; I only took the screenshot in the details view to show that the task hasn't been woken. I'm gonna try debug-printing the join handle to see if the ID matches that of the task reported in the console.
Hmm, this was quite interesting. The task as reported by

```rust
let handle = tokio::spawn(trace_future(
    Operations::start(l, ops_tx, ops_rx, close_rx),
    "Operations::start",
));
log::info!("Started operations task {:?}", handle);
```

is

Console shows two ids:
This is expected behavior: these are different task IDs assigned by different sources. The one reported in the task list/task pane in the details view is assigned by the console, while the one in the `JoinHandle`'s `Debug` output is assigned by the Tokio runtime. The console doesn't currently know about the runtime-assigned task IDs generated by Tokio, as these were added quite recently. Although it would be nice to make it aware of them and have the console map its internal IDs to Tokio-assigned task IDs when available, this isn't a bug, as the console will always identify the task by its console-assigned ID.
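As an aside for anyone comparing IDs the same way: rather than fishing the runtime-assigned ID out of the handle's `Debug` output, recent Tokio versions expose it directly on the `JoinHandle`. A minimal sketch, assuming a Tokio version with the unstable `JoinHandle::id` API (at the time of this issue it sits behind the `tokio_unstable` cfg that the console already requires):

```rust
#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async { 1 + 1 });
    // `JoinHandle::id` returns the *runtime*-assigned `tokio::task::Id`, which is
    // a different number from the console-assigned task ID shown in the UI.
    println!("runtime task id: {:?}", handle.id());
    handle.await.unwrap();
}
```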
Thanks for the explanation. It sounds like my theory that the data was being reported for the wrong task in the console is disproven, then. I'm happy to take any pointers on how to continue investigating this further.
The logic for tracking polls is in `console-subscriber/src/lib.rs`, lines 767 to 850 (at da0e972).

That's probably a good place to start adding some `dbg!`s or something, if you're interested in continuing to investigate this.

Alternatively, it could be really useful to see if it's possible to reproduce this behavior outside of `webrtc`. Either of those avenues is probably worth pursuing. Thanks for your interest in helping to debug this issue; it's very appreciated!
Aight, here's what I did: I applied the following patch to `console-subscriber`:

```diff
diff --git a/console-subscriber/src/aggregator/mod.rs b/console-subscriber/src/aggregator/mod.rs
index cf126b1..ede09a7 100644
--- a/console-subscriber/src/aggregator/mod.rs
+++ b/console-subscriber/src/aggregator/mod.rs
@@ -399,6 +399,13 @@ impl Aggregator {
                 fields,
                 location,
             } => {
+                println!(
+                    "Spawned task with ID: {}, location: {:?}, fields: {:?}",
+                    id.into_u64(),
+                    location,
+                    fields
+                );
+
                 self.tasks.insert(
                     id.clone(),
                     Task {
@@ -449,6 +456,8 @@ impl Aggregator {
                 task_id,
                 is_ready,
             } => {
+                println!("Polled future with task ID: {}", task_id.into_u64());
+
                 let poll_op = proto::resources::PollOp {
                     metadata: Some(metadata.into()),
                     resource_id: Some(resource_id.into()),
```

Then I ran the reproduction example from webrtc-rs/webrtc#284 for about three minutes. I wrote a small Python script to analyse the resulting logs and spit out the most-polled tasks, joining the task_id from the poll events with the IDs from the spawned events.

find-tasks.py

```python
from collections import Counter
import sys
import re

TASK_ID_REGEX = re.compile(r"Spawned.+ID:\s*(\d+).*")

# Polled future with task ID: 9008298766368777
def extract_id(line):
    if line.startswith("Polled future"):
        parts = line.split(":")
        return int(parts[1])
    return None

def extract_task_id(line):
    mat = TASK_ID_REGEX.match(line)
    if mat is not None:
        return int(mat.group(1))
    return None

def main():
    ids = Counter()
    known_tasks = {}
    for line in sys.stdin.readlines():
        task_id = extract_id(line)
        if task_id is not None:
            ids[task_id] += 1
            continue
        spawned_task_id = extract_task_id(line)
        if spawned_task_id is not None:
            known_tasks[spawned_task_id] = line

    most_common = ids.most_common(50)
    for (task_id, count, spawn_line) in ((i, c, known_tasks[i]) for (i, c) in most_common):
        print(f"{task_id} polled {count} times. Spawned as {spawn_line}")

if __name__ == '__main__':
    main()
```

**Findings**

Comparing this output to

Here's the log from the above python script

Log

and a corresponding screenshot of

As you can see here, the task that's actually spawned for

When looking at what's reported on the
I've got it! The "problem" (not really sure if it's a problem) is that the operations inside
Hmm, that definitely seems wrong --- we shouldn't be counting polls of a child task as polls of the task that spawned it.
Here's how I figured this out: Using the following diff:

diff.patch

```diff
diff --git a/console-subscriber/src/aggregator/mod.rs b/console-subscriber/src/aggregator/mod.rs
index cf126b1..ede09a7 100644
--- a/console-subscriber/src/aggregator/mod.rs
+++ b/console-subscriber/src/aggregator/mod.rs
@@ -399,6 +399,13 @@ impl Aggregator {
                 fields,
                 location,
             } => {
+                println!(
+                    "Spawned task with ID: {}, location: {:?}, fields: {:?}",
+                    id.into_u64(),
+                    location,
+                    fields
+                );
+
                 self.tasks.insert(
                     id.clone(),
                     Task {
@@ -449,6 +456,8 @@ impl Aggregator {
                 task_id,
                 is_ready,
             } => {
+                println!("Polled future with task ID: {}", task_id.into_u64());
+
                 let poll_op = proto::resources::PollOp {
                     metadata: Some(metadata.into()),
                     resource_id: Some(resource_id.into()),
diff --git a/console-subscriber/src/lib.rs b/console-subscriber/src/lib.rs
index 748b045..303b8ab 100644
--- a/console-subscriber/src/lib.rs
+++ b/console-subscriber/src/lib.rs
@@ -573,6 +573,11 @@ where
             };
             (event, stats)
         }) {
+            println!(
+                "Attached {:?} to span with id {}, reason: spawn",
+                Arc::as_ptr(&stats),
+                id.into_u64()
+            );
             ctx.span(id).expect("if `on_new_span` was called, the span must exist; this is a `tracing` bug!").extensions_mut().insert(stats);
         }
         return;
@@ -611,6 +616,11 @@ where
             };
             (event, stats)
         }) {
+            println!(
+                "Attached {:?} to span with id {}, reason: resource",
+                Arc::as_ptr(&stats),
+                id.into_u64()
+            );
             ctx.span(id).expect("if `on_new_span` was called, the span must exist; this is a `tracing` bug!").extensions_mut().insert(stats);
         }
     }
@@ -649,6 +659,11 @@ where
             (event, stats)
         })
         {
+            println!(
+                "Attached {:?} to span with id {}, reason: async op",
+                Arc::as_ptr(&stats),
+                id.into_u64()
+            );
             ctx.span(id).expect("if `on_new_span` was called, the span must exist; this is a `tracing` bug!").extensions_mut().insert(stats);
         }
     }
@@ -773,7 +788,13 @@ where
         // if the span we are entering is a task or async op, record the
         // poll stats.
         if let Some(stats) = exts.get::<Arc<stats::TaskStats>>() {
+            let is_parent = at.is_some();
             let at = at.unwrap_or_else(Instant::now);
+            println!(
+                "Entered with stats: {:?}, is_parent: {}",
+                Arc::as_ptr(&stats),
+                is_parent
+            );
             stats.start_poll(at);
             Some(at)
         } else if let Some(stats) = exts.get::<Arc<stats::AsyncOpStats>>() {
diff --git a/tokio-console/src/state/tasks.rs b/tokio-console/src/state/tasks.rs
index 8be120d..cc7305e 100644
--- a/tokio-console/src/state/tasks.rs
+++ b/tokio-console/src/state/tasks.rs
@@ -186,6 +186,7 @@ impl TasksState {
                 warnings: Vec::new(),
                 location,
             };
+            tracing::debug!(?task, "Created new task");
             task.lint(linters);
             Some((id, task))
         });
```

I again ran the reproduction example for several minutes. Then I figured out the address of the

Grepping for that address resulted in by far the most frequent entries, and almost all report

Honestly, I'm a bit confused about how this all fits together, but it seems like things are getting attributed incorrectly, at least.
Here is a simpler program to reproduce the issue. We have two loops, "outer loop 5s" and "inner loop 5ms", which just print to stdout and sleep.

```rust
use std::time::Duration;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    console_subscriber::init();

    tokio::task::Builder::new()
        .name("outer loop 5s")
        .spawn(async move {
            tokio::task::Builder::new()
                .name("inner loop 5ms")
                .spawn(async move {
                    // inner loop 5ms
                    loop {
                        sleep(Duration::from_millis(5)).await;
                        print!(".");
                    }
                })
                .unwrap();

            // outer loop 5s
            loop {
                sleep(Duration::from_secs(5)).await;
                println!(".");
            }
        })
        .unwrap();

    sleep(Duration::from_secs(120)).await;
}
```

Notice how the outer loop task is reported as being polled every 5ms instead of every 5s. Also, it's reported to spend a lot of time in the scheduled state.
I've been doing some debugging. I think the issue is indeed inside on_enter and on_exit.

```rust
if let Some(now) = update(&span, None) {
    if let Some(parent) = span.parent() {
        update(&parent, Some(now));
    }
    ...
```

The second call to `update` also updates the parent task's poll stats, and removing it seems to fix the excessive polling.

The real question is whether I'm breaking something by removing it :) @hawkw, do you remember the reason for that code? It was introduced in #238; I haven't checked yet if the bug was already present before. I don't fully grasp the console-subscriber design yet, so any hint would be appreciated. I'd be happy to open a PR once I'm sure I'm not breaking everything.
Thanks for looking into this issue @MatteoNardi! I originally suspected that updating the parent span had to do with a task -> resource hierarchy. But I can't find anything that supports this in the code. We'll wait for Eliza's answer to confirm, but your proposal to remove the parent update makes sense from what I can see.
When entering and exiting a span, the old code was also updating the parent stats. This was causing excessive polling to be reported for the parent tasks. See issue #378 for more details. The regression was introduced by the refactor in #238. This fixes the issue by limiting updates to the current span. Closes #378 Co-authored-by: Hayden Stainsby <[email protected]> Co-authored-by: Eliza Weisman <[email protected]>
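To make the effect of "limiting updates to the current span" concrete, here is a small self-contained model; the type and function names are invented for illustration and are not console-subscriber's real internals. It shows how recording a poll on the parent span whenever a child span is entered inflates the parent's poll count, which is the behavior the fix removes.

```rust
use std::collections::HashMap;

/// Toy model of per-span poll counters; not console-subscriber's real types.
#[derive(Default)]
struct PollCounts {
    polls: HashMap<&'static str, u64>,
    parent: HashMap<&'static str, &'static str>,
}

impl PollCounts {
    /// Old behavior: entering a span records a poll for it *and* its parent.
    fn enter_old(&mut self, span: &'static str) {
        *self.polls.entry(span).or_default() += 1;
        if let Some(parent) = self.parent.get(span).copied() {
            *self.polls.entry(parent).or_default() += 1;
        }
    }

    /// Fixed behavior: only the span actually being entered records a poll.
    fn enter_fixed(&mut self, span: &'static str) {
        *self.polls.entry(span).or_default() += 1;
    }
}

fn main() {
    let mut old = PollCounts::default();
    let mut fixed = PollCounts::default();
    old.parent.insert("inner loop 5ms", "outer loop 5s");
    fixed.parent.insert("inner loop 5ms", "outer loop 5s");

    // The inner task is polled 1000 times; the outer task only once.
    for _ in 0..1000 {
        old.enter_old("inner loop 5ms");
        fixed.enter_fixed("inner loop 5ms");
    }
    old.enter_old("outer loop 5s");
    fixed.enter_fixed("outer loop 5s");

    // Old: the outer task appears to have been polled 1001 times.
    // Fixed: it is reported as polled once, matching reality.
    println!("old:   {:?}", old.polls);
    println!("fixed: {:?}", fixed.polls);
}
```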
# Changelog

All notable changes to this project will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## console-subscriber-v0.2.0 - (2023-09-29)

[0b0c1af](https://github.com/tokio-rs/console/commit/0b0c1aff18c3260d3a45a78f6c0d6f4206af1cbb)...[0b0c1af](https://github.com/tokio-rs/console/commit/0b0c1aff18c3260d3a45a78f6c0d6f4206af1cbb)

### Breaking Changes

- **Update Tonic and Prost dependencies (#364)** ([f9b8e03](https://github.com/tokio-rs/console/commit/f9b8e03bd7ee1d0edb441c94a93a350d5b06ed3b))
  This commit updates the public dependencies `prost` and `tonic` to semver-incompatible versions (v0.11.0 and v0.8.0, respectively). This is a breaking change for users who are integrating the `console-api` protos with their own `tonic` servers or clients.
- **Update `tonic` to v0.10 and increase MSRV to 1.64 (#464)** ([96e62c8](https://github.com/tokio-rs/console/commit/96e62c83ef959569bb062dc8fee98fa2b2461e8d))
  This is a breaking change for users of `console-api` and `console-subscriber`, as it changes the public `tonic` dependency to a semver-incompatible version. This breaks compatibility with `tonic` 0.9.x and `prost` 0.11.x.

### Added

- **breaking**: Update Tonic and Prost dependencies (#364) (f9b8e03)
- Add support for Unix domain sockets (#388) (a944dbc, closes #296)
- Add scheduled time per task (#406) (f280df9)
- Add task scheduled times histogram (#409) (d92a399)
- Update `tonic` to 0.9 (#420) (48af1ee)
- Update MSRV to Rust 1.60.0 (b18ee47)
- Expose server parts (#451) (e51ac5a)
- Add cfg `console_without_tokio_unstable` (#446) (7ed6673)
- Add warning for tasks that never yield (#439) (d05fa9e)
- **breaking**: Update `tonic` to v0.10 and increase MSRV to 1.64 (#464) (96e62c8)

### Documented

- Fix unclosed code block (#463) (362bdbe)
- Update MSRV version docs to 1.64 (#467) (94a5a51)

### Fixed

- Fix build on tokio 1.21.0 (#374) (c34ac2d)
- Fix off-by-one indexing for `callsites` (#391) (43891ab)
- Bump minimum Tokio version (#397) (bbb8f25, fixes #386)
- Fix self wakes count (#430) (d308935)
- Remove clock skew warning in `start_poll` (#434) (4a88b28)
- Do not report excessive polling (#378) (#440) (8b483bf, closes #378)
- Correct retain logic (#447) (36ffc51)

Signed-off-by: Eliza Weisman <[email protected]>
What crate(s) in this repo are involved in the problem?
tokio-console, console-subscriber
What is the issue?
I've observed that `tokio-console` reports excessive polling, to the tune of around 1000 polls per second, for a simple spawned task that is not being woken. This is the task spawning:
https://github.com/webrtc-rs/webrtc/blob/5d05b88de6d3a3eff3cd4e9b57219a25e01d3edc/webrtc/src/peer_connection/operation/mod.rs#L54-L56
and the task itself
https://github.com/webrtc-rs/webrtc/blob/b43e05b574be52b65b648bc367467bb6f14c5e11/webrtc/src/peer_connection/operation/mod.rs#L114-L129
I've observed the task being polled over 1M times over a 24 minute period without having been woken.
How can the bug be reproduced?
I've documented the details and made a small reproducible example in webrtc-rs/webrtc#284
In that PR I've added a printing wrapper around the futures in question, as well as introduced more logging. On my machine, the excessive polling that tokio-console reports does not line up with the number of times the futures are actually polled.
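For context, a poll-counting wrapper of the kind described here can be as simple as the following sketch (this is illustrative, not the actual `trace_future` from that PR): it boxes the wrapped future and prints on every call to `poll`, so the real poll count can be compared against what tokio-console reports.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Illustrative poll-counting wrapper; not the actual `trace_future` from the PR.
struct TraceFuture<F> {
    inner: Pin<Box<F>>,
    name: &'static str,
    polls: u64,
}

fn trace_future<F: Future>(inner: F, name: &'static str) -> TraceFuture<F> {
    TraceFuture { inner: Box::pin(inner), name, polls: 0 }
}

impl<F: Future> Future for TraceFuture<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Boxing `inner` keeps `TraceFuture` itself `Unpin`, so `get_mut` is fine here.
        let this = self.get_mut();
        this.polls += 1;
        println!("{} polled {} times", this.name, this.polls);
        this.inner.as_mut().poll(cx)
    }
}
```

Spawning `trace_future(some_future, "some name")` and comparing its printed count with the console's poll count for that task is enough to show the mismatch described above.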
Logs, error output, etc
No response
Versions
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
maybe