Track time ops are sitting in inbound queue before they are processed #8912
Comments
Please note that there are many players who can pause op processing:
Given that the API is generic and available to all layers, more users may show up later. I think we should break this item into stages:
Consider opening a separate bug for # 2. Or maybe we want to solve it in one go? Not sure.
We have these stages in OpPerfTelemetry (firing in this order):
This issue is about calculating the time between # 3 and # 4 (by leveraging the time provided in # 5). Sounds like a pretty simple change to make. I'd suggest moving toward a property bag model rather than many variables tracking various metrics.
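To illustrate the property bag suggestion, here is a minimal sketch, not the actual OpPerfTelemetry code. The class name `OpTimingBag`, its methods, and the stage names used in the usage note are all hypothetical. The idea is to keep one bag of stage timestamps per sampled op instead of a separate member variable per metric.

```typescript
// Hypothetical sketch: stage names are placeholders chosen by the caller,
// not the real OpPerfTelemetry stages.
class OpTimingBag {
    // One bag per sampled op: stage name -> timestamp in milliseconds.
    private readonly stamps = new Map<string, number>();
    private readonly now: () => number;

    // The clock is injectable so the bag can be exercised deterministically.
    constructor(now: () => number = () => Date.now()) {
        this.now = now;
    }

    // Record the current time for a stage.
    public mark(stage: string): void {
        this.stamps.set(stage, this.now());
    }

    // Duration between two recorded stages, or undefined if either is missing.
    public duration(from: string, to: string): number | undefined {
        const start = this.stamps.get(from);
        const end = this.stamps.get(to);
        return start === undefined || end === undefined ? undefined : end - start;
    }

    // Flatten into a plain object that can be attached to a single telemetry event.
    public toProps(): Record<string, number> {
        const props: Record<string, number> = {};
        this.stamps.forEach((time, stage) => {
            props[stage] = time;
        });
        return props;
    }
}
```

With this shape, adding a new metric means one more `mark("someStage")` call (stage name assumed) rather than a new field plus reset logic.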
One more thing to add (while thinking about the 2 DeltaScheduler issues on your plate): it would also be great to record, if possible, how many ops are in the inbound queue when the op we are sampling is added to it. That way we will have a better feel for whether the speed of processing ops is more a function of too many ops in the queue, or is impacted by code like DeltaScheduler that pauses processing of ops.
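A rough sketch of the queue-length idea, assuming a simplified stand-in for the inbound queue. `InboundQueueProbe`, `SampledPush`, and the `sampleEvery` parameter are hypothetical names, not part of the real DeltaManager queue. It captures the queue length at the moment a sampled op is pushed.

```typescript
// Hypothetical record of what we learn about a sampled op at push time.
interface SampledPush {
    sequenceNumber: number;
    queueLengthAtPush: number;
}

// Simplified stand-in for an inbound queue; not the real DeltaQueue API.
class InboundQueueProbe<T extends { sequenceNumber: number }> {
    private readonly queue: T[] = [];
    private readonly sampleEvery: number;
    public readonly samples: SampledPush[] = [];

    constructor(sampleEvery: number) {
        this.sampleEvery = sampleEvery;
    }

    public push(op: T): void {
        if (op.sequenceNumber % this.sampleEvery === 0) {
            // Length before adding the op = number of ops waiting ahead of it.
            this.samples.push({
                sequenceNumber: op.sequenceNumber,
                queueLengthAtPush: this.queue.length,
            });
        }
        this.queue.push(op);
    }

    // Remove and return the oldest op, as processing would.
    public shift(): T | undefined {
        return this.queue.shift();
    }
}
```

Comparing `queueLengthAtPush` with the op's wait time would help separate a long backlog from a paused queue: a long wait with a short queue points at pausing.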
Note - MaxBatchWaitTimeExceeded telemetry catches the most egregious cases where ops sit in the inbound queue for a long time (5 seconds), but only when the cause is an incomplete batch. It will not capture any other case where op processing is paused outside of that logic.
As for the moments we pause-resume the deltamanager inbound queue: @vladsud Should we add the Summarizer pause/resume as well? I'm assuming we will do the detailed tracking proposed in #2 in a different issue.
Summarizer should not send ops (other than the summarize op). It may participate in some consensus processes, but the number of ops there is minimal. So I'd not worry much about the summarizer showing up in op ack stats, as the vast majority (like 99.9%) of ops should be coming from non-summarizer clients. I'd say that I like the pause/resume logic less and less and would rather see us move away from it. Let's not spend time (for now) logging who pauses queues. Let's use this issue only for tracking op waiting time in the inbound queue (i.e. only changes in OpPerfTelemetry).
I believe this is tracked as durationInboundToProcessing, right? If so, probably time to close this issue and re-evaluate once we have data coming in. |
This is a breakout from #8911.
It tracks item # 3 in the original list:
Track how long ops sit in the inbound queue until they are processed (that would capture the impact of DeltaScheduler & ScheduleManager).
Basically, track the time from when the "push" event is fired for an op (probably the first op in a batch is sufficient) until the start or end of op processing. Maybe track both - time to start processing, and time to process? I think we track the latter somewhere.
As always - it should be sampled not to produce too much data.
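The proposal above can be sketched roughly as follows, assuming the caller wires the three hooks to the queue's "push" event and to the start and end of op processing. `InboundWaitTracker`, its hook names, `sampleEvery`, and `durationProcessing` are hypothetical. `durationInboundToProcessing` is the property name mentioned in this thread, though its exact definition in the real telemetry is not shown here.

```typescript
interface InboundTiming {
    sequenceNumber: number;
    // Time from "push" until processing starts (name taken from this thread).
    durationInboundToProcessing: number;
    // Time spent processing the op (hypothetical name).
    durationProcessing: number;
}

interface PendingSample {
    sequenceNumber: number;
    pushTime: number;
    startTime?: number;
}

class InboundWaitTracker {
    private readonly sampleEvery: number;
    private readonly now: () => number;
    // At most one op is tracked at a time to keep telemetry volume low.
    private pending: PendingSample | undefined;

    constructor(sampleEvery: number, now: () => number = () => Date.now()) {
        this.sampleEvery = sampleEvery;
        this.now = now;
    }

    // Call when the inbound queue fires "push" for an op.
    public onPush(sequenceNumber: number): void {
        if (this.pending === undefined && sequenceNumber % this.sampleEvery === 0) {
            this.pending = { sequenceNumber, pushTime: this.now() };
        }
    }

    // Call right before the op is processed.
    public onProcessStart(sequenceNumber: number): void {
        const sample = this.pending;
        if (sample !== undefined && sample.sequenceNumber === sequenceNumber) {
            sample.startTime = this.now();
        }
    }

    // Call right after the op is processed; returns timings for sampled ops only.
    public onProcessEnd(sequenceNumber: number): InboundTiming | undefined {
        const sample = this.pending;
        if (sample === undefined || sample.sequenceNumber !== sequenceNumber || sample.startTime === undefined) {
            return undefined;
        }
        this.pending = undefined;
        return {
            sequenceNumber,
            durationInboundToProcessing: sample.startTime - sample.pushTime,
            durationProcessing: this.now() - sample.startTime,
        };
    }
}
```

Sampling by sequence number is just one simple choice here; any sampling rule works as long as only a small fraction of ops produce an event.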