
Track time ops are sitting in inbound queue before they are processed #8912

Closed · Tracked by #9613 · Fixed by #9788
vladsud opened this issue Jan 27, 2022 · 7 comments

vladsud commented Jan 27, 2022

This is a break-out from #8911.

It tracks item #3 in the original list:
Track how long ops sit in the inbound queue until they are processed (that would capture the impact of DeltaScheduler & ScheduleManager).

Basically, track the time from when the "push" event is fired for an op (probably the first op in a batch is sufficient) until the start or end of op processing. Maybe track both - time to start processing and time to process? I think we already track the latter somewhere.

As always, it should be sampled so as not to produce too much data.
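
A minimal sketch of what such sampled tracking could look like (hypothetical names; this is not the actual OpPerfTelemetry code, and it assumes the inbound queue surfaces "push"-time and processing-time notifications and that messages carry a sequenceNumber):

```ts
// Hypothetical sketch: sample one op at a time, remember when it was pushed to
// the inbound queue, and emit a telemetry event once processing reaches it.
interface ISequencedMessage {
    sequenceNumber: number;
}

interface ITelemetryLoggerLike {
    send(event: Record<string, unknown>): void;
}

class InboundQueueTimer {
    private sampledSeq: number | undefined;
    private pushTime = 0;

    constructor(
        private readonly logger: ITelemetryLoggerLike,
        private readonly samplingRate = 100, // roughly 1 in 100 ops
    ) {}

    // Wire this to the inbound queue's "push" event.
    public onPush(message: ISequencedMessage): void {
        if (this.sampledSeq === undefined && message.sequenceNumber % this.samplingRate === 0) {
            this.sampledSeq = message.sequenceNumber;
            this.pushTime = performance.now();
        }
    }

    // Wire this to the point where op processing starts (or ends).
    public onProcessingStart(message: ISequencedMessage): void {
        if (message.sequenceNumber === this.sampledSeq) {
            this.logger.send({
                eventName: "OpTimeInInboundQueue",
                sequenceNumber: message.sequenceNumber,
                durationInboundQueue: performance.now() - this.pushTime,
            });
            this.sampledSeq = undefined;
        }
    }
}
```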

vladsud added the "bug" label on Jan 27, 2022
vladsud added this to the February 2022 milestone on Jan 27, 2022

vladsud commented Jan 30, 2022

Please note that there are many players who can pause op processing:

  1. DeltaScheduler
  2. ScheduleManager
  3. Summarizer
  4. Container (code proposal)

Given that the API is generic and available to all layers, more users may show up later.

I think we should break this item into stages:

  1. Immediate: simply track how long ops are paused (sampled)
  2. Slowly: add a system to track who is pausing ops, and a way to expose the first/last/maybe all actors that caused ops to sit paused
    • I'd suggest that pause() take some token (a string, or better an assert tag, clearly representing the call site) and return another token (a counter), and that resume() require the token returned from pause(), so we can match them and keep tracking who keeps the queue paused. That would give us the full picture and enforce that pause() & resume() are properly paired (and if not, we always know the offender). A rough sketch of this is at the end of this comment.
    • A simpler system might be for resume() to accept a token (like an assert tag) to clearly articulate who was the last one to resume op processing, so that we can log it with the duration. This is a less powerful system, but much simpler to implement for this task.

Consider opening a separate bug for #2. Or maybe we want to solve it in one go? Not sure.
I can help distribute the work if we have individual items opened.
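
A rough TypeScript sketch of the token-pairing idea from the first bullet above (all names here are hypothetical, not an existing Fluid API):

```ts
// Hypothetical sketch: pause() takes a caller tag and returns a numeric token;
// resume() must present that token, so unbalanced pause/resume calls are
// detectable and we always know who is still holding the queue paused.
class PauseTracker {
    private nextToken = 0;
    private readonly activePauses = new Map<number, { tag: string; start: number }>();

    public pause(tag: string): number {
        const token = this.nextToken++;
        this.activePauses.set(token, { tag, start: performance.now() });
        return token;
    }

    public resume(token: number): void {
        const entry = this.activePauses.get(token);
        if (entry === undefined) {
            // Offender: resume() without a matching pause().
            throw new Error(`resume() called with unknown token ${token}`);
        }
        this.activePauses.delete(token);
        if (this.activePauses.size === 0) {
            // Last resumer - the queue actually unblocks here; log who released it
            // and how long that caller held its pause.
            console.log(`Queue resumed by "${entry.tag}" after ${performance.now() - entry.start} ms`);
        }
    }

    // Anyone still keeping the queue paused (the potential offenders).
    public get activeTags(): string[] {
        return [...this.activePauses.values()].map((e) => e.tag);
    }
}
```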


vladsud commented Mar 30, 2022

We have these stages in OpPerfTelemetry (firing in this order):

  1. "submitOp" event
  2. DeltaManager 's outbound "push" event
  3. DeltaManager's inbound "push" event
  4. Start of op processing
    • no event, as far as I can see, but next event can be used to calculate this point in time.
  5. DeltaManager "op" event
    • This event reports time to process an op - see IDeltaManagerEvents & code in DeltaManager.processInboundMessage

This issue is about calculating the time between #3 and #4 (by leveraging the time provided in #5).
I'd add that it would also be great to record the time it takes to do #5.

Sounds like a pretty simple change to make.

I'd suggest moving toward a property-bag model, vs. many variables tracking various metrics.
I.e., an empty bag is created at #1, and as we go through the stages we keep adding properties to it; the whole bag is recorded at the end. We still need some tracking variables, but something like durationSittingInInboundQueue (renamed to durationInboundQueue) should be a property in this bag.
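
A minimal sketch of that property-bag model (OpPerfBag and the event/property names are illustrative, not existing code):

```ts
// Hypothetical sketch: start an empty bag at submit, add a duration as the op
// passes each stage, and send the whole bag as one telemetry event at the end.
class OpPerfBag {
    private readonly props: Record<string, number | string> = {};
    private lastStageTime = performance.now();

    // Record the elapsed time since the previous stage under the given name,
    // e.g. "durationOutboundQueue", "durationInboundQueue", "durationProcessing".
    public markStage(name: string): void {
        const now = performance.now();
        this.props[name] = now - this.lastStageTime;
        this.lastStageTime = now;
    }

    // Attach any other metric, e.g. queue length at enqueue time.
    public set(name: string, value: number | string): void {
        this.props[name] = value;
    }

    // Emit the accumulated bag as a single event at the end of op processing.
    public send(logger: { send(event: Record<string, unknown>): void }): void {
        logger.send({ eventName: "OpRoundtripStats", ...this.props });
    }
}
```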


vladsud commented Mar 30, 2022

One more thing to add (while thinking about the 2 DeltaScheduler issues on your plate): it would also be great to record (if this is possible) how many ops are in the inbound queue when the given op (the one we are sampling) is being added to the inbound queue. That way we will have a better feel for whether the speed of processing ops is more a function of too many ops in the queue, or whether it is impacted by code similar to DeltaScheduler that pauses op processing.
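
A small illustrative extension of the bag sketch above (it assumes the inbound queue exposes its current length; treat the exact property name as an assumption):

```ts
// Hypothetical sketch: capture the inbound queue length at "push" time for the
// sampled op, so a large durationInboundQueue can be attributed to either a
// long queue or to paused processing (DeltaScheduler-like code).
interface IInboundQueueLike {
    readonly length: number;
}

function recordQueueLengthOnPush(
    queue: IInboundQueueLike,
    bag: { set(name: string, value: number): void },
): void {
    bag.set("lengthInboundQueue", queue.length);
}
```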


vladsud commented Apr 1, 2022

Note: MaxBatchWaitTimeExceeded telemetry catches the most egregious cases where ops sit in the inbound queue for a long time (5 seconds), but only when the cause is an incomplete batch. It will not capture any other cases where op processing is paused outside of that logic.


NicholasCouri commented Apr 9, 2022

As for the moments when we pause/resume the DeltaManager inbound queue:
a) For the DeltaScheduler, we will be getting it from #9788
b) For the Container-Runtime, we have the BatchStats you've added

@vladsud Should we add pause/resume tracking for the Summarizer as well?

I'm assuming we will do the detailed tracking proposed in #2 in a different issue.


vladsud commented Apr 11, 2022

The Summarizer should not send ops (other than the summarize op). It may participate in some consensus processes, but the number of ops there is minimal. So I would not worry too much about the summarizer showing up in op ack stats, as the vast majority (like 99.9%) of ops should be coming from non-summarizer clients.

I'd say I like the pause/resume logic less and less and would rather see us move away from it. Let's not spend time (for now) logging who pauses queues. Let's use this issue only for tracking op waiting time in the inbound queue (i.e., only changes in OpPerfTelemetry).


vladsud commented Apr 11, 2022

I believe this is tracked as durationInboundToProcessing, right? If so, it's probably time to close this issue and re-evaluate once we have data coming in.

NicholasCouri linked a pull request Apr 11, 2022 that will close this issue