
DeltaScheduler : Unneeded delays on boot #9505

Closed
Tracked by #9613
vladsud opened this issue Mar 16, 2022 · 15 comments

@vladsud
Contributor

vladsud commented Mar 16, 2022

(Update)

Change scope of this item to track two key directions:

  1. Unneeded pausing while loading container. We do not have good data here; we should collect it (by moving Container.loaded = true down a bit to include op processing).
  2. Consider moving DeltaScheduler out of the runtime, or providing the app more control over it, so that the experience controls the balance between responsiveness and throughput.

Teams thread

The issue that was raised: when we catch up and have a lot of ops to process, we waste time for nothing by injecting delays.
It's not clear if we can / should differentiate catch-up from other times we process ops.
But overall, I'd say that using setTimeout(0) should be more than enough.

We should also look at telemetry for when and how often that happens.
Does it happen mostly on boot? When do we have so many ops to process, and why?
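
(For reference, a minimal TypeScript sketch of the yield pattern being discussed - pausing the inbound queue and resuming on a zero-delay timer so control returns to the event loop without an artificial wait. The InboundQueue interface here is a stand-in for the real inbound delta queue, not the actual API.)

interface InboundQueue {
    pause(): void;
    resume(): void;
}

function yieldBetweenOpChunks(inbound: InboundQueue): void {
    // Pausing and resuming via setTimeout(0) defers resumption to the next
    // macrotask, giving pending UI work a chance to run without adding delay.
    inbound.pause();
    setTimeout(() => inbound.resume(), 0);
}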

@vladsud vladsud added the bug Something isn't working label Mar 16, 2022
@vladsud vladsud added this to the April 2022 milestone Mar 16, 2022
@vladsud
Contributor Author

vladsud commented Mar 24, 2022

FYI - across all your April bugs, this one is the most important, and rather simple as well.
It's important because we know that timers like that screw up our end-to-end processing times, adding unneeded delays.

@vladsud
Contributor Author

vladsud commented Apr 1, 2022

Something interesting to note:

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(7d)
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_hostScenarioName notcontains "Whiteboard"
| where Data_duration >= 500
| summarize count()

This shows 466 results for me - there are sessions where we processed a batch of ops uninterrupted for longer than 500ms.
If you trace the "GetDeltas_OpProcessing" event, you will see that it is raised on the inbound queue's "idle" event, which only fires if we processed ops without any interruption and drained the queue (see DeltaQueue.processDeltas() for details).

But the code in DeltaScheduler suggests that should not happen - we should interrupt such long op processing. The only exception is if it's part of a batch, i.e. the batch contains many ops and thus DeltaScheduler can't interrupt it.
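
(A hedged sketch of the interruption check DeltaScheduler is expected to apply while draining the queue; the names and the 20ms budget are illustrative, not the actual constants in deltaScheduler.ts.)

const processingBudgetMs = 20; // illustrative budget, not the real value

function shouldInterrupt(turnStartMs: number, isMidBatch: boolean): boolean {
    // A batch must be processed atomically, so a very large batch is the one
    // case where an uninterrupted turn can legitimately exceed the budget.
    if (isMidBatch) {
        return false;
    }
    return Date.now() - turnStartMs >= processingBudgetMs;
}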

I've just added telemetry to learn more about batches (see #9707), so we should learn more soon, but this looks suspicious. And if batches are that long and we do take that long to process them, then it's not clear there is much value in having DeltaScheduler at all.

Telemetry tracking op processing (#8912, #8947) should also help us better understand how we perform.

@NicholasCouri
Contributor

NicholasCouri commented Apr 7, 2022

@vladsud - Regarding the setTimeout, unless I'm misreading the code, it is already 0 (packages\runtime\container-runtime\src\deltaScheduler.ts):

this.deltaManager.inbound.pause();
// No delay argument, so this resumes on the next macrotask - effectively setTimeout(0).
setTimeout(() => { this.deltaManager.inbound.resume(); });

As for the 466, it only represents 0.19% of the total number of GetDeltas_OpProcessing events. Once the new changes are in, we will probably get more info, but the only explanation I can think of is batching.

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(7d)
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_hostScenarioName notcontains "Whiteboard"
| summarize ct = count() by bin(Data_duration, 500)

Data_duration ct
0 247708
500 240
1000 73
1500 36
2000 20
2500 11
3000 14
3500 21
4000 5
4500 5
5000 3
5500 2
6000 5
6500 1
7000 6
7500 3
8000 1
8500 1
13500 1
17500 1
24000 1
38500 1
39500 1
426500 1

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

Hm, interesting, I was under the impression that we keep injecting delays.
Then, I guess there is not much to do here, but we should improve telemetry (as covered by #8909).

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

I think DeltaScheduler.processingTime is too low; I'd personally increase it to at least 50ms. We will not get 60 FPS scrolling / panning with it, but I think if we have such an influx of ops, getting current sooner is more important than maintaining 60 FPS.

Telemetry will tell us if my intuition is right - we can make this change now, or after we have data.
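
(Rough arithmetic behind that trade-off, as a small TypeScript snippet; the 50ms figure is the value proposed above.)

const frameBudgetMs = 1000 / 60;   // ~16.7 ms per frame at 60 FPS
const proposedBudgetMs = 50;       // processingTime value proposed above
const framesSpanned = Math.ceil(proposedBudgetMs / frameBudgetMs); // = 3
// A 50 ms uninterrupted processing slice spans about 3 frames, so smooth
// 60 FPS scrolling is traded for catching up on ops sooner.
console.log({ frameBudgetMs, proposedBudgetMs, framesSpanned });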

@NicholasCouri
Contributor

NicholasCouri commented Apr 8, 2022 via email

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

Sure

@NicholasCouri
Contributor

I was thinking about it and I'm not so sure it is such a good idea. If this query is correct, wouldn't we get only the 25% from the 2 Turns (~15k more instances == ~3%)?

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(1d)
| where Data_eventName == "fluid:telemetry:ScheduleManager:DeltaScheduler:InboundOpsProcessingTime"
| project-reorder Data_numberOfOps, Data_numberOfTurns, Data_numberOfBatches, Data_processingTime
| summarize Avgt = toint(avg(Data_processingTime)),
P50 = toint(percentile(Data_processingTime, 50)),
Pct75 = toint(percentile(Data_processingTime, 75)),
Pct25 = toint(percentile(Data_processingTime, 25)),
Pct99 = toint(percentile(Data_processingTime, 99)),
Pct99_99 = toint(percentile(Data_processingTime, 99.99)),
NumberOfInstances = count() by Data_numberOfTurns

Data_numberOfTurns Avgt P50 Pct75 Pct25 Pct99 Pct99_99 NumberOfInstances
1 20 2 6 0 585 6766 392981
2 419 166 536 35 2845 21853 56012
3 649 392 697 82 3232 101312 19335
4 756 400 765 125 3661 1195883 8054
5 702 271 900 177 4435 27557 3787
6 780 285 938 242 5514 30754 1939
7 784 352 819 315 4945 29903 1192
8 814 440 843 401 4480 9302 723
9 1030 538 949 492 6517 21722 504
10 1136 648 942 599 7918 11009 351
11 1235 785 1097 719 9183 12109 252
12 1342 904 1274 858 11335 13650 204
13 1462 1048 1197 993 7081 8867 134
14 8593 1196 1397 1145 10679 756929 111
15 1676 1344 1552 1295 8161 8161 75
16 1682 1510 1697 1440 3640 3640 72
17 1980 1771 1994 1662 5268 5268 45
18 3404 2154 2608 1919 34415 34415 32
19 2470 2086 2362 2032 5509 5509 22
20 2894 2335 3364 2286 5785 5785 15
21 3552 2543 2727 2431 17679 17679 17
22 3444 2810 3627 2631 6395 6395 16
23 4204 3475 4151 3052 8018 8018 7
24 4079 3570 4827 3296 5905 5905 12
25 3989 3662 4315 3601 5194 5194 8
26 4688 3904 4089 3736 9288 9288 8
27 46277 4203 17102 3994 240580 240580 7
28 5949 4603 8056 4545 8398 8398 6
29 5101 4701 5420 4693 6504 6504 6
30 7080 7080 7080 7080 7080 7080 1
31 5322 5322 5322 5322 5322 5322 1
32 6073 5941 6442 5836 6442 6442 3
33 9502 9502 9502 9502 9502 9502 1
34 13788 7006 27432 6925 27432 27432 3
35 10600 9156 15657 6988 15657 15657 3
36 7531 7077 7256 6971 8821 8821 4
37 7868 7868 7868 7868 7868 7868 1
38 7753 7558 7948 7558 7948 7948 2
39 9116 9116 9116 9116 9116 9116 1
40 701948 701948 701948 701948 701948 701948 1
42 10593 10593 10593 10593 10593 10593 1
43 10160 10160 10160 10160 10160 10160 1
44 10419 10419 10419 10419 10419 10419 1
46 116289 71518 161060 71518 161060 161060 2
48 13103 13103 13103 13103 13103 13103 1
49 2491221 12977 4969465 12977 4969465 4969465 2
68 96631 96631 96631 96631 96631 96631 1
71             1

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

BTW, any data here will be hard to interpret, because op processing can be paused by external forces, including incomplete batches. Without filtering those cases out, we would have a lot of cases where the total processing time (including waits) is large simply because we are waiting for a batch to be completed.

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

I actually think (based on the huge jump in times from numberOfTurns == 1 to numberOfTurns == 2, and from P50 to P99 when numberOfTurns == 1) that almost all slowness is not due to this code. It's due to incomplete batches coming in, and ScheduleManager / ScheduleManagerCore pausing op processing. If op processing is paused externally, then DeltaScheduler will inject pauses itself when op processing is resumed by ScheduleManager, and thus it will only make things worse.

Let's not track it here; I'll open a separate bug to think about it and how we can improve it - both data collection and logging.

@agarwal-navin
Contributor

There is another reason why we might be seeing more instances of these long processing times - we changed the default flush mode to TurnBased. This means that earlier we saw very few batches, but now we will see batches almost all the time. The op processing time includes the time spent by external code, which may be pretty long (wasn't there a discussion where Scriptor did some kind of layout rendering on batch ends which took a long time?). So even a small batch (say 100 ops) could end up taking a long time to process, hence the spike.

@agarwal-navin
Contributor

Should we try to measure this time vs. the time spent by the runtime? That may give us a better idea of where most of the time is being spent, and we could adjust the processing time in DeltaScheduler accordingly.
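
(A minimal sketch of what such a split measurement could look like; timeExternalCallback and reportProcessingSplit are hypothetical helpers, not existing runtime APIs.)

let externalTimeMs = 0;

// Wrap calls into app/event-handler code so their duration can be separated
// from time spent inside the runtime itself.
function timeExternalCallback<T>(callback: () => T): T {
    const start = Date.now();
    try {
        return callback();
    } finally {
        externalTimeMs += Date.now() - start;
    }
}

function reportProcessingSplit(totalTimeMs: number): void {
    const runtimeTimeMs = totalTimeMs - externalTimeMs;
    // Telemetry could log both numbers, showing whether a long turn is spent
    // in the runtime or in app callbacks (e.g. batch-end layout work).
    console.log({ totalTimeMs, runtimeTimeMs, externalTimeMs });
    externalTimeMs = 0;
}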

@vladsud
Contributor Author

vladsud commented Apr 11, 2022

Yes, I've opened an issue to track time spent in app callbacks.
We should also change ScheduleManager to not use pause/resume to deal with partial batches (I'll raise a PR sometime this week).
This will allow us to measure the true impact of DeltaScheduler and better understand the balance between responsiveness (injecting delays) and speed of processing.

I think for now we should go with Nicholas' PR (including the 50ms change), close this issue, and re-evaluate once we have more data.

That all said, I have a feeling that DeltaScheduler should somehow be moved out of the runtime and become an app concept. I do not think applying responsiveness rules to all workloads is the right thing to do. And even workloads that need it (like document-centric workloads) only need it in very specific phases (for example, not on boot when we are trying to catch up before returning the container to the app). We should all think about possible designs here.
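
(One possible shape for such an app-controlled design, as a hypothetical TypeScript sketch; OpSchedulingPolicy is not an existing Fluid Framework API.)

interface OpSchedulingPolicy {
    // Return true if op processing should pause and yield to the event loop.
    shouldYield(elapsedMs: number, opsProcessed: number): boolean;
}

// Boot / catch-up: favor throughput and never yield until caught up.
const catchUpPolicy: OpSchedulingPolicy = {
    shouldYield: () => false,
};

// Interactive, document-centric phase: cap each uninterrupted turn.
const interactivePolicy: OpSchedulingPolicy = {
    shouldYield: (elapsedMs) => elapsedMs > 50,
};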

@vladsud
Contributor Author

vladsud commented Apr 12, 2022

I want to slightly change what this issue tracks and push it out a bit (to free time for higher priority items).

I think this bug should track:

  1. Unneeded pausing while loading container. We do not have good data here; we should collect it (by moving Container.loaded = true down a bit to include op processing).
  2. Consider moving DeltaScheduler out of the runtime, or providing the app more control over it, so that the experience controls the balance between responsiveness and throughput.

Related issues:

@vladsud vladsud changed the title DeltaScheduler : We should not inject long delays when processing many ops DeltaScheduler : Unneeded delays on boot Apr 12, 2022
@vladsud vladsud modified the milestones: April 2022, May 2022 Apr 12, 2022
@vladsud vladsud modified the milestones: May 2022, June 2022 Apr 30, 2022
@vladsud vladsud assigned vladsud and unassigned NicholasCouri May 22, 2022
@vladsud vladsud removed this from the June 2022 milestone Jun 3, 2022
@vladsud vladsud added this to the Next milestone Jun 3, 2022
@ghost ghost added the status: stale label Dec 1, 2022
@ghost

ghost commented Dec 1, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@ghost ghost closed this as completed Dec 9, 2022