
DeltaScheduler : Unneeded delays on boot #9505

Closed
Tracked by #9613
vladsud opened this issue Mar 16, 2022 · 15 comments

@vladsud
Contributor

vladsud commented Mar 16, 2022

(Update)

Change scope of this item to track two key directions:

  1. Unneeded pausing while loading container. We do not have good data here; we should collect it (by moving Container.loaded = true down a bit to include op processing).
  2. Consider moving DeltaScheduler out of the runtime, or providing the app more control over it, so that the experience controls the balance between responsiveness and throughput.

Teams thread

The issue that was raised: when we catch up and have a lot of ops to process, we waste time for nothing by injecting delays.
It's not clear if we can / should differentiate catch-up from other times we process ops.
But overall, I'd say that using setTimeout(0) should be more than enough.

We should also look at telemetry for when and how often that happens.
Does it happen mostly on boot? When do we have so many ops to process, and why?
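
(For reference, a minimal TypeScript sketch of the yield pattern being discussed - pausing the inbound queue and resuming on a zero-delay timer so control returns to the event loop without an artificial wait. The InboundQueue interface here is a stand-in for the real inbound delta queue, not the actual API.)

interface InboundQueue {
    pause(): void;
    resume(): void;
}

function yieldBetweenOpChunks(inbound: InboundQueue): void {
    // Pausing and resuming via setTimeout(0) defers resumption to the next
    // macrotask, giving pending UI work a chance to run without adding delay.
    inbound.pause();
    setTimeout(() => inbound.resume(), 0);
}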

@vladsud vladsud added the bug Something isn't working label Mar 16, 2022
@vladsud vladsud added this to the April 2022 milestone Mar 16, 2022
@vladsud
Contributor Author

vladsud commented Mar 24, 2022

FYI - across all your April bugs, this one is the most important, and rather simple as well.
It's important because we know that timers like that screw up our end-to-end processing times, adding unneeded delays.

@vladsud
Contributor Author

vladsud commented Apr 1, 2022

Something interesting to note:

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(7d)
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_hostScenarioName notcontains "Whiteboard"
| where Data_duration >= 500
| summarize count()

This shows 466 results for me - there are sessions where we processed a batch of ops uninterrupted for longer than 500ms.
If you trace the "GetDeltas_OpProcessing" event, you will see that it is raised on the inbound queue's "idle" event, which only fires if we processed ops without any interruption and drained the queue (see DeltaQueue.processDeltas() for details).

But the code in DeltaScheduler suggests that should not happen - we should interrupt such long op processing. The only exception is if it's part of a batch, i.e. the batch contains many ops and thus DeltaScheduler can't interrupt it.
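
(A hedged sketch of the interruption check DeltaScheduler is expected to apply while draining the queue; the names and the 20ms budget are illustrative, not the actual constants in deltaScheduler.ts.)

const processingBudgetMs = 20; // illustrative budget, not the real value

function shouldInterrupt(turnStartMs: number, isMidBatch: boolean): boolean {
    // A batch must be processed atomically, so a very large batch is the one
    // case where an uninterrupted turn can legitimately exceed the budget.
    if (isMidBatch) {
        return false;
    }
    return Date.now() - turnStartMs >= processingBudgetMs;
}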

I've just added telemetry to learn more about batches (see #9707), so we should learn more soon, but this looks suspicious. And if batches are that long and we do take that long to process them, then it's not clear there is much value in having DeltaScheduler at all.

Telemetry tracking op processing (#8912, #8947) should also help us better understand how we perform.

@NicholasCouri
Contributor

NicholasCouri commented Apr 7, 2022

@vladsud - Regarding the setTimeout, unless I'm misreading the code, it is already 0 (packages\runtime\container-runtime\src\deltaScheduler.ts):

this.deltaManager.inbound.pause();
// No delay argument, so this resumes on the next macrotask - effectively setTimeout(0).
setTimeout(() => { this.deltaManager.inbound.resume(); });

As for the 466, it only represents 0.19% of the total number of GetDeltas_OpProcessing events. Once the new changes are in, we will probably get more info, but the only explanation I can think of is batching.

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(7d)
| where Data_eventName == "fluid:telemetry:OpPerf:GetDeltas_OpProcessing"
| where Data_hostScenarioName notcontains "Whiteboard"
| summarize ct = count() by bin(Data_duration, 500)

Data_duration ct
0 247708
500 240
1000 73
1500 36
2000 20
2500 11
3000 14
3500 21
4000 5
4500 5
5000 3
5500 2
6000 5
6500 1
7000 6
7500 3
8000 1
8500 1
13500 1
17500 1
24000 1
38500 1
39500 1
426500 1

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

Hm, interesting, I was under the impression that we keep injecting delays.
Then, I guess there is not much to do here, but we should improve telemetry (as covered by #8909).

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

I think DeltaScheduler.processingTime is too low; I'd personally increase it to at least 50ms. We will not get 60 FPS scrolling / panning with it, but I think if we have such an influx of ops, getting current sooner is more important than maintaining 60 FPS.

Telemetry will tell us if my intuition is right - we can make this change now, or after we have data.
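
(Rough arithmetic behind that trade-off, as a small TypeScript snippet; the 50ms figure is the value proposed above.)

const frameBudgetMs = 1000 / 60;   // ~16.7 ms per frame at 60 FPS
const proposedBudgetMs = 50;       // processingTime value proposed above
const framesSpanned = Math.ceil(proposedBudgetMs / frameBudgetMs); // = 3
// A 50 ms uninterrupted processing slice spans about 3 frames, so smooth
// 60 FPS scrolling is traded for catching up on ops sooner.
console.log({ frameBudgetMs, proposedBudgetMs, framesSpanned });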

@NicholasCouri
Contributor

NicholasCouri commented Apr 8, 2022 via email

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

Sure

@NicholasCouri
Contributor

I was thinking about it and I'm not so sure it is such a good idea. If this query is correct, wouldn't we get only the 25% from the 2 Turns (~15k more instances == ~3%)?

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(1d)
| where Data_eventName == "fluid:telemetry:ScheduleManager:DeltaScheduler:InboundOpsProcessingTime"
| project-reorder Data_numberOfOps, Data_numberOfTurns, Data_numberOfBatches, Data_processingTime
| summarize Avgt = toint(avg(Data_processingTime)),
P50 = toint(percentile(Data_processingTime, 50)),
Pct75 = toint(percentile(Data_processingTime, 75)),
Pct25 = toint(percentile(Data_processingTime, 25)),
Pct99 = toint(percentile(Data_processingTime, 99)),
Pct99_99 = toint(percentile(Data_processingTime, 99.99)),
NumberOfInstances = count() by Data_numberOfTurns

Data_numberOfTurns Avgt P50 Pct75 Pct25 Pct99 Pct99_99 NumberOfInstances
1 20 2 6 0 585 6766 392981
2 419 166 536 35 2845 21853 56012
3 649 392 697 82 3232 101312 19335
4 756 400 765 125 3661 1195883 8054
5 702 271 900 177 4435 27557 3787
6 780 285 938 242 5514 30754 1939
7 784 352 819 315 4945 29903 1192
8 814 440 843 401 4480 9302 723
9 1030 538 949 492 6517 21722 504
10 1136 648 942 599 7918 11009 351
11 1235 785 1097 719 9183 12109 252
12 1342 904 1274 858 11335 13650 204
13 1462 1048 1197 993 7081 8867 134
14 8593 1196 1397 1145 10679 756929 111
15 1676 1344 1552 1295 8161 8161 75
16 1682 1510 1697 1440 3640 3640 72
17 1980 1771 1994 1662 5268 5268 45
18 3404 2154 2608 1919 34415 34415 32
19 2470 2086 2362 2032 5509 5509 22
20 2894 2335 3364 2286 5785 5785 15
21 3552 2543 2727 2431 17679 17679 17
22 3444 2810 3627 2631 6395 6395 16
23 4204 3475 4151 3052 8018 8018 7
24 4079 3570 4827 3296 5905 5905 12
25 3989 3662 4315 3601 5194 5194 8
26 4688 3904 4089 3736 9288 9288 8
27 46277 4203 17102 3994 240580 240580 7
28 5949 4603 8056 4545 8398 8398 6
29 5101 4701 5420 4693 6504 6504 6
30 7080 7080 7080 7080 7080 7080 1
31 5322 5322 5322 5322 5322 5322 1
32 6073 5941 6442 5836 6442 6442 3
33 9502 9502 9502 9502 9502 9502 1
34 13788 7006 27432 6925 27432 27432 3
35 10600 9156 15657 6988 15657 15657 3
36 7531 7077 7256 6971 8821 8821 4
37 7868 7868 7868 7868 7868 7868 1
38 7753 7558 7948 7558 7948 7948 2
39 9116 9116 9116 9116 9116 9116 1
40 701948 701948 701948 701948 701948 701948 1
42 10593 10593 10593 10593 10593 10593 1
43 10160 10160 10160 10160 10160 10160 1
44 10419 10419 10419 10419 10419 10419 1
46 116289 71518 161060 71518 161060 161060 2
48 13103 13103 13103 13103 13103 13103 1
49 2491221 12977 4969465 12977 4969465 4969465 2
68 96631 96631 96631 96631 96631 96631 1
71             1

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

BTW, any data here will be hard to interpret, because op processing can be paused by external forces, including incomplete batches. Without filtering those cases out, we would have a lot of cases where the total processing time (including waits) is large simply because we are waiting for a batch to be completed.

@vladsud
Contributor Author

vladsud commented Apr 8, 2022

I actually think (based on the huge jump in times from numberOfTurns == 1 to numberOfTurns == 2, and from P50 to P99 when numberOfTurns == 1) that almost all slowness is not due to this code. It's due to incomplete batches coming in, and ScheduleManager / ScheduleManagerCore pausing op processing. If op processing is paused externally, then DeltaScheduler will inject pauses itself when op processing is resumed by ScheduleManager, and thus it will only make things worse.

Let's not track it here; I'll open a separate bug to think about it and how we can improve it - both data collection and logging.

@agarwal-navin
Contributor

There is another reason why we might be seeing more instances of these long processing times - we changed the default flush mode to TurnBased. This means that earlier we saw very few batches, but now we will see batches almost all the time. The op processing time includes the time spent by external code, which may be pretty long (wasn't there a discussion where Scriptor did some kind of layout rendering on batch ends which took a long time?). So even a small batch (say 100 ops) could end up taking a long time to process, hence the spike.

@agarwal-navin
Contributor

Should we try to measure this time vs. the time spent by the runtime? That may give us a better idea of where most of the time is being spent, and we could adjust the processing time in DeltaScheduler accordingly.
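
(A minimal sketch of what such a split measurement could look like; timeExternalCallback and reportProcessingSplit are hypothetical helpers, not existing runtime APIs.)

let externalTimeMs = 0;

// Wrap calls into app/event-handler code so their duration can be separated
// from time spent inside the runtime itself.
function timeExternalCallback<T>(callback: () => T): T {
    const start = Date.now();
    try {
        return callback();
    } finally {
        externalTimeMs += Date.now() - start;
    }
}

function reportProcessingSplit(totalTimeMs: number): void {
    const runtimeTimeMs = totalTimeMs - externalTimeMs;
    // Telemetry could log both numbers, showing whether a long turn is spent
    // in the runtime or in app callbacks (e.g. batch-end layout work).
    console.log({ totalTimeMs, runtimeTimeMs, externalTimeMs });
    externalTimeMs = 0;
}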

@vladsud
Contributor Author

vladsud commented Apr 11, 2022

Yes, I've opened an issue to track time spent in app callbacks.
We should also change ScheduleManager to not use pause/resume to deal with partial batches (I'll raise a PR sometime this week).
This will allow us to measure the true impact of DeltaScheduler and better understand the balance between responsiveness (injecting delays) and speed of processing.

I think for now we should go with Nicholas' PR (including the 50ms change), close this issue, and re-evaluate once we have more data.

That all said, I have a feeling that DeltaScheduler should somehow be moved out of the runtime and become an app concept. I do not think applying responsiveness rules to all workloads is the right thing to do. And even workloads that need it (like document-centric workloads) only need it in very specific phases (for example, not on boot when we are trying to catch up before returning the container to the app). We should all think about possible designs here.
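
(One possible shape for such an app-controlled design, as a hypothetical TypeScript sketch; OpSchedulingPolicy is not an existing Fluid Framework API.)

interface OpSchedulingPolicy {
    // Return true if op processing should pause and yield to the event loop.
    shouldYield(elapsedMs: number, opsProcessed: number): boolean;
}

// Boot / catch-up: favor throughput and never yield until caught up.
const catchUpPolicy: OpSchedulingPolicy = {
    shouldYield: () => false,
};

// Interactive, document-centric phase: cap each uninterrupted turn.
const interactivePolicy: OpSchedulingPolicy = {
    shouldYield: (elapsedMs) => elapsedMs > 50,
};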

@vladsud
Contributor Author

vladsud commented Apr 12, 2022

I want to slightly change what this issue tracks and push it out a bit (to free time for higher priority items).

I think this bug should track:

  1. Unneeded pausing while loading container. We do not have good data here; we should collect it (by moving Container.loaded = true down a bit to include op processing).
  2. Consider moving DeltaScheduler out of the runtime, or providing the app more control over it, so that the experience controls the balance between responsiveness and throughput.

Related issues:

@vladsud vladsud changed the title DeltaScheduler : We should not inject long delays when processing many ops DeltaScheduler : Unneeded delays on boot Apr 12, 2022
@vladsud vladsud modified the milestones: April 2022, May 2022 Apr 12, 2022
@vladsud vladsud modified the milestones: May 2022, June 2022 Apr 30, 2022
@vladsud vladsud assigned vladsud and unassigned NicholasCouri May 22, 2022
@vladsud vladsud removed this from the June 2022 milestone Jun 3, 2022
@vladsud vladsud added this to the Next milestone Jun 3, 2022
@ghost ghost added the status: stale label Dec 1, 2022
@ghost

ghost commented Dec 1, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@ghost ghost closed this as completed Dec 9, 2022