DeltaScheduler : Unneeded delays on boot #9505
FYI - across all your April bugs, this one is the most important, and rather simple as well.
Something interesting to note:
This shows 466 results for me - there are sessions where we processed a batch of ops uninterrupted for longer than 500ms. But the code in DeltaScheduler suggests that should not happen - we should interrupt such long op processing. The only exception is when it's part of a batch, i.e. the batch contains many ops and thus DeltaScheduler can't interrupt it. I've just added telemetry to learn more about batches - see #9707, so we should learn more soon, but this looks suspicious. And if batches are that long and we do take that long to process them, then it's not very clear whether there is value in DeltaScheduler's presence. Telemetry tracking op processing (#8912, #8947) should also help us better understand how we perform.
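To make the behavior being discussed concrete, here is a minimal sketch of a time-sliced processing loop. This is illustrative only - not the actual DeltaScheduler code; `QueuedOp`, `processWithBudget`, and `budgetMs` are made-up names. It shows why a single large batch can blow past the budget: yielding is only possible at batch boundaries.

```typescript
// Illustrative sketch only (not the actual DeltaScheduler implementation):
// process queued ops until a time budget is exceeded, then yield to the
// event loop, but never yield in the middle of a batch.
interface QueuedOp {
    isBatchStart: boolean;
    isBatchEnd: boolean;
    process(): void;
}

function processWithBudget(queue: QueuedOp[], budgetMs: number): void {
    const start = Date.now();
    let insideBatch = false;

    while (queue.length > 0) {
        const op = queue.shift()!;
        insideBatch = op.isBatchStart ? true : op.isBatchEnd ? false : insideBatch;
        op.process();

        // Only consider yielding at a batch boundary; a batch must be processed
        // as a single unit, so a huge batch can exceed the budget by a lot.
        if (!insideBatch && Date.now() - start > budgetMs) {
            setTimeout(() => processWithBudget(queue, budgetMs), 0);
            return;
        }
    }
}
```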
@vladsud - Regarding the setTimeout, unless I'm misreading the code, it is already 0 (packages\runtime\container-runtime\src\deltaScheduler.ts).
As for the 466, it only represents 0.19% of the total number of GetDeltas_OpProcessing events. Once the new changes are in, we will probably get more info, but the only explanation I can think of is batching (data from Office_Fluid_FluidRuntime_Performance).
Hm, interesting, I was under the impression that we keep injecting delays.
I think DeltaScheduler.processingTime is too low; I'd personally increase it to at least 50ms. We will not get 60 FPS scrolling / panning with it, but I think if we have such an influx of ops, getting current sooner is more important than maintaining 60 FPS. Telemetry will tell us if my intuition is right - we can make this change now, or after we have data.
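For context on the 60 FPS trade-off: one frame at 60 FPS is 1000ms / 60 ≈ 16.7ms, so a 50ms processing slice spans roughly three frame budgets before yielding. The argument here is that dropping a few frames is acceptable if it gets a large op backlog processed sooner.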
In the PR I moved it to 50ms … should I keep it?
Sure
I was thinking about it and I'm not so sure it is such a good idea. If this query is correct, wouldn't we only get the 25% from the 2 turns (~15k more instances == ~3%)?
union Office_Fluid_FluidRuntime_*
BTW, any data here would be hard to interpret, because op processing can be paused by external forces, including incomplete batches. Without filtering those cases out, we would have a lot of cases where total processing time (including waits) is large simply because we are waiting for a batch to be completed.
I actually think (based on the huge jump in times from numberOfTurns = 1 to numberOfTurns = 2, and from P50 to P99 when numberOfTurns == 1) that almost all slowness is not due to this code. It's due to incomplete batches coming in, and ScheduleManager / ScheduleManagerCore pausing op processing. If op processing is paused externally, then DeltaScheduler would inject pauses itself when op processing is resumed by ScheduleManager, and thus it will only make things worse. Let's not track it here; I'll open a separate bug to think about how we can improve it - both data collection and logging.
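For readers unfamiliar with the pause-on-incomplete-batch behavior mentioned above, here is a rough sketch of the idea. This is a hypothetical shape, not ScheduleManagerCore itself; `InboundOp`, `batchMetadata`, and `shouldPauseForIncompleteBatch` are illustrative names.

```typescript
// Illustrative sketch: if the tail of the pending inbound ops opened a batch
// that has not yet been closed, processing must wait for the rest of the batch.
interface InboundOp {
    // true = batch start, false = batch end, undefined = not a boundary (assumed shape)
    batchMetadata?: boolean;
}

function shouldPauseForIncompleteBatch(pending: InboundOp[]): boolean {
    let openBatch = false;
    for (const op of pending) {
        if (op.batchMetadata === true) { openBatch = true; }
        else if (op.batchMetadata === false) { openBatch = false; }
    }
    // An open batch with no closing op yet means we must wait; any delay
    // DeltaScheduler injects on resume comes on top of this external wait.
    return openBatch;
}
```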
There is another reason why we might be seeing more instances of these long processing times - we changed the default flush mode to TurnBased. This means that earlier we saw very few batches, but now we will see batches almost all the time. The op processing time includes the time spent by external code, which may be pretty long (wasn't there a discussion where Scriptor did some kind of layout rendering on batch end which took a long time?). So even a small batch (say 100 ops) could end up taking a long time to process, and hence the spike.
Should we try to measure this time vs. the time spent by the runtime? That may give us a better idea of where most of the time is being spent, and we could adjust the processing time in DeltaScheduler accordingly.
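One way to get that split would be instrumentation along these lines - a minimal sketch with a hypothetical helper (`processBatchWithTiming`), not existing Fluid telemetry code:

```typescript
// Time the full processing of a batch and, separately, the portion spent in
// the app/external callback, so the two can be compared in telemetry.
function processBatchWithTiming(
    ops: Array<() => void>,
    appCallback: () => void,
): { totalMs: number; appCallbackMs: number } {
    const start = performance.now();
    for (const processOp of ops) {
        processOp();
    }
    const beforeCallback = performance.now();
    appCallback(); // e.g. the app reacting to the end of the batch
    const end = performance.now();
    return {
        totalMs: end - start,
        appCallbackMs: end - beforeCallback,
    };
}
```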
Yes, I've opened an issue to track time spent in app callbacks. I think for now we should go with Nicholas' PR (including the 50ms change), close this issue, and re-evaluate once we have more data. That all said, I have a feeling that DeltaScheduler should somehow be moved out of the runtime and become an app concept. I do not think applying responsiveness rules to all workloads is the right thing to do. And even workloads that need it (like document-centric workloads) only need it in very specific phases (for example, not on boot when we are trying to catch up before returning the container to the app). We all should think about possible designs here.
I want to slightly change what this issue tracks and push it out a bit (to free time for higher priority items). I think this bug should track:
Related issues:
This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
(Update)
Change scope of this item to track two key directions:
Teams thread
The issue that was raised - when we catch up and have a lot of ops to process, we waste time for nothing by injecting delays.
Not clear if we can / should differentiate catch up vs. other times we process ops.
But overall, I'd say that using setTimeout(0) should be more than enough (see the sketch below).
We also should look at telemetry to see when and how often that happens.
Does it happen mostly on boot? When do we have so many ops to process and why?
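A minimal sketch of the setTimeout(0) idea (hypothetical names; the 20ms fallback is an arbitrary illustrative value, not the real constant):

```typescript
// During catch-up, yield with setTimeout(0) so the event loop can run, instead
// of injecting a fixed artificial delay between chunks of op processing.
function scheduleNextChunk(processChunk: () => void, catchingUp: boolean): void {
    // A non-zero delay only makes sense when we want to keep the UI responsive;
    // during boot/catch-up it just slows down getting the container current.
    const delayMs = catchingUp ? 0 : 20; // 20ms is an arbitrary illustrative value
    setTimeout(processChunk, delayMs);
}
```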