Op latency: weird spikes every 2 frames #8908
I see 531K "InboundOpsProcessingTime" events in one day. One thing that I do not like is that we issue an event only once we hit 2K idle states.
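For context, the pattern being described is roughly the following (a minimal sketch; the class, method, and field names are assumptions for illustration, not the actual runtime code):

const IDLE_STATES_PER_EVENT = 2000; // the "2K idle states" mentioned above

type TelemetryEvent = { eventName: string; duration: number; opCount: number };

class InboundOpsPerfTracker {
    private idleCount = 0;
    private accumulatedTimeMs = 0;
    private opCount = 0;

    constructor(private readonly send: (event: TelemetryEvent) => void) {}

    // Called after each batch of inbound ops is processed.
    public recordBatch(processingTimeMs: number, opsProcessed: number): void {
        this.accumulatedTimeMs += processingTimeMs;
        this.opCount += opsProcessed;
    }

    // Called whenever the inbound queue transitions to idle; the event is only
    // emitted after the idle counter reaches the threshold.
    public recordIdle(): void {
        if (++this.idleCount < IDLE_STATES_PER_EVENT) {
            return;
        }
        this.send({
            eventName: "InboundOpsProcessingTime",
            duration: this.accumulatedTimeMs,
            opCount: this.opCount,
        });
        this.idleCount = 0;
        this.accumulatedTimeMs = 0;
        this.opCount = 0;
    }
}

The point is that each emitted event already aggregates many processing/idle cycles, so 531K events per day represents far more underlying activity.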
BTW, it seems this.isScheduling can be removed. It's always true when we process ops and always false when we stop processing ops. Given that we always alternate between these states, there is no reason to check the value - we already know what it is.
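To spell the argument out (a simplified sketch, not the actual DeltaScheduler source; the method names are placeholders):

class DeltaSchedulerSketch {
    private isScheduling = false; // candidate for removal

    public processingStarted(): void {
        // Only reachable when we begin processing ops, so isScheduling is known to be false here.
        this.isScheduling = true;
    }

    public processingStopped(): void {
        // Only reachable when we stop processing ops, so isScheduling is known to be true here.
        this.isScheduling = false;
    }
}

Since the two calls always strictly alternate, the value is statically known at every site where it might be read.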
@andre4i, I'll assign it to you, as I think this is most likely the batch manager.
We still do not have data (the latest changes are not deployed), i.e. "OpRoundtripTime" events do not yet have the Data_durationInboundQueue property. Pushing it out.
this.opPerfData.durationInboundQueue = this.opProcessingTimes.opStartTimeInboundPushEvent - this.opProcessingTimes.opStartTimeSittingInboundQueue;

opStartTimeInboundPushEvent is the "push" time for the inbound queue. That eliminates DeltaScheduler as the problem. I highly doubt that gaps in ops are causing it - we observe very few gaps in general once a connection is established. We do not have signal data like that, but manual experiments show that signals are much smoother in terms of end-to-end feedback. So it's very unlikely that socket.io depends on the system clock to process incoming messages.
Ping times are collected from the previous ping on the socket, so they can be above roundtrip times. This data is very concerning. It confirms that op roundtrip times are high and have weird spikes in the distribution, most likely due to service processing (ordering service). @GaryWilber, @znewton, I do not know how to drive this to resolution, but if you have service-side numbers for how long it takes to process ops in the ordering service, and if we can build a distribution chart to see whether we observe spikes similar to what we see from the client, then we should look into this data.
We have actual ping data, but the number of hits is extremely low (499 hits in the last 2 weeks), and the data exists only for FRS, so I would not read too much into it.
Here is data from production (ODSP only) for ping latencies - we do not observe timer-based spikes in this data.

union Office_Fluid_FluidRuntime_*
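The rest of that query did not survive the copy; a hedged sketch of what a ping-latency distribution query of the same shape might look like (Data_pingLatency is an assumed column name, mirroring the OpRoundtripTime query in the issue body below):

union Office_Fluid_FluidRuntime_*
| where Event_Time > ago(5d)
| where isnotempty(Data_pingLatency) // assumed column name
| summarize count() by bin(Data_pingLatency, 1)
| render linechart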
R11S data is really weird and does not match the ODSP pattern. It has a huge tail and a humongous standard deviation.
From Gary:

Random latency spikes when producing messages · Issue #2912 · edenhill/librdkafka (github.com)
Significant performance regression between 1.6.1 and 1.7.0 · Issue #3538 · edenhill/librdkafka (github.com)

FRS is Linux-based and Push is Windows-based, so the queuing/timer logic in the Kafka library will differ across the two.
We need to look into why we have peaks every 31.5 ms.
I can understand 16ms (assuming a 60FPS system clock), but 2 ticks look weird.
Primary suspect - DeltaScheduler, which introduces 20ms delays (that would translate into 2 ticks).
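A quick way to check that suspicion locally (a sketch, not from the issue; it just measures how a nominal 20ms setTimeout actually resolves):

// If the timer granularity is ~15.6-16ms, a requested 20ms delay tends to land
// near ~31ms, i.e. two ticks - the same spacing as the spikes above.
const samples: number[] = [];

function measureOnce(remaining: number): void {
    const start = Date.now();
    setTimeout(() => {
        samples.push(Date.now() - start);
        if (remaining > 1) {
            measureOnce(remaining - 1);
        } else {
            samples.sort((a, b) => a - b);
            const median = samples[Math.floor(samples.length / 2)];
            console.log(`min=${samples[0]}ms median=${median}ms max=${samples[samples.length - 1]}ms`);
        }
    }, 20);
}

measureOnce(200);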
We should also re-evaluate once BatchManager is out and implicit batching at the JS turn is in.
We also need a better view into all stages of this process:
Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(5d)
| where Data_eventName == "fluid:telemetry:OpPerf:OpRoundtripTime"
| where Data_duration >= 0
| where Data_duration <= 500
| summarize count() by bin(Data_duration, 1)
| render linechart
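Once the Data_durationInboundQueue property mentioned above ships, the same shape of query can be pointed at each stage separately (a sketch, assuming the column lands under that name):

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(5d)
| where Data_eventName == "fluid:telemetry:OpPerf:OpRoundtripTime"
| where Data_durationInboundQueue between (0 .. 500)
| summarize count() by bin(Data_durationInboundQueue, 1)
| render linechart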