Op latency: weird spikes every 2 frames #8908

Closed
vladsud opened this issue Jan 27, 2022 · 12 comments
Labels: bug (Something isn't working), perf, status: stale

vladsud commented Jan 27, 2022

We need to look at why we have peaks every 31.5 ms.
I can understand 16ms (assuming a 60FPS system clock), but 2 ticks looks weird.

Primary suspect: DeltaScheduler, which uses 20ms delays (that would translate into 2 ticks).
We should also re-evaluate once BatchManager is out and implicit batching at the JS turn is in.
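A quick way to probe the 2-tick hypothesis (a sketch, not code from the repo): measure how long a nominal 20ms setTimeout actually takes; if the timer is quantized to ~16ms ticks, the observed delay should land near 2 ticks (~32ms).

function measureTimerDelay(requestedMs: number): Promise<number> {
    // Resolve with the actual elapsed time for a setTimeout of requestedMs.
    const start = performance.now();
    return new Promise<number>((resolve) =>
        setTimeout(() => resolve(performance.now() - start), requestedMs));
}

measureTimerDelay(20).then((actualMs) =>
    console.log(`requested 20ms, observed ${actualMs.toFixed(1)}ms`));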

We also need a better view into all stages of this process (a rough sketch follows the list):

  1. How long ops sit on the machine until they are emitted to the socket (that would catch/show the BatchManager impact).
  2. Actual time to ack the op (same as today, but minus #1 and #3, i.e., only network time and service time).
  3. How long ops sit in the inbound queue until they are processed (that would capture the impact of DeltaScheduler & ScheduleManager).
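Here is the rough sketch (not the actual telemetry code; the timestamp names are invented for illustration) of the three durations, assuming we can record a timestamp at each hand-off point:

interface OpTimestamps {
    submitted: number;        // op submitted by the runtime
    emittedToSocket: number;  // op handed to the socket
    acked: number;            // ack received back from the service
    inboundPushed: number;    // ack'd op pushed into the inbound queue
    processed: number;        // op picked up and processed
}

function opStageDurations(t: OpTimestamps) {
    return {
        // 1. time sitting on the machine before hitting the socket (BatchManager impact)
        outboundQueueMs: t.emittedToSocket - t.submitted,
        // 2. pure network + service time
        networkAndServiceMs: t.acked - t.emittedToSocket,
        // 3. time sitting in the inbound queue (DeltaScheduler / ScheduleManager impact)
        inboundQueueMs: t.processed - t.inboundPushed,
    };
}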

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(5d)
| where Data_eventName == "fluid:telemetry:OpPerf:OpRoundtripTime"
| where Data_duration >= 0
| where Data_duration <= 500
| summarize count() by bin(Data_duration, 1)
| render linechart

[image: Data_duration distribution]

vladsud added the bug and perf labels on Jan 27, 2022
vladsud added this to the February 2022 milestone on Jan 27, 2022
vladsud self-assigned this on Jan 27, 2022

vladsud commented Jan 27, 2022

I see 531K "InboundOpsProcessingTime" events in one day.
So yes, DeltaScheduler does seem to be hit a lot.

One thing that I do not like is that we issue the event only if we hit 2K idle states.
It feels like once we set a timer in batchEnd(), we should report an event that it happened, and ideally:

  1. how many ops we have already processed by that moment
  2. how many ops remain in the queue
  3. once done (idle reached), how many ops were actually processed (will differ from #2, as ops keep coming in).

BTW, this.isScheduling can seemingly be removed. It's always true while we process ops and always false when we stop processing ops. Given that we always alternate between these states, there is no reason to check the value; we already know what it is.
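A minimal sketch of the event reporting suggested above (names and shapes are invented for illustration; this is not the actual DeltaScheduler code):

interface ITelemetryLoggerLike {
    sendTelemetryEvent(event: { eventName: string; [key: string]: string | number }): void;
}

class DeltaSchedulerTelemetrySketch {
    private opsProcessedAtYield = 0;

    constructor(private readonly logger: ITelemetryLoggerLike) {}

    // Called when batchEnd() decides to yield and sets the resume timer.
    public onYieldTimerSet(opsProcessedSoFar: number, opsRemainingInQueue: number): void {
        this.opsProcessedAtYield = opsProcessedSoFar;
        this.logger.sendTelemetryEvent({
            eventName: "DeltaSchedulerYield",
            opsProcessedSoFar,       // (1) ops processed until this moment
            opsRemainingInQueue,     // (2) ops still sitting in the queue
        });
    }

    // Called once the inbound queue reaches idle again.
    public onIdle(totalOpsProcessed: number): void {
        this.logger.sendTelemetryEvent({
            eventName: "DeltaSchedulerIdle",
            // (3) ops actually processed after yielding; may differ from (2)
            // because more ops keep arriving.
            opsProcessedAfterYield: totalOpsProcessed - this.opsProcessedAtYield,
        });
    }
}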


vladsud commented Jan 30, 2022

@andre4i, I'll assign this to you, as I think this is most likely the batch manager.
That said, I opened a bug (see above) for DeltaScheduler as well, and it is also a big suspect, given it uses 20ms delays, which map nicely to 32ms on a 60fps system clock. Please work with Navin to figure out next steps here.
Also note that I've opened a couple of issues to better track various parts of the processing pipeline, so we can quickly figure out where the problems are.

andre4i modified the milestones: February 2022, March 2022 on Mar 1, 2022
andre4i modified the milestones: March 2022, April 2022 on Mar 19, 2022
vladsud self-assigned this and unassigned andre4i on Mar 29, 2022

vladsud commented Mar 30, 2022

#9505, #8909, #8912 are the issues tracking actual improvements and collection of more data to get more insight into the problem.
So, moving this issue to May to give enough time to analyze the data once it's available.


vladsud commented May 16, 2022

We still do not have data (the latest changes are not deployed), i.e., "OpRoundtripTime" events do not yet have the Data_durationInboundQueue property. Pushing it out.

vladsud modified the milestones: May 2022, June 2022 on May 16, 2022

vladsud commented May 19, 2022

We have some data.

let P=95;
union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| summarize count(), percentile(Data_durationInboundQueue, P), percentile(Data_durationInboundToProcessing, P), percentile(Data_lengthInboundQueue, P)

count_: 7,874
percentile_Data_durationInboundQueue_95: 364.18
percentile_Data_durationInboundToProcessing_95: 25.15
percentile_Data_lengthInboundQueue_95: 1

Key observations:

  1. durationInboundQueue is really high.
  2. durationInboundToProcessing: its average is almost 2x worse than its P95, suggesting a really heavy tail.
  3. Queue length is tiny, so it's not the number of ops.

durationInboundQueue distribution:
[image]

durationInboundToProcessing distribution:
[image]

And duration:
[image]

So, it looks like durationInboundQueue is the main cause of the problem!


vladsud commented May 22, 2022

this.opPerfData.durationInboundQueue = this.opProcessingTimes.opStartTimeInboundPushEvent - this.opProcessingTimes.opStartTimeSittingInboundQueue;

opStartTimeInboundPushEvent - "push" time for inbound queue
opStartTimeSittingInboundQueue - "push" time for outbound queue
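// Reading of the snippet above (based on the two definitions; not verified against the code):
//
//   opStartTimeSittingInboundQueue  -> captured when the op is pushed to the *outbound* queue
//   opStartTimeInboundPushEvent     -> captured when the ack'd op is pushed to the *inbound* queue
//
// So, despite its name, durationInboundQueue spans outbound push -> inbound push
// (client -> service -> client) and ends before DeltaScheduler ever runs.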

That eliminates DeltaScheduler as the problem.
Note that the inbound queue is sorted, so if ops come out of order, we will not add them until we fill in the gap.
The outbound queue is filled only if we are connected as "write", which should be fine, as switching from "read" to "write" resets all telemetry and will not send data that includes the transition to write.

I highly doubt that gaps in ops are causing it - we observe very few gaps in general once a connection is established.
And the client does not send ops (and thus does not measure these times) until it is fully caught up on connection.

We do not have comparable signal data, but manual experiments show that signals are much smoother in terms of end-to-end feedback. So it's very unlikely that socket.io depends on the system clock to process incoming messages.
This suggests the most likely cause is PUSH.


vladsud commented May 22, 2022

union office_fluid_ffautomation_*
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| summarize count(), toint(avg(Data_durationInboundQueue)), toint(percentile(Data_durationInboundQueue, 95)), percentile(Data_pingLatency, 95) by Data_driverType

Data_driverType   count_    avg_Data_durationInboundQueue   percentile_Data_durationInboundQueue_95   percentile_Data_pingLatency_95
routerlicious     132,896   668                             1,972                                     4,162
odsp              253,882   103                             125                                       45
tinylicious       91,105    7                               26
local             12,652    1                               5

Ping times are collected from the previous ping on the socket, so they can be above roundtrip times.

This data is very concerning. It confirms that op roundtrip times are high and have weird spikes in the distribution, most likely due to service processing (the ordering service).
What is even more concerning is that the FRS numbers are substantially worse than ODSP's.

@GaryWilber, @znewton, I do not know how to drive this to resolution, but if you have service-side numbers for how long it takes to process ops in the ordering service, and we can build a distribution chart to see whether we observe spikes similar to what we see from the client, then we should look into this data.


vladsud commented May 22, 2022

We have actual ping data, but the number of hits is extremely low (499 hits in the last 2 weeks), and the data exists only for FRS, so I would not read too much into it:

union office_fluid_ffautomation_*
| where Event_Time > ago(14d)
| where Data_eventName contains "DeltaLatency"
| summarize count(), toint(avg(Data_duration)), toint(percentile(Data_duration, 95)) by Data_driverType

Here is data from production (ODSP only) for ping latencies - we do not observe timer-based spikes in this data.

union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "DeltaLatency"
| where Data_runtimeVersion contains "0.58.2002"
| where Data_duration < 500
| summarize count() by Data_duration
| render linechart

[image]


vladsud commented May 22, 2022

ODSP stress tests (run by the IDC team - much bigger scale): distribution of durationInboundQueue times.
Same observation - the distribution has multiple spikes.

[image]


vladsud commented May 23, 2022

The R11S data is really weird and does not match the ODSP pattern. It has a huge tail and a humongous standard deviation.
At the same time, we do not observe the same problem of multiple spikes in the data; it's a standard hockey-stick kind of distribution (with a very long tail). That is pretty obvious from the fact that P90 < average! (A tiny sketch of this point follows the table below.)

union office_fluid_ffautomation_*
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| summarize count(), toint(avg(Data_durationInboundQueue)), toint(percentile(Data_durationInboundQueue, 50)), toint(percentile(Data_durationInboundQueue, 90)), toint(stdev(Data_durationInboundQueue))
by Data_driverType

Data_driverType   count_    avg   P50   P90   stdev
routerlicious     136,867   653   104   322   2959
odsp              261,408   103   93    124   49
tinylicious       93,229    7     4     14    9
local             12,876    1     1     2     3
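A tiny sketch (synthetic numbers, not the data above) of why P90 < average points to a heavy tail: a small fraction of very slow ops is enough to pull the average above P90.

// Synthetic sample: 95% of ops at ~100ms, 5% at ~10s.
const samples: number[] = [
    ...Array.from({ length: 95 }, () => 100),
    ...Array.from({ length: 5 }, () => 10_000),
];
const sorted = [...samples].sort((a, b) => a - b);
const p90 = sorted[Math.floor(0.9 * (sorted.length - 1))];        // 100
const avg = samples.reduce((s, v) => s + v, 0) / samples.length;  // 595
console.log({ p90, avg }); // average is ~6x the P90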


vladsud commented May 23, 2022

@microsoft-github-policy-service

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
