Op latency: weird spikes every 2 frames #8908

Closed
vladsud opened this issue Jan 27, 2022 · 12 comments
Labels: bug (Something isn't working), perf, status: stale

vladsud commented Jan 27, 2022

We need to look at why we have peaks every 31.5 ms.
I can understand 16ms (assuming a 60FPS system clock), but 2 ticks looks weird.

Primary suspect: DeltaScheduler, which uses 20ms delays (that would translate into 2 ticks).
We should also re-evaluate once BatchManager is out and implicit batching at the JS turn is in.
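A quick way to probe the 2-tick hypothesis (a sketch, not code from the repo): measure how long a nominal 20ms setTimeout actually takes; if the timer is quantized to ~16ms ticks, the observed delay should land near 2 ticks (~32ms).

function measureTimerDelay(requestedMs: number): Promise<number> {
    // Resolve with the actual elapsed time for a setTimeout of requestedMs.
    const start = performance.now();
    return new Promise<number>((resolve) =>
        setTimeout(() => resolve(performance.now() - start), requestedMs));
}

measureTimerDelay(20).then((actualMs) =>
    console.log(`requested 20ms, observed ${actualMs.toFixed(1)}ms`));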

We also need a better view into all stages of this process (a rough sketch follows the list):

  1. How long ops sit on the machine until they are emitted to the socket (that would catch/show the BatchManager impact).
  2. Actual time to ack the op (same as today, but minus #1 and #3, i.e., only network time and service time).
  3. How long ops sit in the inbound queue until they are processed (that would capture the impact of DeltaScheduler & ScheduleManager).
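Here is the rough sketch (not the actual telemetry code; the timestamp names are invented for illustration) of the three durations, assuming we can record a timestamp at each hand-off point:

interface OpTimestamps {
    submitted: number;        // op submitted by the runtime
    emittedToSocket: number;  // op handed to the socket
    acked: number;            // ack received back from the service
    inboundPushed: number;    // ack'd op pushed into the inbound queue
    processed: number;        // op picked up and processed
}

function opStageDurations(t: OpTimestamps) {
    return {
        // 1. time sitting on the machine before hitting the socket (BatchManager impact)
        outboundQueueMs: t.emittedToSocket - t.submitted,
        // 2. pure network + service time
        networkAndServiceMs: t.acked - t.emittedToSocket,
        // 3. time sitting in the inbound queue (DeltaScheduler / ScheduleManager impact)
        inboundQueueMs: t.processed - t.inboundPushed,
    };
}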

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(5d)
| where Data_eventName == "fluid:telemetry:OpPerf:OpRoundtripTime"
| where Data_duration >= 0
| where Data_duration <= 500
| summarize count() by bin(Data_duration, 1)
| render linechart

[image: Data_duration distribution]

vladsud added the bug and perf labels on Jan 27, 2022
vladsud added this to the February 2022 milestone on Jan 27, 2022
vladsud self-assigned this on Jan 27, 2022

vladsud commented Jan 27, 2022

I see 531K "InboundOpsProcessingTime" events in one day.
So yes, DeltaScheduler does seem to be hit a lot.

One thing that I do not like is that we issue the event only if we hit 2K idle states.
It feels like once we set a timer in batchEnd(), we should report an event that it happened, and ideally:

  1. how many ops we have already processed by that moment
  2. how many ops remain in the queue
  3. once done (idle reached), how many ops were actually processed (will differ from #2, as ops keep coming in).

BTW, this.isScheduling can seemingly be removed. It's always true while we process ops and always false when we stop processing ops. Given that we always alternate between these states, there is no reason to check the value; we already know what it is.
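A minimal sketch of the event reporting suggested above (names and shapes are invented for illustration; this is not the actual DeltaScheduler code):

interface ITelemetryLoggerLike {
    sendTelemetryEvent(event: { eventName: string; [key: string]: string | number }): void;
}

class DeltaSchedulerTelemetrySketch {
    private opsProcessedAtYield = 0;

    constructor(private readonly logger: ITelemetryLoggerLike) {}

    // Called when batchEnd() decides to yield and sets the resume timer.
    public onYieldTimerSet(opsProcessedSoFar: number, opsRemainingInQueue: number): void {
        this.opsProcessedAtYield = opsProcessedSoFar;
        this.logger.sendTelemetryEvent({
            eventName: "DeltaSchedulerYield",
            opsProcessedSoFar,       // (1) ops processed until this moment
            opsRemainingInQueue,     // (2) ops still sitting in the queue
        });
    }

    // Called once the inbound queue reaches idle again.
    public onIdle(totalOpsProcessed: number): void {
        this.logger.sendTelemetryEvent({
            eventName: "DeltaSchedulerIdle",
            // (3) ops actually processed after yielding; may differ from (2)
            // because more ops keep arriving.
            opsProcessedAfterYield: totalOpsProcessed - this.opsProcessedAtYield,
        });
    }
}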


vladsud commented Jan 30, 2022

@andre4i, I'll assign this to you, as I think this is most likely the batch manager.
That said, I opened a bug (see above) for DeltaScheduler as well, and it is also a big suspect, given it uses 20ms delays, which map nicely to 32ms on a 60fps system clock. Please work with Navin to figure out next steps here.
Also note that I've opened a couple of issues to better track various parts of the processing pipeline, so we can quickly figure out where the problems are.

andre4i modified the milestones: February 2022, March 2022 on Mar 1, 2022
andre4i modified the milestones: March 2022, April 2022 on Mar 19, 2022
vladsud self-assigned this and unassigned andre4i on Mar 29, 2022

vladsud commented Mar 30, 2022

#9505, #8909, #8912 are the issues tracking actual improvements and collection of more data to get more insight into the problem.
So, moving this issue to May to give enough time to analyze the data once it's available.


vladsud commented May 16, 2022

We still do not have data (the latest changes are not deployed), i.e., "OpRoundtripTime" events do not yet have the Data_durationInboundQueue property. Pushing it out.

vladsud modified the milestones: May 2022, June 2022 on May 16, 2022

vladsud commented May 19, 2022

We have some data.

let P=95;
union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| summarize count(), percentile(Data_durationInboundQueue, P), percentile(Data_durationInboundToProcessing, P), percentile(Data_lengthInboundQueue, P)

count_: 7,874
percentile_Data_durationInboundQueue_95: 364.18
percentile_Data_durationInboundToProcessing_95: 25.15
percentile_Data_lengthInboundQueue_95: 1

Key observations:

  1. durationInboundQueue is really high.
  2. durationInboundToProcessing: its average is almost 2x worse than its P95, suggesting a really heavy tail.
  3. Queue length is tiny, so it's not the number of ops.

durationInboundQueue distribution:
[image]

durationInboundToProcessing distribution:
[image]

And duration:
[image]

So, it looks like durationInboundQueue is the main cause of the problem!


vladsud commented May 22, 2022

this.opPerfData.durationInboundQueue = this.opProcessingTimes.opStartTimeInboundPushEvent - this.opProcessingTimes.opStartTimeSittingInboundQueue;

opStartTimeInboundPushEvent - "push" time for inbound queue
opStartTimeSittingInboundQueue - "push" time for outbound queue
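// Reading of the snippet above (based on the two definitions; not verified against the code):
//
//   opStartTimeSittingInboundQueue  -> captured when the op is pushed to the *outbound* queue
//   opStartTimeInboundPushEvent     -> captured when the ack'd op is pushed to the *inbound* queue
//
// So, despite its name, durationInboundQueue spans outbound push -> inbound push
// (client -> service -> client) and ends before DeltaScheduler ever runs.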

That eliminates DeltaScheduler as the problem.
Note that the inbound queue is sorted, so if ops come out of order, we will not add them until we fill in the gap.
The outbound queue is filled only if we are connected as "write", which should be fine, as switching from "read" to "write" resets all telemetry and will not send data that includes the transition to write.

I highly doubt that gaps in ops are causing it - we observe very few gaps in general once a connection is established.
And the client does not send ops (and thus does not measure these times) until it is fully caught up on connection.

We do not have comparable signal data, but manual experiments show that signals are much smoother in terms of end-to-end feedback. So it's very unlikely that socket.io depends on the system clock to process incoming messages.
This suggests the most likely cause is PUSH.


vladsud commented May 22, 2022

union office_fluid_ffautomation_*
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| summarize count(), toint(avg(Data_durationInboundQueue)), toint(percentile(Data_durationInboundQueue, 95)), percentile(Data_pingLatency, 95) by Data_driverType

Data_driverType   count_    avg_Data_durationInboundQueue   percentile_Data_durationInboundQueue_95   percentile_Data_pingLatency_95
routerlicious     132,896   668                             1,972                                     4,162
odsp              253,882   103                             125                                       45
tinylicious       91,105    7                               26
local             12,652    1                               5

Ping times are collected from the previous ping on the socket, so they can be above roundtrip times.

This data is very concerning. It confirms that op roundtrip times are high and have weird spikes in the distribution, most likely due to service processing (the ordering service).
What is even more concerning is that the FRS numbers are substantially worse than ODSP's.

@GaryWilber, @znewton, I do not know how to drive this to resolution, but if you have service-side numbers for how long it takes to process ops in the ordering service, and we can build a distribution chart to see whether we observe spikes similar to what we see from the client, then we should look into this data.


vladsud commented May 22, 2022

We have actual ping data, but the number of hits is extremely low (499 hits in the last 2 weeks), and the data exists only for FRS, so I would not read too much into it:

union office_fluid_ffautomation_*
| where Event_Time > ago(14d)
| where Data_eventName contains "DeltaLatency"
| summarize count(), toint(avg(Data_duration)), toint(percentile(Data_duration, 95)) by Data_driverType

Here is data from production (ODSP only) for ping latencies - we do not observe timer-based spikes in this data.

union Office_Fluid_FluidRuntime_*
| where Data_eventName contains "DeltaLatency"
| where Data_runtimeVersion contains "0.58.2002"
| where Data_duration < 500
| summarize count() by Data_duration
| render linechart

[image]


vladsud commented May 22, 2022

ODSP stress tests (run by the IDC team - much bigger scale): distribution of durationInboundQueue times.
Same observation - the distribution has multiple spikes.

[image]


vladsud commented May 23, 2022

The R11S data is really weird and does not match the ODSP pattern. It has a huge tail and a humongous standard deviation.
At the same time, we do not observe the same problem of multiple spikes in the data; it's a standard hockey-stick kind of distribution (with a very long tail). That is pretty obvious from the fact that P90 < average! (A tiny sketch of this point follows the table below.)

union office_fluid_ffautomation_*
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| summarize count(), toint(avg(Data_durationInboundQueue)), toint(percentile(Data_durationInboundQueue, 50)), toint(percentile(Data_durationInboundQueue, 90)), toint(stdev(Data_durationInboundQueue))
by Data_driverType

Data_driverType   count_    avg   P50   P90   stdev
routerlicious     136,867   653   104   322   2959
odsp              261,408   103   93    124   49
tinylicious       93,229    7     4     14    9
local             12,876    1     1     2     3
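A tiny sketch (synthetic numbers, not the data above) of why P90 < average points to a heavy tail: a small fraction of very slow ops is enough to pull the average above P90.

// Synthetic sample: 95% of ops at ~100ms, 5% at ~10s.
const samples: number[] = [
    ...Array.from({ length: 95 }, () => 100),
    ...Array.from({ length: 5 }, () => 10_000),
];
const sorted = [...samples].sort((a, b) => a - b);
const p90 = sorted[Math.floor(0.9 * (sorted.length - 1))];        // 100
const avg = samples.reduce((s, v) => s + v, 0) / samples.length;  // 595
console.log({ p90, avg }); // average is ~6x the P90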


vladsud commented May 23, 2022

@microsoft-github-policy-service

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
