FRS op ack latencies have really high averages and standard deviation #10427

Closed
vladsud opened this issue May 25, 2022 · 1 comment
Labels: bug (Something isn't working), status: stale

Comments

vladsud (Contributor) commented May 25, 2022

We have a very small-scale test that runs as part of our post-CI flow.
It runs against FRS canary and ODSP prod, with exactly the same code and payload, at about 10 sequenced ops/s, i.e. a tiny load.
Here are the metrics from these runs:

```kusto
union office_fluid_ffautomation_*
| where Event_Time > ago(7d)
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| where Data_driverType == "routerlicious"
| summarize count(), toint(avg(Data_duration)), toint(percentile(Data_duration, 90)), toint(stdev(Data_duration))
```

|           | FRS  | ODSP |
| --------- | ---- | ---- |
| Average   | 400  | 105  |
| P90       | 180  | 125  |
| Std. dev. | 2095 | 51   |

Numbers are in milliseconds and measure the end-to-end latency of an op: from the moment the client sends it until the client receives it back after the ordering service has acked it. The numbers include some client-code overhead; we have a measurement that lets us subtract it, but I'll use the same metric across the board because the data below comes from older client builds that do not yet have that change.
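For readers less familiar with the metric, here is a minimal TypeScript sketch of how an OpRoundtripTime-style measurement can be collected on the client. This is an illustrative assumption about the mechanism, not the actual Fluid Framework telemetry code; the `OpRoundtripTracker` / `RoundtripLogger` names and methods below are hypothetical.

```typescript
// Hypothetical sketch: stamp each outbound op at submit time and compute the
// delta when the same op (matched by client sequence number) is received back
// after the ordering service has sequenced/acked it. Not the real Fluid code.

interface RoundtripLogger {
    // Assumed telemetry sink; the signature is illustrative only.
    send(event: { eventName: string; duration: number }): void;
}

class OpRoundtripTracker {
    // clientSequenceNumber -> submit time in milliseconds
    private readonly pending = new Map<number, number>();

    constructor(private readonly logger: RoundtripLogger) {}

    // Call when the client submits an op.
    public onSubmit(clientSequenceNumber: number): void {
        this.pending.set(clientSequenceNumber, performance.now());
    }

    // Call when the client's own op comes back from the ordering service.
    public onAck(clientSequenceNumber: number): void {
        const start = this.pending.get(clientSequenceNumber);
        if (start === undefined) {
            return;
        }
        this.pending.delete(clientSequenceNumber);
        this.logger.send({
            eventName: "OpRoundtripTime",
            // Includes some client-side queueing overhead, as noted above.
            duration: performance.now() - start,
        });
    }
}
```

Because the duration is stamped and read on the same client, a metric like this captures both the service-side ordering time and any client-side queueing, which is the overhead mentioned above.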

Krushboo shared with me the results of the stress-test runs that your team runs, which peak at a sequenced rate of 1000 ops/s. I do not know what they run against (i.e. which tenant, whether it is isolated, etc.).
The numbers are much worse and beyond reasonable.

```kusto
database("978e115fcf9846189c84c44420044563").stress_test_error
| where TenantId contains "traces09"
| extend payload=substring(Message,22, 5000)
| extend data = parse_json(payload)
| where data.eventName contains "OpRoundtripTime"
| extend duration = toint(data.duration)
| summarize count(), toint(avg(duration)), toint(percentile(duration, 50)), toint(percentile(duration, 90)), toint(stdev(duration))
```

|           | FRS  |
| --------- | ---- |
| Average   | 1826 |
| P90       | 5615 |
| Std. dev. | 3100 |

Here are the results from the ODSP scalability run that the IDC team runs every week against a production tenant. It sustains 3000 sequenced ops/s and a much higher number of broadcast ops (i.e. the number of clients per doc is much higher).
More statistics about these runs are at the bottom of this file.
Note that it runs against a cluster of 5 front-end boxes, so the workload per box (roughly 3000 / 5 = 600 sequenced ops/s) is very likely similar to the previous run, assuming the FRS run above uses a single front-end box.

|           | ODSP |
| --------- | ---- |
| Average   | 107  |
| P90       | 126  |
| Std. dev. | 35   |

The numbers are very consistent for ODSP.
They are not consistent for FRS: the standard deviation (and, as a result, the average) is through the roof, even for trivial runs; the sketch at the end of this comment illustrates how a small tail of very slow acks produces exactly this pattern.
Please note that these high latencies and deviations cause a lot of trouble across other areas, including:

Also related:
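To make the FRS pattern above concrete, here is a toy TypeScript calculation with synthetic numbers (not measured data): if the bulk of ops ack in 100-150 ms but a few percent take multiple seconds, the average lands above the P90 and the standard deviation explodes, which matches the shape of the FRS columns.

```typescript
// Toy illustration with synthetic numbers: a small tail of very slow acks
// inflates the average and standard deviation while leaving P90 almost untouched.

function stats(samples: number[]): { avg: number; p90: number; stdev: number } {
    const sorted = [...samples].sort((a, b) => a - b);
    const avg = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
    const p90 = sorted[Math.floor(0.9 * (sorted.length - 1))];
    const variance = sorted.reduce((sum, x) => sum + (x - avg) ** 2, 0) / sorted.length;
    return { avg, p90, stdev: Math.sqrt(variance) };
}

// 97% of ops ack in 100-150 ms, 3% stall for ~10 seconds.
const samples = [
    ...Array.from({ length: 970 }, () => 100 + Math.random() * 50),
    ...Array.from({ length: 30 }, () => 10_000),
];

// Prints an average of roughly 400 ms, a P90 of ~150 ms, and a stdev well
// above 1500 ms -- the same average-above-P90 signature as the FRS data.
console.log(stats(samples));
```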

vladsud added the bug label on May 25, 2022
ghost added the status: stale label on Nov 22, 2022

ghost commented Nov 22, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

ghost closed this as completed on Nov 30, 2022