FRS op ack latencies have really high averages and standard deviation #10427

Closed
vladsud opened this issue May 25, 2022 · 1 comment
Labels: bug (Something isn't working), status: stale

Comments

vladsud (Contributor) commented May 25, 2022

We have a very small-scale test that runs as part of our post-CI flow.
It runs against FRS canary and ODSP prod, with exactly the same code and payload, at about 10 sequenced ops/s, i.e. a tiny load.
Here are the metrics from these runs:

```kusto
union office_fluid_ffautomation_*
| where Event_Time > ago(7d)
| where Data_eventName contains "OpRoundtripTime"
| where isnotnull(Data_durationOutboundQueue)
| where Data_driverType == "routerlicious"
| summarize count(), toint(avg(Data_duration)), toint(percentile(Data_duration, 90)), toint(stdev(Data_duration))
```

|           | FRS  | ODSP |
| --------- | ---- | ---- |
| Average   | 400  | 105  |
| P90       | 180  | 125  |
| Std. dev. | 2095 | 51   |

Numbers are in milliseconds and measure the end-to-end latency of an op: from the moment the client sends it until the client receives it back after the ordering service has acked it. The numbers include some client-code overhead; we have a measurement that lets us subtract it, but I'll use the same metric across the board because the data below comes from older client builds that do not yet have that change.
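For readers less familiar with the metric, here is a minimal TypeScript sketch of how an OpRoundtripTime-style measurement can be collected on the client. This is an illustrative assumption about the mechanism, not the actual Fluid Framework telemetry code; the `OpRoundtripTracker` / `RoundtripLogger` names and methods below are hypothetical.

```typescript
// Hypothetical sketch: stamp each outbound op at submit time and compute the
// delta when the same op (matched by client sequence number) is received back
// after the ordering service has sequenced/acked it. Not the real Fluid code.

interface RoundtripLogger {
    // Assumed telemetry sink; the signature is illustrative only.
    send(event: { eventName: string; duration: number }): void;
}

class OpRoundtripTracker {
    // clientSequenceNumber -> submit time in milliseconds
    private readonly pending = new Map<number, number>();

    constructor(private readonly logger: RoundtripLogger) {}

    // Call when the client submits an op.
    public onSubmit(clientSequenceNumber: number): void {
        this.pending.set(clientSequenceNumber, performance.now());
    }

    // Call when the client's own op comes back from the ordering service.
    public onAck(clientSequenceNumber: number): void {
        const start = this.pending.get(clientSequenceNumber);
        if (start === undefined) {
            return;
        }
        this.pending.delete(clientSequenceNumber);
        this.logger.send({
            eventName: "OpRoundtripTime",
            // Includes some client-side queueing overhead, as noted above.
            duration: performance.now() - start,
        });
    }
}
```

Because the duration is stamped and read on the same client, a metric like this captures both the service-side ordering time and any client-side queueing, which is the overhead mentioned above.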

Krushboo shared with me the results of the stress-test runs that your team runs, which peak at a sequenced rate of 1000 ops/s. I do not know what they run against (i.e. which tenant, whether it is isolated, etc.).
The numbers are much worse and beyond reasonable.

```kusto
database("978e115fcf9846189c84c44420044563").stress_test_error
| where TenantId contains "traces09"
| extend payload=substring(Message,22, 5000)
| extend data = parse_json(payload)
| where data.eventName contains "OpRoundtripTime"
| extend duration = toint(data.duration)
| summarize count(), toint(avg(duration)), toint(percentile(duration, 50)), toint(percentile(duration, 90)), toint(stdev(duration))
```

|           | FRS  |
| --------- | ---- |
| Average   | 1826 |
| P90       | 5615 |
| Std. dev. | 3100 |

Here are the results from the ODSP scalability run that the IDC team runs every week against a production tenant. It sustains 3000 sequenced ops/s and a much higher number of broadcast ops (i.e. the number of clients per doc is much higher).
More statistics about these runs are at the bottom of this file.
Note that it runs against a cluster of 5 front-end boxes, so the workload per box (roughly 3000 / 5 = 600 sequenced ops/s) is very likely similar to the previous run, assuming the FRS run above uses a single front-end box.

|           | ODSP |
| --------- | ---- |
| Average   | 107  |
| P90       | 126  |
| Std. dev. | 35   |

The numbers are very consistent for ODSP.
They are not consistent for FRS: the standard deviation (and, as a result, the average) is through the roof, even for trivial runs; the sketch at the end of this comment illustrates how a small tail of very slow acks produces exactly this pattern.
Please note that these high latencies and deviations cause a lot of trouble across other areas, including:

Also related:
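To make the FRS pattern above concrete, here is a toy TypeScript calculation with synthetic numbers (not measured data): if the bulk of ops ack in 100-150 ms but a few percent take multiple seconds, the average lands above the P90 and the standard deviation explodes, which matches the shape of the FRS columns.

```typescript
// Toy illustration with synthetic numbers: a small tail of very slow acks
// inflates the average and standard deviation while leaving P90 almost untouched.

function stats(samples: number[]): { avg: number; p90: number; stdev: number } {
    const sorted = [...samples].sort((a, b) => a - b);
    const avg = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
    const p90 = sorted[Math.floor(0.9 * (sorted.length - 1))];
    const variance = sorted.reduce((sum, x) => sum + (x - avg) ** 2, 0) / sorted.length;
    return { avg, p90, stdev: Math.sqrt(variance) };
}

// 97% of ops ack in 100-150 ms, 3% stall for ~10 seconds.
const samples = [
    ...Array.from({ length: 970 }, () => 100 + Math.random() * 50),
    ...Array.from({ length: 30 }, () => 10_000),
];

// Prints an average of roughly 400 ms, a P90 of ~150 ms, and a stdev well
// above 1500 ms -- the same average-above-P90 signature as the FRS data.
console.log(stats(samples));
```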

vladsud added the bug label on May 25, 2022
ghost added the status: stale label on Nov 22, 2022

ghost commented Nov 22, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

ghost closed this as completed on Nov 30, 2022