# Update telemetry for getDeltas from db #16348

## Conversation
Resolved review threads (outdated): server/routerlicious/packages/routerlicious-base/src/alfred/services/deltaService.ts
```diff
@@ -50,4 +50,6 @@ export enum LumberEventName {
 	ConnectionCountPerNode = "ConnectionCountPerNode",
 	RestoreFromCheckpoint = "RestoreFromCheckpoint",
 	ReprocessOps = "ReprocessOps",
+	GetDeltasFromDb = "GetDeltasFromDb",
+	GetDeltasFromStorage = "GetDeltasFromStorage",
```
---
The changes make sense for OSS; however, FRS is not using this piece of logic, so I don't see the benefit of the metrics added here.
FRS has a file called fallbackDeltaService.ts, which already has almost all the metrics/status you need. Maybe take a look over there?
---
Hey Xin, fallbackDeltaService in FRS extends OSS's deltaService and calls super.getDeltas to fetch the deltas from the db, so GetDeltasFromDb will be useful for both FRS and OSS. GetDeltasFromStorage also seemed similar, but I was not sure whether it is actually used; still, there's no harm in adding it as well.
---
Oh I see, that makes sense. I still don't think you need to add a metric here, because it would double the metric/log volume for each getDeltas call. Instead, you can add a property flag in fallbackDeltaService indicating whether super.getDeltas succeeded; if it didn't, fallbackDeltaService will still log the error, or you can just use a stage flag to indicate it. Doing so keeps the log volume low and keeps the analysis simple: you have everything in one metric, so you don't need to join/correlate different metrics. What do you think?
I don't think GetDeltasFromStorage is still used anywhere. Its only usage is one route, and there are no logs for that route in our Kusto for the past 30 days, so I think it is an obsolete endpoint.
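The single-metric-with-a-stage-flag idea above could be sketched like this. This is a minimal, self-contained illustration: the `Metric` class and `getDeltasWithFallback` are hypothetical stand-ins, not the real Lumberjack API or the actual service code.

```typescript
// Hypothetical minimal metric object: one GetDeltaObjectStoreFallback metric
// whose "stage" property records where the deltas came from, instead of
// emitting a second GetDeltasFromDb metric per call.
class Metric {
    public readonly properties = new Map<string, string | number>();
    constructor(public readonly eventName: string) {}
    setProperty(key: string, value: string | number): void {
        this.properties.set(key, value);
    }
}

function getDeltasWithFallback(
    fetchFromDb: () => number[] | undefined,
    fetchFromFallback: () => number[],
): { deltas: number[]; metric: Metric } {
    const metric = new Metric("GetDeltaObjectStoreFallback");
    const dbResult = fetchFromDb();
    if (dbResult !== undefined) {
        // super.getDeltas succeeded: flag the stage instead of a second metric.
        metric.setProperty("stage", "db");
        return { deltas: dbResult, metric };
    }
    // db miss: fall back, and the same single metric records that stage.
    metric.setProperty("stage", "fallback");
    return { deltas: fetchFromFallback(), metric };
}
```

With this shape, a Kusto query can filter or split on the `stage` property of one event rather than correlating two events from the same call.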
---
I think adding this metric here would help us analyse the time taken by the db call within the overall getDeltas API duration. We could still do that outside in fallbackDeltaService, but I thought this would be cleaner since we also have a similar GetDeltasFromFallback metric to capture that duration.
If log volume is a concern, I can instead add the durations for both GetDeltasFromDb and GetDeltasFromFallback as properties on the overall GetDeltaObjectStoreFallback metric and remove this metric from OSS - what do you think?
I will remove the GetDeltasFromStorage metric since it is obsolete and not needed.
---
It is up to you, but I prefer a single metric instead of doubling the metrics, because:
- Log volume: we currently already have too many logs crashing the Geneva buffer, which leads to lost logs, and it also causes Kusto ingestion delay. @hedasilv, do you have any concerns, given that the GetDeltaObjectStoreFallback log is already number 5 on the list?
- It is difficult to correlate the two metrics from the same call. But if you can figure out a better way to correlate them, I think it is fine.
Kusto query results (cluster('frspme.westus2.kusto.windows.net').database('FRS')):

| eventName | count_ |
| --- | --- |
| LogMessage | 1,417,995,212 |
| RiddlerFetchTenantKey | 926,002,068 |
| RunWithRetry | 138,144,104 |
| ScriptoriumProcessBatch | 127,897,144 |
| GetDeltaObjectStoreFallback | 84,785,599 |
| ReprocessOps | 67,206,731 |
| DisconnectDocument | 63,701,361 |
| ConnectDocument | 58,862,813 |
| ConnectDocumentGetClients | 58,857,795 |
| ConnectDocumentAddClient | 52,860,490 |
---
Hey @zhangxin511, these are my thoughts and reasoning:
1. I see the benefit of adding a new metric here to analyse the GetDeltasFromDb trend: it is faster to check and plot the direct duration of a metric than to extract it from the properties of another metric.
2. However, I also see the benefit of adding this latency to the overall GetDeltaObjectStoreFallback metric's properties, since it is easier to check/correlate for a single call.

I am going to do #2 anyway in FRS since we already have that metric. And since the number of logs is a concern here, let's not do #1 for now; we will reconsider if needed after adding #2.
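Option #2 could look roughly like the following sketch. All names here (`timeStage`, `recordStageDurations`, `durationInDbMs`, `durationInFallbackMs`) are illustrative assumptions; the real change would set these values as properties on the existing GetDeltaObjectStoreFallback Lumber metric.

```typescript
// Record per-stage durations as properties of one parent metric, so both
// latencies land in a single log row and no cross-metric join is needed.
interface StageTimings {
    durationInDbMs: number;
    durationInFallbackMs?: number; // only set when the db path missed
}

function timeStage<T>(fn: () => T): { result: T; elapsedMs: number } {
    const start = Date.now();
    const result = fn();
    return { result, elapsedMs: Date.now() - start };
}

function recordStageDurations(
    fetchFromDb: () => number[] | undefined,
    fetchFromFallback: () => number[],
): { deltas: number[]; timings: StageTimings } {
    const db = timeStage(fetchFromDb);
    if (db.result !== undefined) {
        return { deltas: db.result, timings: { durationInDbMs: db.elapsedMs } };
    }
    const fb = timeStage(fetchFromFallback);
    return {
        deltas: fb.result,
        timings: { durationInDbMs: db.elapsedMs, durationInFallbackMs: fb.elapsedMs },
    };
}
```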
Resolved review thread (outdated): server/routerlicious/packages/routerlicious-base/src/alfred/routes/api/deltas.ts
```diff
@@ -24,23 +29,42 @@ export class DeltaService implements IDeltaService {
 		documentId: string,
 		from?: number,
 		to?: number,
+		caller?: string,
```
---
Remember to pass this from fallbackDeltaService.
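Forwarding the new optional `caller` parameter from the subclass could look like this. The class bodies are simplified stand-ins for the real services; only the parameter-forwarding pattern is the point.

```typescript
// Hypothetical sketch: fallbackDeltaService forwards the new optional
// `caller` parameter through to the base class's getDeltas so the
// db-path telemetry can record who initiated the call.
class DeltaService {
    getDeltas(documentId: string, from?: number, to?: number, caller?: string): string {
        return `getDeltas(${documentId}, caller=${caller ?? "unknown"})`;
    }
}

class FallbackDeltaService extends DeltaService {
    getDeltas(documentId: string, from?: number, to?: number, caller?: string): string {
        // Without this forwarding, the base class would log caller=unknown.
        return super.getDeltas(documentId, from, to, caller ?? "fallbackDeltaService");
    }
}
```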
Description