# Update telemetry for getDeltas from db #16348

## Conversation
Resolved review threads (outdated): server/routerlicious/packages/routerlicious-base/src/alfred/services/deltaService.ts
```diff
@@ -50,4 +50,6 @@ export enum LumberEventName {
 	ConnectionCountPerNode = "ConnectionCountPerNode",
 	RestoreFromCheckpoint = "RestoreFromCheckpoint",
 	ReprocessOps = "ReprocessOps",
+	GetDeltasFromDb = "GetDeltasFromDb",
+	GetDeltasFromStorage = "GetDeltasFromStorage",
```
---
The changes make sense for OSS; however, FRS is not using this piece of logic, so I don't see the benefit of the metrics added here.
FRS has a file called fallbackDeltaService.ts, which already has almost all the metrics/status you need. Maybe take a look over there?
---
Hey Xin, fallbackDeltaService in FRS extends OSS's deltaService and calls super.getDeltas to fetch the deltas from the db, so GetDeltasFromDb will be useful for both FRS and OSS. GetDeltasFromStorage also seemed similar, but I was not sure whether it is actually used; still, there's no harm in adding it as well.
---
Oh I see, that makes sense. I still don't think you need to add a metric here, because it would double the metric/log volume for each getDeltas call. Instead, you can add a property flag in fallbackDeltaService indicating whether super.getDeltas succeeded; if it didn't, fallbackDeltaService will still log the error, or you can just use a stage flag to indicate it. Doing so keeps the log volume low and keeps the analysis simple: you have everything in one metric, so you don't need to join/correlate different metrics. What do you think?
I don't think GetDeltasFromStorage is still used anywhere. Its only usage is one route, and there are no logs for that route in our Kusto for the past 30 days, so I think it is an obsolete endpoint.
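The single-metric-with-a-stage-flag idea above could be sketched like this. This is a minimal, self-contained illustration: the `Metric` class and `getDeltasWithFallback` are hypothetical stand-ins, not the real Lumberjack API or the actual service code.

```typescript
// Hypothetical minimal metric object: one GetDeltaObjectStoreFallback metric
// whose "stage" property records where the deltas came from, instead of
// emitting a second GetDeltasFromDb metric per call.
class Metric {
    public readonly properties = new Map<string, string | number>();
    constructor(public readonly eventName: string) {}
    setProperty(key: string, value: string | number): void {
        this.properties.set(key, value);
    }
}

function getDeltasWithFallback(
    fetchFromDb: () => number[] | undefined,
    fetchFromFallback: () => number[],
): { deltas: number[]; metric: Metric } {
    const metric = new Metric("GetDeltaObjectStoreFallback");
    const dbResult = fetchFromDb();
    if (dbResult !== undefined) {
        // super.getDeltas succeeded: flag the stage instead of a second metric.
        metric.setProperty("stage", "db");
        return { deltas: dbResult, metric };
    }
    // db miss: fall back, and the same single metric records that stage.
    metric.setProperty("stage", "fallback");
    return { deltas: fetchFromFallback(), metric };
}
```

With this shape, a Kusto query can filter or split on the `stage` property of one event rather than correlating two events from the same call.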
---
I think adding this metric here would help us analyse the time taken by the db call within the overall getDeltas API duration. We could still do that outside in fallbackDeltaService, but I thought this would be cleaner since we also have a similar GetDeltasFromFallback metric to capture that duration.
If log volume is a concern, I can instead add the durations for both GetDeltasFromDb and GetDeltasFromFallback as properties on the overall GetDeltaObjectStoreFallback metric and remove this metric from OSS - what do you think?
I will remove the GetDeltasFromStorage metric since it is obsolete and not needed.
---
It is up to you, but I prefer a single metric instead of doubling the metrics, because:
- Log volume: we currently already have too many logs crashing the Geneva buffer, which leads to lost logs, and it also causes Kusto ingestion delay. @hedasilv, do you have any concerns, given that the GetDeltaObjectStoreFallback log is already number 5 on the list?
- It is difficult to correlate the two metrics from the same call. But if you can figure out a better way to correlate them, I think it is fine.
Kusto query results (cluster('frspme.westus2.kusto.windows.net').database('FRS')):

| eventName | count_ |
| --- | --- |
| LogMessage | 1,417,995,212 |
| RiddlerFetchTenantKey | 926,002,068 |
| RunWithRetry | 138,144,104 |
| ScriptoriumProcessBatch | 127,897,144 |
| GetDeltaObjectStoreFallback | 84,785,599 |
| ReprocessOps | 67,206,731 |
| DisconnectDocument | 63,701,361 |
| ConnectDocument | 58,862,813 |
| ConnectDocumentGetClients | 58,857,795 |
| ConnectDocumentAddClient | 52,860,490 |
---
Hey @zhangxin511, these are my thoughts and reasoning:
1. I see the benefit of adding a new metric here to analyse the GetDeltasFromDb trend: it is faster to check and plot the direct duration of a metric than to extract it from the properties of another metric.
2. However, I also see the benefit of adding this latency to the overall GetDeltaObjectStoreFallback metric's properties, since it is easier to check/correlate for a single call.

I am going to do #2 anyway in FRS since we already have that metric. And since the number of logs is a concern here, let's not do #1 for now; we will reconsider if needed after adding #2.
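Option #2 could look roughly like the following sketch. All names here (`timeStage`, `recordStageDurations`, `durationInDbMs`, `durationInFallbackMs`) are illustrative assumptions; the real change would set these values as properties on the existing GetDeltaObjectStoreFallback Lumber metric.

```typescript
// Record per-stage durations as properties of one parent metric, so both
// latencies land in a single log row and no cross-metric join is needed.
interface StageTimings {
    durationInDbMs: number;
    durationInFallbackMs?: number; // only set when the db path missed
}

function timeStage<T>(fn: () => T): { result: T; elapsedMs: number } {
    const start = Date.now();
    const result = fn();
    return { result, elapsedMs: Date.now() - start };
}

function recordStageDurations(
    fetchFromDb: () => number[] | undefined,
    fetchFromFallback: () => number[],
): { deltas: number[]; timings: StageTimings } {
    const db = timeStage(fetchFromDb);
    if (db.result !== undefined) {
        return { deltas: db.result, timings: { durationInDbMs: db.elapsedMs } };
    }
    const fb = timeStage(fetchFromFallback);
    return {
        deltas: fb.result,
        timings: { durationInDbMs: db.elapsedMs, durationInFallbackMs: fb.elapsedMs },
    };
}
```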
Resolved review thread (outdated): server/routerlicious/packages/routerlicious-base/src/alfred/routes/api/deltas.ts
```diff
@@ -24,23 +29,42 @@ export class DeltaService implements IDeltaService {
 		documentId: string,
 		from?: number,
 		to?: number,
+		caller?: string,
```
---
Remember to pass this from fallbackDeltaService.
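Forwarding the new optional `caller` parameter from the subclass could look like this. The class bodies are simplified stand-ins for the real services; only the parameter-forwarding pattern is the point.

```typescript
// Hypothetical sketch: fallbackDeltaService forwards the new optional
// `caller` parameter through to the base class's getDeltas so the
// db-path telemetry can record who initiated the call.
class DeltaService {
    getDeltas(documentId: string, from?: number, to?: number, caller?: string): string {
        return `getDeltas(${documentId}, caller=${caller ?? "unknown"})`;
    }
}

class FallbackDeltaService extends DeltaService {
    getDeltas(documentId: string, from?: number, to?: number, caller?: string): string {
        // Without this forwarding, the base class would log caller=unknown.
        return super.getDeltas(documentId, from, to, caller ?? "fallbackDeltaService");
    }
}
```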
Description