Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update telemetry for getDeltas from db #16348

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,7 @@ export function create(
const from = stringToSequenceNumber(request.query.from);
const to = stringToSequenceNumber(request.query.to);
const tenantId = getParam(request.params, "tenantId") || appTenants[0].id;
const caller = request.query.token?.toString().length > 0 ? "external" : "scribe";
shubhi1092 marked this conversation as resolved.
Show resolved Hide resolved

// Query for the deltas and return a filtered version of just the operations field
const deltasP = deltaService.getDeltas(
Expand All @@ -155,6 +156,7 @@ export function create(
getParam(request.params, "id"),
from,
to,
caller,
);

handleResponse(deltasP, response, undefined, 500);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,11 @@ import {
ITenantManager,
MongoManager,
} from "@fluidframework/server-services-core";
import {
BaseTelemetryProperties,
LumberEventName,
Lumberjack,
} from "@fluidframework/server-services-telemetry";

export class DeltaService implements IDeltaService {
constructor(
Expand All @@ -24,23 +29,42 @@ export class DeltaService implements IDeltaService {
documentId: string,
from?: number,
to?: number,
caller?: string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember to pass this from fallbackservice

): Promise<ISequencedDocumentMessage[]> {
// Create an optional filter to restrict the delta range
const query: any = { documentId, tenantId };
if (from !== undefined || to !== undefined) {
query["operation.sequenceNumber"] = {};
const metricProperties = {
[BaseTelemetryProperties.tenantId]: tenantId,
[BaseTelemetryProperties.documentId]: documentId,
from,
to,
caller,
};
const metric = Lumberjack.newLumberMetric(
LumberEventName.GetDeltasFromDb,
metricProperties,
);
try {
// Create an optional filter to restrict the delta range
const query: any = { documentId, tenantId };
if (from !== undefined || to !== undefined) {
query["operation.sequenceNumber"] = {};

if (from !== undefined) {
query["operation.sequenceNumber"].$gt = from;
}
if (from !== undefined) {
query["operation.sequenceNumber"].$gt = from;
}

if (to !== undefined) {
query["operation.sequenceNumber"].$lt = to;
if (to !== undefined) {
query["operation.sequenceNumber"].$lt = to;
}
}
}

const sort = { "operation.sequenceNumber": 1 };
return this.queryDeltas(collectionName, query, sort);
const sort = { "operation.sequenceNumber": 1 };
const deltas = this.queryDeltas(collectionName, query, sort);
metric.success("getDeltasFromDb succeeded");
return deltas;
} catch (err) {
metric.error("getDeltasFromDb failed", err);
throw err;
}
}

public async getDeltasFromStorage(
Expand All @@ -50,17 +74,34 @@ export class DeltaService implements IDeltaService {
fromSeq?: number,
toSeq?: number,
): Promise<ISequencedDocumentMessage[]> {
const query: any = { documentId, tenantId, scheduledDeletionTime: { $exists: false } };
query["operation.sequenceNumber"] = {};
if (fromSeq !== undefined) {
query["operation.sequenceNumber"].$gt = fromSeq;
}
if (toSeq !== undefined) {
query["operation.sequenceNumber"].$lt = toSeq;
}
const metricProperties = {
[BaseTelemetryProperties.tenantId]: tenantId,
[BaseTelemetryProperties.documentId]: documentId,
fromSeq,
toSeq,
};
const metric = Lumberjack.newLumberMetric(
LumberEventName.GetDeltasFromStorage,
metricProperties,
);
try {
const query: any = { documentId, tenantId, scheduledDeletionTime: { $exists: false } };
query["operation.sequenceNumber"] = {};
if (fromSeq !== undefined) {
query["operation.sequenceNumber"].$gt = fromSeq;
}
if (toSeq !== undefined) {
query["operation.sequenceNumber"].$lt = toSeq;
}

const sort = { "operation.sequenceNumber": 1 };
return this.queryDeltas(collectionName, query, sort);
const sort = { "operation.sequenceNumber": 1 };
const deltas = this.queryDeltas(collectionName, query, sort);
metric.success("getDeltasFromStorage succeeded");
return deltas;
} catch (err) {
metric.error("getDeltasFromStorage failed", err);
throw err;
}
}

private async queryDeltas(
Expand Down
1 change: 1 addition & 0 deletions server/routerlicious/packages/services-core/src/delta.ts
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ export interface IDeltaService {
documentId: string,
from?: number,
to?: number,
caller?: string,
): Promise<ISequencedDocumentMessage[]>;

getDeltasFromStorage(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -50,4 +50,6 @@ export enum LumberEventName {
ConnectionCountPerNode = "ConnectionCountPerNode",
RestoreFromCheckpoint = "RestoreFromCheckpoint",
ReprocessOps = "ReprocessOps",
GetDeltasFromDb = "GetDeltasFromDb",
GetDeltasFromStorage = "GetDeltasFromStorage",
Copy link
Contributor

@zhangxin511 zhangxin511 Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense for OSS, however, FRS is not using this piece of logic. Don't see the benefits the metrics added here.

FRS has a file called fallbackDeltaService.ts, which has almost all the metric/status you need. Maybe take a look over there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Xin, fallbackDeltaService in FRS extends from OSS's deltaService and calls super.getDeltas to get the deltas from db. So GetDeltasFromDb will be useful for both FRS as well as OSS. GetDeltasFromStorage also seemed similar but I was not sure if its indeed used or not, but there's no harm in adding that as well.

Copy link
Contributor

@zhangxin511 zhangxin511 Jul 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, make sense. I still don't think you need to add a metric here, the reason is it will double the metric/log numbers for each getdelta call. Instead, you can add a property flag in fallbackDeltaService, and it can indicate if it pass the super.getDeltas successfuly. If not, the fallbackDeltaService will still log the error, or just use the stage flag to indicate. Doing so will keep the log low and keep it simple to analysis (you have everything in one metric so you don't need to do the join correlate different metrics, what you think?

I don't think GetDeltasFromStorage is still being used anywhere, you can find the usage which is by one route, and there is no such route logs in our kusto for past 30 days. So I think this endpoint is an obsolete endpoint

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding this metric here would help us in analysing the time taken for the db call during the whole get deltas API duration. We can still do that outside in the fallbackDeltaService, but I thought this would be cleaner since we also have a similar metric for GetDeltasFromFallback to capture that duration.

If amount of logs is a concern, I can alternately add the duration for both GetDeltasFromDb as well as GetDeltasFromFallback to the overall GetDeltaObjectStoreFallback metric and remove this metric from OSS then - what do you think?

I will remove the GetDeltasFromStorage metric since that is obsolete and not needed.

Copy link
Contributor

@zhangxin511 zhangxin511 Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is up to you, I prefer to use single metrics instead of double the metrics due to

  1. The amount of logs, we currently already have too many logs crashing geneva buffer, which lead to lost logs. Also it leads to kusto ingestion delay. @hedasilv do you have any concerns? Given the GetDeltaObjectStoreFallback log is already number 5 on the list
  2. it is difficult to correlate the two metric from the same call. But if you can figure out a better way to correlate, I think it is fine.

Execute in [Web] [Desktop] [cluster('frspme.westus2.kusto.windows.net').database('FRS')]

FRSRouterlicious
| where eventTimestamp > ago(1d)
| summarize count() by eventName
| order by count_ desc
| take 10


Table0

eventName count_
LogMessage 1,417,995,212
RiddlerFetchTenantKey 926,002,068
RunWithRetry 138,144,104
ScriptoriumProcessBatch 127,897,144
GetDeltaObjectStoreFallback 84,785,599
ReprocessOps 67,206,731
DisconnectDocument 63,701,361
ConnectDocument 58,862,813
ConnectDocumentGetClients 58,857,795
ConnectDocumentAddClient 52,860,490

Copy link
Contributor Author

@shubhi1092 shubhi1092 Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @zhangxin511 , these are my thoughts and reasoning:

  1. I see the benefit of adding a new metric here to analyse the GetDeltasFromDb trend, its faster to check and plot the direct duration of a metric instead of extracting that from the properties of another metric.
  2. However I also see the benefit of adding this latency to the overall GetDeltaObjectStoreFallback's metric properties since its easier to check/correlate for a single call.

I am going to do # 2 anyways in FRS since we already have that metric. And since number of logs is a concern here, lets not do # 1 for now and will reconsider if needed after adding # 2.

}