Move AsyncBatchingWorkQueue usage in telemetry to TelemetryLogging level #73287
Conversation
Doing this as I noticed a large (10x) difference in the number of requestcounter and requestduration events in our dashboard. These events go through the system in slightly different fashions: requestcounter goes through standard telemetry calls on disposal, whereas requestduration goes through the aggregated telemetry logging. Both are intended to aggregate multiple calls into a single message, but the cadence at which they send telemetry differs. The requestduration path flushes both on project/VS close and every 30 minutes, whereas the requestcounter path only flushes on project/VS close. I've noticed that VS shutdown is now more abrupt than previously, often not giving our disposers a chance to send out telemetry. This is why I believe there is such a large discrepancy in the telemetry numbers for these methods, when they should be the same. This PR allows the requestcounter messages to also be sent out every 30 minutes in case the disposal code path isn't executed.
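For readers unfamiliar with the mechanism, the behavior being added boils down to "flush periodically as well as on dispose." Below is a minimal, self-contained sketch of that idea; `PeriodicTelemetryFlusher` is a hypothetical name and this is not the actual Roslyn AsyncBatchingWorkQueue-based implementation, just an illustration of the intended cadence.

```csharp
// Illustrative sketch only (not the PR's actual code): flush accumulated
// telemetry every 30 minutes in addition to flushing on dispose, so an abrupt
// shutdown that skips disposal loses at most the last 30 minutes of counts.
using System;
using System.Threading;
using System.Threading.Tasks;

internal sealed class PeriodicTelemetryFlusher : IDisposable
{
    private readonly Action _flush;
    private readonly CancellationTokenSource _cts = new();

    public PeriodicTelemetryFlusher(Action flush)
    {
        _flush = flush;
        _ = RunAsync(_cts.Token); // fire-and-forget periodic loop
    }

    private async Task RunAsync(CancellationToken token)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromMinutes(30));
        try
        {
            while (await timer.WaitForNextTickAsync(token).ConfigureAwait(false))
                _flush();
        }
        catch (OperationCanceledException)
        {
            // Shutting down; the Dispose path performs the final flush.
        }
    }

    public void Dispose()
    {
        _cts.Cancel();
        _flush(); // best-effort final flush on the disposal path
    }
}
```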
public static void SetLogProvider(ITelemetryLogProvider logProvider)
public static event EventHandler<EventArgs>? Flushed;

public static void SetLogProvider(ITelemetryLogProvider logProvider, IAsynchronousOperationListener asyncListener)
{
    s_logProvider = logProvider;
Should this throw if there's already a log provider?
done!
undid as it caused test failures
TelemetryLogging.Flushed -= OnFlushed;
}

private void OnFlushed(object? sender, EventArgs e)
I don't remember if the queries are resilient to multiple events for the same server for the same session. I assume they are since I don't think we drill down into the session in particular, but I can't remember.
The queries I know of don't drill into the session, so they are resilient. If we find some that do use the session, it seems feasible to change them to allow multiple items from a session.
}

private void OnFlushed(object? sender, EventArgs e)
{
    foreach (var kvp in _requestCounters)
    {
        TelemetryLogging.Log(FunctionId.LSP_RequestCounter, KeyValueLogMessage.Create(LogType.Trace, m =>
I was trying to figure out why these didn't use TelemetryLogging.LogAggregated, but it's because we're not using a bucket-based aggregation here, right? We're just logging pure sums.
Wondering if we should have another variant of LogAggregated that does a sum or something. But maybe a change for a later date.
Yes, that's exactly right. I considered that as part of this, but it seemed like overkill for now, especially as I know we have more upcoming work in this area as part of potentially moving towards OTel.
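For context, a sum-style variant along the lines discussed above might look roughly like this. This is purely hypothetical: `SummingTelemetryLog` is not an existing type, and a real version would live behind TelemetryLogging rather than stand alone.

```csharp
// Hypothetical sketch of a sum-style aggregation: callers report deltas, and a
// single event per metric carrying the running total is emitted at flush time.
using System;
using System.Collections.Concurrent;

internal sealed class SummingTelemetryLog
{
    private readonly ConcurrentDictionary<string, long> _sums = new();

    public void Add(string metricName, long delta)
        => _sums.AddOrUpdate(metricName, delta, (_, current) => current + delta);

    public void Flush(Action<string, long> postEvent)
    {
        foreach (var (metricName, total) in _sums)
            postEvent(metricName, total);
        _sums.Clear();
    }
}
```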
@@ -112,5 +135,19 @@ public static void LogAggregated(FunctionId functionId, KeyValueLogMessage logMe
public static void Flush()
{
    s_logProvider?.Flush();

    Flushed?.Invoke(null, EventArgs.Empty);
Not quite sure I get the point of the events.
public const string KeyName = "Name";
public const string KeyValue = "Value";
public const string KeyLanguageName = "LanguageName";
public const string KeyMetricName = "MetricName";

public static void SetLogProvider(ITelemetryLogProvider logProvider)
public static event EventHandler<EventArgs>? Flushed;
What is the event for?
RequestTelemetryLogger not only uses TelemetryLogging for aggregated telemetry, but also accumulates its own counts that it sends to telemetry itself.
RequestTelemetryLogger uses Dispose both to notify this object to Flush and to fire the telemetry counts it has been accumulating. However, Dispose isn't a reliable mechanism by which to fire telemetry, as the process might be terminated before we get the opportunity to act.
To handle that, the aggregating telemetry code previously fired off telemetry every 30 minutes, using an ABWQ. This PR moves that out to the TelemetryLogging level, but it's still not hooked into the telemetry that fires due to the accounting in RequestTelemetryLogger.
This event allows RequestTelemetryLogger to hook into that ABWQ-driven telemetry firing, so that when it happens RequestTelemetryLogger can also fire off the telemetry it has been accumulating.
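In code terms, the hook-up described above looks roughly like this. It is a simplified sketch of the pattern, not the exact Roslyn source; `ReportAccumulatedCounters` is a placeholder for the counter-posting logic.

```csharp
// Simplified sketch: the logger flushes its own accumulated counts whenever
// TelemetryLogging's periodic flush fires, so it no longer depends solely on
// Dispose running before the process exits.
using System;

internal sealed class RequestTelemetryLogger : IDisposable
{
    public RequestTelemetryLogger()
    {
        TelemetryLogging.Flushed += OnFlushed;
    }

    private void OnFlushed(object? sender, EventArgs e)
        => ReportAccumulatedCounters(); // send the counts gathered so far

    public void Dispose()
    {
        TelemetryLogging.Flushed -= OnFlushed;
        ReportAccumulatedCounters(); // best-effort final report on disposal
    }

    private void ReportAccumulatedCounters()
    {
        // Placeholder: post FunctionId.LSP_RequestCounter events here.
    }
}
```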
Flush();

return ValueTaskFactory.CompletedTask;
return ImmutableInterlocked.GetOrAdd(ref _aggregatingLogs, functionId, functionId => new AggregatingTelemetryLog(_session, functionId, bucketBoundaries));
How often is this called? Consider using a static lambda here to avoid allocations.
called a lot, nice catch!
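For reference, the allocation-free shape suggested here would look roughly like the snippet below. It is a sketch based on the diff above; the tuple packing and the `id`/`arg` parameter names are illustrative, not the exact change that was committed.

```csharp
// Sketch: pass the captured state through GetOrAdd's factory-argument overload
// so the lambda can be static and allocates no closure on the hot path.
return ImmutableInterlocked.GetOrAdd(
    ref _aggregatingLogs,
    functionId,
    static (id, arg) => new AggregatingTelemetryLog(arg.session, id, arg.bucketBoundaries),
    (session: _session, bucketBoundaries));
```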
{
    s_logProvider = logProvider;

    if (s_postTelemetryQueue is null)
💡 Use an InterlockedOperations.Initialize helper here.
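For readers unfamiliar with the helper being suggested, an Initialize-style helper generally reduces to a CompareExchange wrapper. Here is a standalone sketch of that idea; it is not the actual Roslyn helper, whose exact shape may differ.

```csharp
// Standalone sketch of an Initialize-style helper: publish the candidate value
// only if the field is still null, and return whichever instance won the race.
using System.Threading;

internal static class InterlockedOperationsSketch
{
    public static T Initialize<T>(ref T? target, T value) where T : class
        => Interlocked.CompareExchange(ref target, value, null) ?? value;
}
```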
if (null == Interlocked.CompareExchange(ref s_postTelemetryQueue, postTelemetryQueue, null))
{
    // We created the work queue in use. Add an item into it to kick things off.
    s_postTelemetryQueue?.AddWork();
💡 It's fine if this gets called more than once. The work queue will aggregate duplicate requests into a single one.
2) As AddWork batches work together, there's no need to protect against multiple calls.
3) Get rid of the closure allocation in a commonly called telemetry method.
Doing this as I noticed a large (10x) difference in the number of requestcounter and requestduration events in our dashboard. These events go through the system in slightly different fashions: requestcounter goes through standard telemetry calls on disposal, whereas requestduration goes through the aggregated telemetry logging.
Both are intended to aggregate multiple calls into a single message, but the cadence at which they send telemetry differs. They both flush on project/VS close, but the requestduration method also flushes every 30 minutes.
I've noticed that VS shutdown is now more abrupt than previously, often not giving our disposers a chance to send out telemetry. This is why I believe there is such a large discrepancy in the telemetry numbers for these methods, when they should be the same. This PR allows the requestcounter messages to also be sent out every 30 minutes in case the disposal code path isn't executed.