Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move AsyncBatchingWorkQueue usage in telemetry to TelemetryLogging level #73287

Merged

Conversation

ToddGrun
Copy link
Contributor

@ToddGrun ToddGrun commented May 1, 2024

Doing this as I noticed a large (10x) difference in the number of requestcounter and requestduration events in our dashboard. These events go through the system in slightly different fashions, requestcounter goes through standard telemetry calls on disposal whereas requestduration goes through the aggregated telemetry logging.

Both of these are intended to aggregate multiple calls into a single message, but the cadence at which they send telemetry differs. They both flush on project/VS close, but the requestduration method also flushes every 30 minutes.

I've noticed that VS shutdown is now more abrupt than previously, often not giving our disposers a chance to send out telemetry. This is why I believe there is such a large discepency in the telemetry numbers for these methods, when they should be the same. This PR allows for the requestcounter messages to also be sent out every 30 minutes in case the disposal codepath isn't executed.

image

Doing this as I noticed a large (10x) difference in the number of requestcounter and requestduration events in our dashboard. These events go through the system in slightly different fashions, requestcounter goes through standard telemetry calls on disposal whereas requestduration goes through the aggregated telemetry logging.

Both of these are intended to aggregate multiple calls into a single message, but the cadence at which they send telemetry differs. The requestduration method will flush both on project/VS close and every 30 minutes, whereas the requestcounter method only flushes on project/VS close.

I've noticed that VS shutdown is now more abrupt than previously, often not giving our disposers a chance to send out telemetry. This is why I believe there is such a large discepency in the telemetry numbers for these methods, when they should be the same. This PR allows for the requestcounter messages to also be sent out every 30 minutes in case the disposal codepath isn't executed.
@ToddGrun ToddGrun requested a review from dibarbet May 1, 2024 00:21
@ToddGrun ToddGrun requested a review from a team as a code owner May 1, 2024 00:21
@dotnet-issue-labeler dotnet-issue-labeler bot added Area-IDE untriaged Issues and PRs which have not yet been triaged by a lead labels May 1, 2024
public static void SetLogProvider(ITelemetryLogProvider logProvider)
public static event EventHandler<EventArgs>? Flushed;

public static void SetLogProvider(ITelemetryLogProvider logProvider, IAsynchronousOperationListener asyncListener)
{
s_logProvider = logProvider;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this throw if there's already a log provider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undid as it caused test failures

TelemetryLogging.Flushed -= OnFlushed;
}

private void OnFlushed(object? sender, EventArgs e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember if the queries are resilient to multiple events for the same server for the same session. I assume they are since I don't think we drill down into the session in particular, but I can't remember.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The queries that I know don't drill into the session, so they are resilient. If we find some that do use the session, it seems feasible to change them to allow multiple items from a session.

}

private void OnFlushed(object? sender, EventArgs e)
{
foreach (var kvp in _requestCounters)
{
TelemetryLogging.Log(FunctionId.LSP_RequestCounter, KeyValueLogMessage.Create(LogType.Trace, m =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to figure out why these didn't use TelemetryLogging.LogAggregated - but its because we're not using a bucket based aggregation here right? We're just logging pure sums.

Wondering if we should have another variant of the LogAggregated that does a sum or something. But maybe a change for a later date.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly right. I considered that as part of this, but it seemed like overkill for now. Especially, as I know we have more upcoming work in this area as part of potentially moving towards OTel.

@@ -112,5 +135,19 @@ public static void LogAggregated(FunctionId functionId, KeyValueLogMessage logMe
public static void Flush()
{
s_logProvider?.Flush();

Flushed?.Invoke(null, EventArgs.Empty);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not quite sure i get the point of the events.


public const string KeyName = "Name";
public const string KeyValue = "Value";
public const string KeyLanguageName = "LanguageName";
public const string KeyMetricName = "MetricName";

public static void SetLogProvider(ITelemetryLogProvider logProvider)
public static event EventHandler<EventArgs>? Flushed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the event for?

Copy link
Contributor Author

@ToddGrun ToddGrun May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RequestTelemetryLogger not only uses TelemetryLogging for aggregated telemetry, but also accumulation it's own counts that it will send to telemetry itself.

RequestTelemetryLogger uses Dispose to both notify this object to Flush and also handles firing the telemetry counts it is accumulating. However, Dispose isn't a reliable mechanism by which to fire telemetry, as the process might be terminated before we get the opportunity to act.

To handle that, the aggregating telemetry code previously fired off telemetry every 30 minutes, using an ABWQ. This PR moves that out to the TelemetryLogging level, but it's still not hooked into the telemetry that fires due to the accounting in RequestTelemetryLogger.

This event allows RequestTelemetryLogger to hook into when that ABWQ derived telemetry firing is happening, and when it does RequestTelemetryLogger can fire off the telemetry that it is accounting for.

Flush();

return ValueTaskFactory.CompletedTask;
return ImmutableInterlocked.GetOrAdd(ref _aggregatingLogs, functionId, functionId => new AggregatingTelemetryLog(_session, functionId, bucketBoundaries));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often is this called? Consider using a static lambda here to avoid allocations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called a lot, nice catch!

{
s_logProvider = logProvider;

if (s_postTelemetryQueue is null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Use a InterlockedOperations.Initialize helper here.

if (null == Interlocked.CompareExchange(ref s_postTelemetryQueue, postTelemetryQueue, null))
{
// We created the work queue in use. Add an item into it to kick things off.
s_postTelemetryQueue?.AddWork();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 It's fine if this gets called more than once. The work queue will aggregate duplicate requests into a single one.

2) As AddWork batches work together, no need to protect against multiple calls
3) Get rid of closure allocation in commonly called telemetry method
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-IDE untriaged Issues and PRs which have not yet been triaged by a lead
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants