xds: Remove xds authority label from metric registration #11760

Merged
merged 2 commits into from
Dec 21, 2024

DNVindhya
Contributor

@DNVindhya DNVindhya commented Dec 17, 2024

XdsClient metrics were added in #11661, but that change does not yet provide a value for the grpc.xds.authority label on the grpc.xds_client.resources gauge.

Although the label value is absent, the label key was still added when registering the gauge. As a result, invoking the callback throws java.lang.IllegalArgumentException: Incorrect number of required labels provided. The stack trace is below:

00:25:09.808 [PeriodicMetricReader-1] WARN  i.o.s.m.i.state.CallbackRegistration - An exception occurred invoking callback for CallbackRegistration{instrumentDescriptors=[InstrumentDescriptor{name=grpc.xds_client.resources, description=EXPERIMENTAL.  Number of xDS resources., unit={resource}, type=OBSERVABLE_GAUGE, valueType=LONG, advice=Advice{explicitBucketBoundaries=null, attributes=null}}, InstrumentDescriptor{name=grpc.xds_client.connected, description=EXPERIMENTAL. Whether or not the xDS client currently has a working ADS stream to the xDS server. For a given server, this will be set to 1 when the stream is initially created. It will be set to 0 when we have a connectivity failure or when the ADS stream fails without seeing a response message, as per gRFC A57. Once set to 0, it will be reset to 1 when we receive the first response on an ADS stream., unit={bool}, type=OBSERVABLE_GAUGE, valueType=LONG, advice=Advice{explicitBucketBoundaries=null, attributes=null}}]}.
java.lang.IllegalArgumentException: Incorrect number of required labels provided. Expected: 4
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:191)
    at io.grpc.MetricRecorder$BatchRecorder.recordLongGauge(MetricRecorder.java:138)
    at io.grpc.internal.MetricRecorderImpl$BatchRecorderImpl.recordLongGauge(MetricRecorderImpl.java:199)
    at io.grpc.xds.XdsClientMetricReporterImpl$MetricReporterCallback.reportResourceCountGauge(XdsClientMetricReporterImpl.java:205)
    at io.grpc.xds.XdsClientMetricReporterImpl.lambda$computeAndReportResourceCounts$1(XdsClientMetricReporterImpl.java:172)
    at java.base/java.util.HashMap.forEach(HashMap.java:1337)
    at io.grpc.xds.XdsClientMetricReporterImpl.computeAndReportResourceCounts(XdsClientMetricReporterImpl.java:171)
    at io.grpc.xds.XdsClientMetricReporterImpl.reportCallbackMetrics(XdsClientMetricReporterImpl.java:146)
    at io.grpc.xds.XdsClientMetricReporterImpl$1.accept(XdsClientMetricReporterImpl.java:123)
    at io.grpc.internal.MetricRecorderImpl.lambda$registerBatchCallback$0(MetricRecorderImpl.java:177)
    at io.opentelemetry.sdk.metrics.internal.state.CallbackRegistration.invokeCallback(CallbackRegistration.java:84)
    at io.opentelemetry.sdk.metrics.SdkMeter.collectAll(SdkMeter.java:112)
    at io.opentelemetry.sdk.metrics.SdkMeterProvider$LeasedMetricProducer.produce(SdkMeterProvider.java:208)
    at io.opentelemetry.sdk.metrics.SdkMeterProvider$SdkCollectionRegistration.collectAllMetrics(SdkMeterProvider.java:232)
    at io.opentelemetry.sdk.metrics.export.PeriodicMetricReader$Scheduled.doRun(PeriodicMetricReader.java:161)
    at io.opentelemetry.sdk.metrics.export.PeriodicMetricReader$Scheduled.run(PeriodicMetricReader.java:153)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

As a fix, we are removing the label from the gauge registration until the grpc.xds.authority label value is available.
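
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the gRPC implementation: `GaugeSketch`, `LabelMismatchDemo`, and `tryRecord` are hypothetical names, and the three label keys other than grpc.xds.authority, along with all label values, are assumptions chosen for illustration. It only shows why registering four required label keys while recording three values trips a count check like the one in the stack trace, and why dropping the unused key from registration avoids it.

```java
import java.util.Arrays;
import java.util.List;

/** Demo driver (hypothetical); declared first so `java LabelMismatchDemo.java` runs it. */
final class LabelMismatchDemo {
  /** Records one value and returns "ok", or the exception message if the check fails. */
  static String tryRecord(GaugeSketch gauge, List<String> labelValues) {
    try {
      gauge.record(1L, labelValues);
      return "ok";
    } catch (IllegalArgumentException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    // Three label values are available at record time (values are made up).
    List<String> values = Arrays.asList("xds:///example", "acked", "listener");

    // Before the fix: four keys registered, including one with no value to record.
    // Key names other than grpc.xds.authority are assumptions for illustration.
    GaugeSketch before = new GaugeSketch("grpc.xds_client.resources",
        Arrays.asList("grpc.target", "grpc.xds.authority",
            "grpc.xds.cache_state", "grpc.xds.resource_type"));

    // After the fix: the key without a value is no longer registered.
    GaugeSketch after = new GaugeSketch("grpc.xds_client.resources",
        Arrays.asList("grpc.target", "grpc.xds.cache_state", "grpc.xds.resource_type"));

    System.out.println(tryRecord(before, values)); // count check fails: 3 values, 4 keys
    System.out.println(tryRecord(after, values));  // prints "ok"
  }
}

/** Hypothetical stand-in for a registered gauge; this is not a gRPC class. */
final class GaugeSketch {
  final String name;
  final List<String> requiredLabelKeys;

  GaugeSketch(String name, List<String> requiredLabelKeys) {
    this.name = name;
    this.requiredLabelKeys = requiredLabelKeys;
  }

  /** Requires exactly one value per registered key, as the reported exception implies. */
  void record(long value, List<String> requiredLabelValues) {
    if (requiredLabelValues.size() != requiredLabelKeys.size()) {
      throw new IllegalArgumentException(
          "Incorrect number of required labels provided. Expected: "
              + requiredLabelKeys.size());
    }
    // A real recorder would export the value here; the sketch only validates.
  }
}
```

In this sketch the first call fails with the same message as the reported exception, and the second succeeds because the registered keys and the recorded values line up again.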

…resources` gauge, until the label value is available to record.
@DNVindhya DNVindhya requested a review from ejona86 December 17, 2024 18:22
Member

@ejona86 ejona86 left a comment

Who would experience the crash? Why didn't tests notice?

@ejona86
Member

ejona86 commented Dec 17, 2024

Oh, this is a failure only within the gauge callback, which runs on some arbitrary thread. So the failure is either just logged or it prevents OTel from publishing, but gRPC itself would appear operational.

@ejona86 ejona86 added the TODO:backport label Dec 17, 2024
@DNVindhya
Contributor Author

DNVindhya commented Dec 18, 2024

> Who would experience the crash? Why didn't tests notice?

Cloud Bigtable reported this issue. I have captured the details they provided in b/384738716.

We didn't catch it in tests because we mocked BatchRecorder.

@ejona86
Member

ejona86 commented Dec 18, 2024

> We didn't catch it in tests because we mocked BatchRecorder.

We should use delegatesTo() with the mocks in tests, so the default checks are applied, then.

@DNVindhya
Contributor Author

> > We didn't catch it in tests because we mocked BatchRecorder.
>
> We should use delegatesTo() with the mocks in tests, so the default checks are applied, then.

Updated the unit test to use delegatesTo().
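
To illustrate why a delegating mock catches this kind of bug, here is a stdlib-only sketch built on `java.lang.reflect.Proxy`. It is an analogue of the idea, not the test code from this PR: `Recorder`, `plainDouble`, `delegatingDouble`, and `rejects` are hypothetical names, and the interface is a simplified stand-in for a recorder whose default method validates label counts. With Mockito, the corresponding pattern would be `mock(Type.class, AdditionalAnswers.delegatesTo(realInstance))`.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DelegatingDoubleDemo {
  /** Hypothetical recorder interface whose default method validates its arguments. */
  interface Recorder {
    int expectedLabels();

    default void record(long value, List<String> labelValues) {
      if (labelValues.size() != expectedLabels()) {
        throw new IllegalArgumentException("Incorrect number of required labels provided.");
      }
    }
  }

  /** Plain double: every call gets a default answer, so the default-method check never runs. */
  static Recorder plainDouble() {
    return (Recorder) Proxy.newProxyInstance(
        Recorder.class.getClassLoader(),
        new Class<?>[] {Recorder.class},
        (proxy, method, args) -> method.getReturnType() == int.class ? (Object) 0 : null);
  }

  /** Delegating double: records each call, then forwards it to a real instance. */
  static Recorder delegatingDouble(Recorder real, List<String> calls) {
    return (Recorder) Proxy.newProxyInstance(
        Recorder.class.getClassLoader(),
        new Class<?>[] {Recorder.class},
        (proxy, method, args) -> {
          calls.add(method.getName());
          try {
            return method.invoke(real, args);
          } catch (InvocationTargetException e) {
            throw e.getCause(); // surface the real exception, not the reflection wrapper
          }
        });
  }

  /** Returns true if recording with the given label values is rejected. */
  static boolean rejects(Recorder recorder, List<String> labelValues) {
    try {
      recorder.record(1L, labelValues);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    List<String> tooFew = Arrays.asList("a", "b"); // the real recorder expects 3
    Recorder real = () -> 3;
    List<String> calls = new ArrayList<>();

    System.out.println("plain double rejects bad call: "
        + rejects(plainDouble(), tooFew));
    System.out.println("delegating double rejects bad call: "
        + rejects(delegatingDouble(real, calls), tooFew));
    System.out.println("recorded calls: " + calls);
  }
}
```

The plain double accepts the malformed call silently, which is how a mismatch like the one in this PR can slip past a test. The delegating double still records the interaction for verification but lets the real argument check run.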

@DNVindhya DNVindhya merged commit 6516c73 into grpc:master Dec 21, 2024
11 of 12 checks passed
@DNVindhya DNVindhya deleted the xds-authority-label branch December 21, 2024 03:50
DNVindhya added a commit to DNVindhya/grpc-java that referenced this pull request Jan 14, 2025
* Remove `grpc.xds.authority` label while registering `grpc.xds_client.resources` gauge, until the label value is available to record.
@DNVindhya DNVindhya removed the TODO:backport label Jan 15, 2025