add metric for count of RPC requests #15515

tgross · 2022-12-09T18:32:07Z

Implement a metric for RPC requests with labels on the identity, so that administrators can monitor the source of requests within the cluster. This changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire up the remaining RPCs once we've threaded the new pre-forwarding authentication through them all (#15513 plus follow-up work).

Note that metrics are measured after we forward but before we return any authentication error. This ensures that we only emit metrics on the server that actually serves the request. We'll perform rate limiting at the same place.
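The ordering described in the note above (forward, then measure, then return any authentication error) can be sketched roughly as follows. This is a minimal illustration and not the code in this changeset: the `server` type, the `counter` map, the `whoAmI` method, and the identity strings are all hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// counter is a hypothetical stand-in for a real metrics sink,
// keyed by the identity label.
type counter map[string]int

// server is a hypothetical, heavily simplified RPC server.
type server struct {
	isLeader bool
	leader   *server
	requests counter
}

var errPermissionDenied = errors.New("Permission denied")

// whoAmI sketches the ordering: forward first, then measure, then
// surface any authentication error.
func (s *server) whoAmI(identity string, authErr error) error {
	// 1. Forward. A server that forwards returns here and never
	// emits the metric itself.
	if !s.isLeader && s.leader != nil {
		return s.leader.whoAmI(identity, authErr)
	}

	// 2. Measure on the server that actually serves the request,
	// even if authentication failed.
	s.requests[identity]++

	// 3. Only now return the authentication error, if any.
	return authErr
}

func main() {
	leader := &server{isLeader: true, requests: counter{}}
	follower := &server{leader: leader, requests: counter{}}

	_ = follower.whoAmI("token:abc", nil)
	err := follower.whoAmI("anonymous", errPermissionDenied)

	// The follower recorded nothing; the leader recorded both requests,
	// including the one that failed authentication.
	fmt.Println(len(follower.requests), leader.requests["token:abc"], leader.requests["anonymous"], err)
}
```

In this sketch a rate limiter would sit at step 2, since that is the one place where every served request passes with a known identity.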

shoenig · 2022-12-09T20:56:54Z

@tgross mind rebasing? Looks like this got wrecked by #15512

nomad/rpc_rate_metrics.go

shoenig

LGTM!

jrasell

LGTM!

tgross · 2022-12-14T16:24:40Z

Just leaving a note that I'm seriously reconsidering this approach following the investigation that led to #15523 and #15522, so I'm holding off on merging this until we can have an internal discussion about it. Resolved below.

tgross · 2023-01-04T16:35:53Z

nomad/rpc_rate_metrics.go

+	if rpcCtx == nil {
+		return // we're the RPC caller and not the server
+	}


In some local testing I'm seeing this comment isn't quite accurate because it hides RPCs that are coming in from HTTP to the host that's taking the metric; I need to debug this a bit more to verify but we might end up dropping this conditional.

Is the intention to only count RPCs where they're actually served and skip counting them when forwarded? It definitely changes how we can use these metrics, because forwarding could really skew them (1-2x for writes, 1-2x for non-stale reads, and a completely arbitrary amount due to federation). Relative comparisons, either between metrics or of a single metric over time, could even be skewed by a leadership election causing a 2x bump, as a write or non-stale read to a particular server now always gets forwarded.

So the intention from the RFC was to count whenever an RPC is served by a server or forwarded by a server (because that lets you detect problems like uneven distribution of RPCs from client nodes). If we only cared about metrics on the server that actually serves the request, we could dramatically simplify the problem because auth could happen after checking for forwarding. But I'm pretty sure we do want metrics pre-forwarding?

My intent with this conditional around `rpcCtx == nil` was that we probably don't want to track "static endpoint" RPCs where ex. the deploymentwatcher running on the leader is using RPCs as normal function calls. Unfortunately that also means that if you send an HTTP request to a server, the server that serves the HTTP request won't count it because there's no `rpcCtx` in that case (ex. if the HTTP request hits the leader).

That being said, I think the right approach here is to remove this conditional and just count those static endpoint calls. The identity will make it clear they're internal calls anyways.
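To make the trade-off concrete, here is a small sketch of what the `rpcCtx == nil` conditional hides and what removing it exposes. The `rpcContext` and `recorder` types, the `skipNilCtx` switch, and the identity strings are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// rpcContext is a hypothetical stand-in for the per-connection RPC
// context. It is nil when a handler is invoked as a plain function
// call, for example from the local HTTP API or an internal watcher.
type rpcContext struct {
	remoteAddr string
}

// recorder is a hypothetical in-memory metrics sink keyed by identity.
type recorder struct {
	counts     map[string]int
	skipNilCtx bool // true reproduces the conditional under review
}

// measure counts one RPC for the given identity.
func (r *recorder) measure(ctx *rpcContext, identity string) {
	if r.skipNilCtx && ctx == nil {
		return // treated as "we're the caller, not the server"
	}
	r.counts[identity]++
}

func main() {
	for _, skip := range []bool{true, false} {
		r := &recorder{counts: map[string]int{}, skipNilCtx: skip}

		// RPC arriving over the wire from a client node.
		r.measure(&rpcContext{remoteAddr: "10.0.0.5:4647"}, "client:node-1")
		// HTTP request handled by this same server: no RPC context.
		r.measure(nil, "token:abc")
		// Internal static-endpoint call: no RPC context either.
		r.measure(nil, "leader")

		fmt.Printf("skipNilCtx=%v counts=%v\n", skip, r.counts)
	}
}
```

With the conditional in place, both the HTTP-originated request and the internal call disappear from the counts. Without it, both are counted, and the identity label is what distinguishes the internal call.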

> My intent with this conditional around rpcCtx == nil was that we probably don't want to track "static endpoint" RPCs where ex. the deploymentwatcher running on the leader is using RPCs as normal function calls.

Ah, that makes sense. I'd love to stop using those static endpoints anyway. I can't imagine there's not a better way to share that logic internally. Perhaps we should file an issue around this (e.g. "Remove static endpoints to avoid counting them toward metrics" or something) and ship this as-is.

Yeah the only consumers of the static endpoints now are deploymentwatcher and volumewatcher (following the work we did in #15451). But I bet we could extract some inner functions for the handful of methods we need and then inject those functions instead of the RPC handler. I'll open an issue for that.
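The "extract inner functions and inject them" idea might look roughly like the sketch below. None of these names come from the Nomad codebase: `updateRequest`, `updateFn`, `watcher`, and `state` are hypothetical, and the real watchers need more than one method.

```go
package main

import (
	"errors"
	"fmt"
)

// updateRequest is a hypothetical request type for this sketch.
type updateRequest struct {
	DeploymentID string
	Status       string
}

// updateFn is the narrow function the watcher depends on, in place of
// a full RPC endpoint.
type updateFn func(req updateRequest) error

// watcher is a hypothetical stand-in for an internal watcher.
type watcher struct {
	update updateFn
}

// newWatcher injects the inner function rather than an RPC handler.
func newWatcher(fn updateFn) *watcher {
	return &watcher{update: fn}
}

func (w *watcher) markFailed(id string) error {
	return w.update(updateRequest{DeploymentID: id, Status: "failed"})
}

// state is a hypothetical stand-in for whatever the RPC handler wraps.
type state struct {
	statuses map[string]string
}

// applyStatusUpdate is the extracted inner function. The RPC handler
// would call it after auth and metrics; the watcher calls it directly.
func (s *state) applyStatusUpdate(req updateRequest) error {
	if req.DeploymentID == "" {
		return errors.New("missing deployment ID")
	}
	s.statuses[req.DeploymentID] = req.Status
	return nil
}

func main() {
	st := &state{statuses: map[string]string{}}
	w := newWatcher(st.applyStatusUpdate)

	if err := w.markFailed("d1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(st.statuses["d1"])
}
```

Because the watcher never goes through an RPC handler in this arrangement, its calls would not reach the metrics code at all, which sidesteps the question of whether to count them.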

tgross · 2023-01-05T20:14:29Z

Will need to rebase this on #15513 once that's been merged, as there are some changes that impact this one.

shoenig · 2023-01-23T15:10:32Z

test failures include

  acl_test.go:268: expected matching error strings
  ↪ msg: "ACL token expired"
  ↪ err: "rpc error: ACL token expired"

tgross · 2023-01-23T15:21:12Z

> test failures include

~~Bah, that keeps breaking/unbreaking as I've done rebases with the other PRs for this topic (like #15513). Will fix.~~ Done.

tgross · 2023-01-24T15:54:57Z

Had to rebase this on the merged #15513 and I've got a telemetry config test I need to fix, and that should wrap this up.

This changeset configures the RPC rate metrics that were added in #15515 for all the RPCs that support authenticated HTTP API requests. These endpoints were already configured with pre-forwarding authentication in #15870, and a handful of others were done already as part of the proof-of-concept work. So this changeset is entirely copy-and-pasting one method call into a whole mess of handlers. Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining set of RPCs that have no API consumers or aren't authenticated, in smaller chunks that can be more thoughtfully reviewed.

tgross requested review from schmichael, jrasell and shoenig December 9, 2022 18:33

tgross added the type/enhancement label Dec 9, 2022

tgross added this to the 1.5.0 milestone Dec 9, 2022

tgross marked this pull request as ready for review December 9, 2022 18:35

tgross changed the title ~~add metric for rate of RPC requests~~ add metric for count of RPC requests Dec 9, 2022

tgross force-pushed the rpc-rate-metrics branch from 853626a to d9738e6 Compare December 9, 2022 21:20

shoenig reviewed Dec 9, 2022

shoenig approved these changes Dec 9, 2022

tgross force-pushed the rpc-rate-metrics branch from d9738e6 to 7a2872d Compare December 9, 2022 21:52

jrasell approved these changes Dec 12, 2022

tgross force-pushed the rpc-rate-metrics branch from 7a2872d to d96bcdd Compare January 4, 2023 15:18

tgross commented Jan 4, 2023

schmichael approved these changes Jan 5, 2023

tgross mentioned this pull request Jan 5, 2023

remove remaining static RPC endpoints #15702

Open

tgross force-pushed the rpc-rate-metrics branch from 2afd240 to 81c7c1e Compare January 10, 2023 18:37

tgross force-pushed the rpc-rate-metrics branch from 81c7c1e to 107a766 Compare January 11, 2023 15:43

tgross force-pushed the rpc-rate-metrics branch from 107a766 to 34dd5c0 Compare January 20, 2023 20:17

tgross added 3 commits January 24, 2023 10:53

add telemetry configuration to omit identity labels

0c748d8

fixup tests from rebase on main

2b249a3

fix telemetry parsing test

a28b2e1

tgross force-pushed the rpc-rate-metrics branch from b5379c7 to a28b2e1 Compare January 24, 2023 15:58

tgross merged commit bcd5bbd into main Jan 24, 2023

tgross deleted the rpc-rate-metrics branch January 24, 2023 16:54

tgross mentioned this pull request Jan 25, 2023

metrics: measure rate of RPC requests that serve API #15876

Merged
