Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metric for count of RPC requests #15515

Merged
merged 4 commits into from
Jan 24, 2023
Merged

add metric for count of RPC requests #15515

merged 4 commits into from
Jan 24, 2023

Conversation

tgross
Copy link
Member

@tgross tgross commented Dec 9, 2022

Implement a metric for RPC requests with labels on the identity, so that administrators can monitor the source of requests within the cluster. This changeset demonstrates the change with the new ACL.WhoAmI RPC, and we'll wire up the remaining RPCs once we've threaded the new pre-forwarding authentication through them all (#15513 plus follow-up work)

Note that metrics are measured after we forward but before we return any authentication error. This ensures that we only emit metrics on the server that actually serves the request. We'll perform rate limiting at the same place.

@tgross tgross added this to the 1.5.0 milestone Dec 9, 2022
@tgross tgross marked this pull request as ready for review December 9, 2022 18:35
@tgross tgross changed the title add metric for rate of RPC requests add metric for count of RPC requests Dec 9, 2022
@shoenig
Copy link
Contributor

shoenig commented Dec 9, 2022

@tgross mind rebasing? Looks like this got wrecked by #15512

nomad/rpc_rate_metrics.go Outdated Show resolved Hide resolved
Copy link
Contributor

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgross
Copy link
Member Author

tgross commented Dec 14, 2022

Just leaving a note that I'm having strong re-consideration of this approach following the investigation that's led to #15523 and #15522, so I'm holding off on merging this until we can have an internal discussion around it. Resolved below.

Comment on lines 15 to 17
if rpcCtx == nil {
return // we're the RPC caller and not the server
}
Copy link
Member Author

@tgross tgross Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some local testing I'm seeing this comment isn't quite accurate because it hides RPCs that are coming in from HTTP to the host that's taking the metric; I need to debug this a bit more to verify but we might end up dropping this conditional.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention to only count RPCs where they're actually served and skip counting them when forwarded? It definitely changes how we can use these metrics if as forwarding could really skew metrics (1-2x for writes, 1-2x for non-stale reads, completely arbitrary amount due to federation). Relative comparisons either between metrics or a single-metric-over-time could even be skewed by a leadership election causing a 2x bump as a write or non-stale read to a particular server now always gets forwarded.

Copy link
Member Author

@tgross tgross Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the intention from the RFC was to count whenever an RPC is served by a server or forwarded by a server (because that let's you detect problems like uneven distribution of RPCs from client nodes). If we only cared about metrics on the server that actually serves the request, we could dramatically simplify the problem because auth could happen after checking for forwarding. But I'm pretty sure we do want metrics pre-forwarding?

My intent with this conditional around rpcCtx == nil was that we probably don't want to track "static endpoint" RPCs where ex. the deploymentwatcher running on the leader is using RPCs as normal function calls. Unfortunately that also means that if you send a HTTP request to a server, the server that serves the HTTP request won't count it because there's no rpcCtx in that case (ex. if the HTTP request hits the leader).

That being said, I think the right approach here is to remove this conditional and just count those static endpoint calls. The identity will make it clear they're internal calls anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent with this conditional around rpcCtx == nil was that we probably don't want to track "static endpoint" RPCs where ex. the deploymentwatcher running on the leader is using RPCs as normal function calls.

Ah that makes sense. I'd love to stop using those static endpoints anyway. I can't imagine there's not a better way to share that logic internally. Perhaps we should file an issue around this (eg Remove static endpoints to avoid counting them toward metrics or something) and ship this as-is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the only consumers of the static endpoints now are deploymentwatcher and volumewatcher (following the work we did in #15451). But I bet we could extract some inner functions for the handful of methods we need and then inject those functions instead of the RPC handler. I'll open an issue for that.

@tgross
Copy link
Member Author

tgross commented Jan 5, 2023

Will need to rebase this on #15513 once that's been merged, as there's some changes that impact this one

@shoenig
Copy link
Contributor

shoenig commented Jan 23, 2023

test failures include

  acl_test.go:268: expected matching error strings
  ↪ msg: "ACL token expired"
  ↪ err: "rpc error: ACL token expired"

@tgross
Copy link
Member Author

tgross commented Jan 23, 2023

test failures include

Bah, that keeps breaking/unbreaking as I've done rebases with the other PRs for this topic (like #15513). Will fix. Done.

Implement a metric for RPC requests with labels on the identity, so that
administrators can monitor the source of requests within the cluster. This
changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire
up the remaining RPCs once we've threaded the new pre-forwarding authentication
through the all.

Note that metrics are measured after we forward but before we return any
authentication error. This ensures that we only emit metrics on the server that
actually serves the request. We'll perform rate limiting at the same place.
@tgross
Copy link
Member Author

tgross commented Jan 24, 2023

Had to rebase this on the merged #15513 and I've got a telemetry config test I need to fix, and that should wrap this up.

@tgross tgross merged commit bcd5bbd into main Jan 24, 2023
@tgross tgross deleted the rpc-rate-metrics branch January 24, 2023 16:54
tgross added a commit that referenced this pull request Jan 25, 2023
This changeset configures the RPC rate metrics that were added in #15515 to all
the RPCs that support authenticated HTTP API requests. These endpoints already
configured with pre-forwarding authentication in #15870, and a handful of others
were done already as part of the proof-of-concept work. So this changeset is
entirely copy-and-pasting one method call into a whole mess of handlers.

Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining
set of RPCs that have no API consumers or aren't authenticated, in smaller
chunks that can be more thoughtfully reviewed.
tgross added a commit that referenced this pull request Jan 25, 2023
This changeset configures the RPC rate metrics that were added in #15515 to all
the RPCs that support authenticated HTTP API requests. These endpoints already
configured with pre-forwarding authentication in #15870, and a handful of others
were done already as part of the proof-of-concept work. So this changeset is
entirely copy-and-pasting one method call into a whole mess of handlers.

Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining
set of RPCs that have no API consumers or aren't authenticated, in smaller
chunks that can be more thoughtfully reviewed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants