Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use updated onPreAuth from Platform #71552

Merged
merged 7 commits into from
Jul 14, 2020

Conversation

jfsiii
Copy link
Contributor

@jfsiii jfsiii commented Jul 13, 2020

Summary

Track the number of open requests to certain routes and reject them with a 429 if we are above that limit.

The limit is controlled via the new config value xpack.ingestManager.fleet.maxConcurrentConnections. As a new flag it won't be immediately available in Cloud UI. I'll double check the timeline, but it's possible it won't be available for 7.9.

This is implemented with the updated onPreAuth platform lifecycle from #70775 and code/ideas from #70495 & https://github.com/jfsiii/kibana/pull/4

We add a global interceptor (meaning it's run on every request, not just those for our plugin) which uses a single counter to track total open connections to any route which includes the LIMITED_CONCURRENCY_ROUTE_TAG tag. The counter will increment when the request starts and decrement on close. The only routes which have this logic are the Fleet enroll & ack routes; not checkin

Open Questions

  1. What features must we have in before Feature Freeze?
  2. How do we set expose the flag to users? What name? Should it be under xpack.ingestManager.fleet?
  3. What's a good default value the limit? Can/should we set set it based on the characteristics of the container?

Answers

  1. feature exposed via flag, only applies to enroll & ack routes
  2. went with xpack.ingestManager.fleet.maxConcurrentConnections
  3. off by default and we can give advice for better values after more testing)

Checklist

PR vs now from a PR last week. Will update with up-to-date comparison
Screen Shot 2020-07-09 at 12 40 06 PM

@jfsiii jfsiii added the Team:Fleet Team label for Observability Data Collection Fleet team label Jul 13, 2020
@jfsiii jfsiii requested a review from a team July 13, 2020 21:37
@elasticmachine
Copy link
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

return tags.includes(LIMITED_CONCURRENCY_ROUTE_TAG);
}

const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should this be configurable? Do we expose as a config flag?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ to having it as a config option, it fits nicely under xpack.ingestManager.fleet

the default value of 250 seems low to me, but I'm not up to date on our load testing results

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indeed seems low, how did you get to that number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolving this since we've switched to 0 (off) by default

import { KibanaRequest, LifecycleResponseFactory, OnPreAuthToolkit } from 'kibana/server';
import { LIMITED_CONCURRENCY_ROUTE_TAG } from '../../common';

class MaxCounter {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can handle this any number of ways. I just chose something that allowed the logic to be defined outside the route handler

@jfsiii jfsiii requested review from roncohen, ph and a team July 13, 2020 21:44
const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS);

export function preAuthHandler(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I test this handler? If so, can someone point me in the right direction :) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for testing, maybe looking at routes/package_config/handlers.test.ts would be helpful? it has usage of httpServerMock.createKibanaRequest for creating a request object, that method has options for creating it with tags, etc.

then I think you could mock LIMITED_CONCURRENCY_MAX_REQUESTS to 1 and mock toolkit.next() to resolve after waiting for a second. then kicking off multiple calls to the handler inside a Promise.all should return an array of of responses whose status codes are [200, 503, 503, ...]

return tags.includes(LIMITED_CONCURRENCY_ROUTE_TAG);
}

const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ to having it as a config option, it fits nicely under xpack.ingestManager.fleet

the default value of 250 seems low to me, but I'm not up to date on our load testing results

const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS);

export function preAuthHandler(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to name this something Fleet or agent-related to emphasize that this handler is for those routes


counter.increase();

// requests.events.aborted$ has a bug where it's fired even when the request completes...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can we decrease the counter if/when the bug is fixed? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some discussion about this starting in #70495 (comment)

It's a "bug" but there is a test to ensure it works. There's also a possibility Platform will add a request.events.completed$ which we can use instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a "bug" but there is a test to ensure it works. There's also a possibility Platform will add a request.events.completed$ which we can use instead

This is entirely pedantic, it's not really a bug as much as an "implementation detail" per #70495 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS);

export function preAuthHandler(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for testing, maybe looking at routes/package_config/handlers.test.ts would be helpful? it has usage of httpServerMock.createKibanaRequest for creating a request object, that method has options for creating it with tags, etc.

then I think you could mock LIMITED_CONCURRENCY_MAX_REQUESTS to 1 and mock toolkit.next() to resolve after waiting for a second. then kicking off multiple calls to the handler inside a Promise.all should return an array of of responses whose status codes are [200, 503, 503, ...]

@jfsiii
Copy link
Contributor Author

jfsiii commented Jul 14, 2020

@elasticmachine merge upstream

@ph
Copy link
Contributor

ph commented Jul 14, 2020

@jfsiii Can you clarify how it work?

Any route which includes the LIMITED_CONCURRENCY_ROUTE_TAG tag will be counted. The routes are currently certain Fleet Enroll, Ack, and Checkin routes.

What is the actual behavior here? does it:

  • Set a hard limit to the number of simultaneous connected client? Like a hard limit X, not more than X clients can keep an open connection to the checkin endpoint.
  • Or it actually prevents bust and throttles them?

@nchaulet could you take a look?

@jfsiii
Copy link
Contributor Author

jfsiii commented Jul 14, 2020

@ph I updated the description to include more details

@roncohen
Copy link
Contributor

thanks @jfsiii!

  1. PR description says 429 but the code says 503 AFAICT. Let's change the code to produce 429s.
  2. Let's remove the "retry-after" header and implement exponential backoff in the agent for all these endpoints cc @michalpristas
  3. I understand it's a global request limit of 250 counting all the endpoints together. Since we're doing long polling, if you enroll more than LIMITED_CONCURRENCY_MAX_REQUESTS agents, all requests to any of the endpoints will fail after that. The behavior I think we want is that after an agent is checked-in and long-polling, we decrement the counter to allow for more agents to check-in. We're not trying to limit the amount of agents that can be long-polling, just trying to protect against a surge of requests coming in a the same time.
  4. I think we might need to keep separate counters and perhaps separate limits for the the endpoints eventually, but let's test more first.
  5. After applying (3), I suspect we need to set the limits lower. More testing is needed, but I'd set it to something like 50 for now.

@jfsiii
Copy link
Contributor Author

jfsiii commented Jul 14, 2020

thanks, @roncohen

  1. PR description says 429 but the code says 503 AFAICT. Let's change the code to produce 429s.

I updated the description. Can you say more about why 429 vs 503? I think a 503 is more accurate as it's a temporary system/server issue, not one with the request. A 429 would be if they were exceeding our API request rate, but we don't have one.

  1. Let's remove the "retry-after" header and implement exponential backoff in the agent for all these endpoints cc @michalpristas

Adding the Retry-After header doesn't initiate or enforce anything. It's there as an indication of how long the server will be busy. It still up to the clients when they retry.

  1. I think we might need to keep separate counters and perhaps separate limits for the the endpoints eventually, but let's test more first.

I wondered the same thing and figured we could come back to it.

Thanks for (3). I'll ping people in the code comments.


// requests.events.aborted$ has a bug where it's fired even when the request completes...
// we can take advantage of this bug just for load testing...
request.events.aborted$.toPromise().then(() => counter.decrease());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment says this decrements the counter when a request completes. However, /checkin is using long-polling so I don't think the request "completes" in the same way.

As @roncohen mentions in (3) in #71552 (comment), I think the counter will increment with each /checkin but not decrement.

It seems like we could revert to using core.http.registerOnPreResponse to decrement during a checkin response , but the reason we moved away from that was it missed aborted/failed connections #70495 (comment) and as the first line says requests.events.aborted$ has a bug where it's fired even when the request completes so that will result in double-decrementing.

Perhaps we could do some checking in both the response and aborted handler to see if theres a 429/503 like the initial POC https://github.com/elastic/kibana/pull/70495/files#diff-f3ee51f84520a4eb03ca041ff4c3d9a2R182

cc @kobelb @nchaulet

Copy link
Contributor

@roncohen roncohen Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfsiii I think we need to do the decrementing inside the check-in route when the actual work of checking in the agent has completed and the request transitions to long polling mode. Could the interceptor augment the context passed to the route such that the route can decrement when it sees fit? And then we need to ensure that a given request can only be decremented once, so it doesn't get decremented again when the response finally gets send back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @roncohen here, I think the behavior is to "protect from multiple connection" until enough agents have transitioned to long polling where their usage become low.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the interceptor augment the context passed to the route such that the route can decrement when it sees fit?

I don't believe that it can at the moment... I think we have two options here

  1. Don't use the pre-auth handler to reply with a 429 for these long-polling routes. If we're doing all of the rate-limiting within the route handler itself, this becomes possible without any changes from the Kibana platform.
  2. Make changes to the Kibana platform to allow us to correlate the request seen in the pre-auth handler to the request within the route handler and the other lifecycle methods. This would allow us to only decrement the counter once by using a WeakSet. @joshdover, is there some other primitive or concept within the Kibana platform that we could use to fulfill this behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. is there some other primitive or concept within the Kibana platform that we could use to fulfill this behavior?

Not that is exposed currently. I am working on adding support for an id property on KibanaRequest that would be stable across lifecycle methods (#71019), but that will not make it for 7.9. I'm also not sure it would be safe to use for this use-case because the id could come from a proxy that is sending an X-Opaque-Id header which we can't guarantee will be unique? Maybe we need an internal ID that is unique that we can rely on?

We do keep a reference to the original Hapi request on the KibanaRequest object but it's been very purposefully hidden so that plugins can't abuse it.

@roncohen
Copy link
Contributor

Can you say more about why 429 vs 503? I think a 503 is more accurate as it's a temporary system/server issue, not one with the request. A 429 would be if they were exceeding our API request rate, but we don't have one.

The way I see it, 503 means the service is unavailable. In this case, it's not actually unavailable, it's just that there are currently too many requests going on to specific routes and the server is opting to reject some of those requests to make sure it can deal properly with the ones that are ongoing and allow other kinds of request (from the user browsing the UI for example) to proceed. So on the contrary, the service is alive and well because it's protecting itself ;)

If there's no connection to Elasticsearch, Kibana will return a 503 AFAIK. There are probably other scenarios where Kibana returns a 503. If we used 503s here, someone looking at HTTP response code graphs for Kibana will not be able to tell whether there's a significant problem like Elasticsearch being unavailable or it's just the protection from this PR being activated.

Finally, 429 doesn't have to mean that a particular user/account is making too many requests. The spec says:

Note that this specification does not define how the origin server
identifies the user, nor how it counts requests. For example, an
origin server that is limiting request rates can do so based upon
counts of requests on a per-resource basis, across the entire server,
or even among a set of servers. Likewise, it might identify the user
by its authentication credentials, or a stateful cookie.

@ph
Copy link
Contributor

ph commented Jul 14, 2020

I didn't know that Kibana could return a 503, in that case going with 429 might be a better solution, and this is the exact behavior with Elasticsearch when too many bulk requests are made to the server.

@jfsiii jfsiii requested a review from jen-huang July 14, 2020 15:50
@jfsiii
Copy link
Contributor Author

jfsiii commented Jul 14, 2020

@elasticmachine merge upstream

Copy link
Contributor

@jen-huang jen-huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a smoke test locally enrolling 1 agent, collecting system metrics, and deploying an updated config, everything looks good.

Giving 👍 due to time constraints, but I would like to see test coverage added for limited_concurrency.ts. At the minimum validate that we can get 429 status codes, and also validate that the counter decreases accordingly.

@jfsiii
Copy link
Contributor Author

jfsiii commented Jul 14, 2020

@jen-huang thanks! I put a test to confirm we don't add the preAuth handler unless we have a config.

We'll see what CI holds, but this is what I got locally

 PASS  plugins/ingest_manager/server/routes/limited_concurrency.test.ts
  registerLimitedConcurrencyRoutes
    ✓ doesn't call registerOnPreAuth if maxConcurrentConnections is 0 (10ms)
    ✓ calls registerOnPreAuth once if maxConcurrentConnections is 1 (5ms)
    ✓ calls registerOnPreAuth once if maxConcurrentConnections is 1000 (3ms)

I'll try to add some more (like "do we only process the correct routes?") before merging, but I'll create a link for follow tests and link it when I merge this.

ph added a commit to ph/beats that referenced this pull request Jul 14, 2020
When enrolling and the server currently handle to many concurrent
request it will return a 429 status code. The enroll subcommand will
retry to enroll with an exponential backoff. (Init 15sec and max 10mins)

This also adjust the backoff logic in the ACK.

Requires: elastic/kibana#71552
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@jfsiii jfsiii merged commit 04cdb5a into elastic:master Jul 14, 2020
ph added a commit to elastic/beats that referenced this pull request Jul 14, 2020
#19918)

* [Elastic Agent] Handle 429 response from the server and adjust backoff

When enrolling and the server currently handle to many concurrent
request it will return a 429 status code. The enroll subcommand will
retry to enroll with an exponential backoff. (Init 15sec and max 10mins)

This also adjust the backoff logic in the ACK.

Requires: elastic/kibana#71552

* changelog

* Change values
ph added a commit to ph/beats that referenced this pull request Jul 14, 2020
elastic#19918)

* [Elastic Agent] Handle 429 response from the server and adjust backoff

When enrolling and the server currently handle to many concurrent
request it will return a 429 status code. The enroll subcommand will
retry to enroll with an exponential backoff. (Init 15sec and max 10mins)

This also adjust the backoff logic in the ACK.

Requires: elastic/kibana#71552

* changelog

* Change values

(cherry picked from commit 2db2152)
ph added a commit to elastic/beats that referenced this pull request Jul 14, 2020
#19918) (#19926)

* [Elastic Agent] Handle 429 response from the server and adjust backoff

When enrolling and the server currently handle to many concurrent
request it will return a 429 status code. The enroll subcommand will
retry to enroll with an exponential backoff. (Init 15sec and max 10mins)

This also adjust the backoff logic in the ACK.

Requires: elastic/kibana#71552

* changelog

* Change values

(cherry picked from commit 2db2152)
jfsiii pushed a commit that referenced this pull request Jul 15, 2020
* Use updated onPreAuth from Platform

* Add config flag. Increase default value.

* Set max connections flag default to 0 (disabled)

* Don't use limiting logic on checkin route

* Confirm preAuth handler only added when max > 0

Co-authored-by: Elastic Machine <[email protected]>

Co-authored-by: Elastic Machine <[email protected]>
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this pull request Oct 14, 2020
elastic#19918)

* [Elastic Agent] Handle 429 response from the server and adjust backoff

When enrolling and the server currently handle to many concurrent
request it will return a 429 status code. The enroll subcommand will
retry to enroll with an exponential backoff. (Init 15sec and max 10mins)

This also adjust the backoff logic in the ACK.

Requires: elastic/kibana#71552

* changelog

* Change values
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v7.9.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants