
[Task manager] Parallelise task execution #48053

Merged

Conversation


@gmmorris gmmorris commented Oct 12, 2019

Summary

This PR includes three key changes:

  1. Run tasks as soon as they have been marked as running, rather than waiting for the whole batch to be marked
  2. Use a custom refresh setting of refresh: false where possible, in place of wait_for, in order to speed up Task Manager's internal workflow (a hypothetical sketch follows this list)
  3. Instrument Task Manager to expose activity/inactivity metrics in performance test runs
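
To illustrate change 2 (a hypothetical sketch only, not this PR's code; the index name, document id, and document shape are made up):

```ts
import { Client } from '@elastic/elasticsearch';

// Hedged illustration: marking a task as running without forcing a refresh.
async function markTaskAsRunning(client: Client, taskId: string) {
  await client.update({
    index: '.kibana_task_manager',
    id: taskId,
    body: { doc: { task: { status: 'running' } } },
    // Previously refresh: 'wait_for' blocked until the change was searchable;
    // refresh: false returns immediately and lets the next refresh pick it up.
    refresh: false,
  });
}
```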

closes #49628

Checklist

Use strikethroughs to remove checklist items you don't feel are applicable to this PR.

Note for Reviewers

The addition of Middleware.beforeMarkRunning is there just for performance testing these changes, and I plan on stripping it out before integrating into master.
So unless someone makes an argument for keeping this middleware, feel free to skip that part of the changes in your review and make your LGTM contingent on the removal of this addition.

For maintainers

@elasticmachine

💔 Build Failed

@elasticmachine

💔 Build Failed

@elasticmachine

💔 Build Failed

Member

@pmuellr pmuellr left a comment


Not sure my review is worth an LGTM/approval since I'm very unfamiliar with the TM code. Basically I just reviewed the changes, and went back to look at the full source a few times to help get myself familiar with things.

Probably the most important thing I noticed was a setTimeout() without a try/catch, which in theory could cause Kibana to exit with no diagnostic logged on the error (I've seen this in my own plugins :-) ).
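
To illustrate the failure mode (a generic sketch, not the actual Task Manager code; the function and logger shape here are made up):

```ts
// An error escaping a timer callback has no caller to catch it, so it surfaces
// as an uncaught exception / unhandled rejection and can take the Kibana
// process down without a useful log line. Wrapping the body guards against that.
function schedulePoll(poll: () => Promise<void>, logger: { error(msg: string): void }) {
  setTimeout(async () => {
    try {
      await poll();
    } catch (err) {
      logger.error(`Polling for work failed: ${err instanceof Error ? err.message : String(err)}`);
    }
  }, 1000);
}
```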

Made a bunch of other nit comments or things to discuss later, but in general LGTM.

x-pack/legacy/plugins/task_manager/task_poller.ts (outdated, resolved)
x-pack/legacy/plugins/task_manager/task_pool.ts (outdated, resolved)
}
return TaskPoolRunResult.RanOutOfCapacity;
Member

Was curious what happens with RanOutOfCapacity returns, and ... not much, besides some perf_hooks timing. And presumably useful for tests. Probably worthy of a logger debug message, with the number of leftover tasks.

Kinda relatedly, I wondered what happens to these tasks - I guess they get "claimed" but not run, so will eventually get recognized as claimable again later, perhaps via the claimOwnershipUntil property I saw above (30s). Seems like this sort of thing should be noted in the README, but it looks like the README is kinda outdated (and not changed in this PR).

Contributor Author

@gmmorris gmmorris Oct 29, 2019

Hmm, that's not actually true... and I made a similar mistake when I first read this code. It shows this code is still a little too side-effect-y and complex; I want to make it easier to read after the perf work is done.

Prior to this PR this just returned a boolean value and you'd have to figure out what that meant, so the main change here is that we use an enum, which I feel makes it easier to understand.

Anyway, to clarify what actually happens:
When RanOutOfCapacity is returned, it causes fill_pool to end the cycle and wait until the next scheduled interval.
If it were to return one of the other options, the while(true) loop would cycle, leading to another inlined poll for work, meaning less dead time.
This logic predates me, but my understanding is that the idea is to let Task Manager "rest" before polling for more work, as all the workers are busy and can't pick up work anyway.
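
For reference, a rough sketch of the cycle described above (RanOutOfCapacity is the real enum member from the diff; everything else here is simplified and approximated, not the actual fill_pool source):

```ts
interface ClaimedTask {
  id: string;
}

enum TaskPoolRunResult {
  RunningAllClaimedTasks = 'RunningAllClaimedTasks', // assumed name for the "kept up" case
  RanOutOfCapacity = 'RanOutOfCapacity',
}

async function fillPool(
  fetchAvailableTasks: () => Promise<ClaimedTask[]>,
  run: (tasks: ClaimedTask[]) => Promise<TaskPoolRunResult>
): Promise<void> {
  while (true) {
    const tasks = await fetchAvailableTasks();
    if (tasks.length === 0) {
      // Nothing claimable right now; wait for the next scheduled interval.
      return;
    }
    if ((await run(tasks)) === TaskPoolRunResult.RanOutOfCapacity) {
      // All workers are busy: end the cycle and "rest" until the next
      // scheduled poll, since more work couldn't be picked up anyway.
      return;
    }
    // Otherwise loop again immediately and claim more work (less dead time).
  }
}
```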

Member

Ah ... for some reason, I thought RanOutOfCapacity meant that it had ended up claiming a task, but no available worker to run it, so ... oh well, will get picked up later.

I think updating the README describing how this works would be helpful.

As I come to understand this stuff better (thx for the insights!), I have some wacky ideas - like, why wait for the full fill_pool interval if you find, while processing the tasks, that you have a lot of available workers? The assumption being that a fair number of tasks will complete before the interval runs again. The "notification" actions - slack, email, webhook, pagerDuty - will all be pretty quick (just an http request), but OTOH I would expect notification actions to actually be queued to run pretty infrequently (who wants 100 slack messages per second? 😄).

We should probably start running the new SIEM alerts/signals stuff (or are you already?), see what the perf_hooks tell us - #49451

Contributor Author

I haven't looked at SIEM at all, definitely worth trying.

Regarding the fill_pool, unless I misunderstood what you're asking: we shouldn't wait if we have available workers, as the loop will repeat. We only end the loop (and hence wait the interval) if we have no workers available.

x-pack/legacy/plugins/task_manager/task_pool.ts (outdated, resolved)
yarn.lock (outdated, resolved)
yarn.lock (outdated, resolved)
@@ -18,6 +18,25 @@
*/

module.exports = (_, options = {}) => {
const overrides = [];
Member

nit: I'm stoked to see usage of perf_hooks in the code, but not crazy about the implementation. Probably worth a discussion on how we might do this better. E.g., I don't think it should change the build bits - that seems fragile; perhaps it should be a Kibana config value. There are also a lot of literal strings in performance.mark() and performance.measure(). Both of those points lead me to believe we should wrap the perf_hooks bits in a class/interface, somehow.
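
One possible shape for such a wrapper (a sketch only; the class, method names, and config flag are invented - only Node's perf_hooks API is real):

```ts
import { performance } from 'perf_hooks';

// Hypothetical wrapper so Task Manager doesn't repeat literal strings in
// performance.mark()/performance.measure() calls, and so instrumentation can
// be toggled by a config value instead of by changing the build.
class TaskManagerPerf {
  constructor(private readonly enabled: boolean) {}

  markStart(name: string) {
    if (this.enabled) performance.mark(`${name}_start`);
  }

  markEndAndMeasure(name: string) {
    if (!this.enabled) return;
    performance.mark(`${name}_end`);
    performance.measure(name, `${name}_start`, `${name}_end`);
  }
}

// Usage sketch (config flag is hypothetical):
// const perf = new TaskManagerPerf(config.enablePerformanceTracking);
// perf.markStart('markTaskAsRunning');
// /* ...do the work... */
// perf.markEndAndMeasure('markTaskAsRunning');
```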

Contributor Author

There shouldn't actually be a need to remove them at all.
We're removing them just because we don't need them in prod and figured it would make people feel more comfortable about including them in TM for the perf test.

I can see the value in using a wrapper to keep TM's code a little neater... I'll revisit that after merging as I feel this branch has lived too long and merge conflicts are piling up.

Contributor

I agree we should find another solution for this - I really don't like the idea of modifying the build based on an environment variable, and this will need to change soon as we work towards hermetic builds.

@@ -655,9 +728,10 @@ describe('TaskManagerRunner', () => {
await runner.run();

if (shouldBeValid) {
sinon.assert.notCalled(logger.warn);
expect(logger.warn).not.toHaveBeenCalled();
Member

Checking logger.warn() for calls seems very fragile, but it also seems like it will be fairly obvious what's going on if it does end up breaking (because some other warning occurred in between, or whatever). And it's been like this forever, I guess :-)
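
One way to make this a bit less fragile (a sketch only; the warning text and logger shape are invented) is to assert on the specific warning rather than on bare call counts:

```ts
// Assert on the content of the warning rather than just whether logger.warn
// was ever called, so an unrelated warning doesn't break the test.
const logger = { debug: jest.fn(), warn: jest.fn(), error: jest.fn() };

test('warns when the task returns an invalid result', async () => {
  // ...run the code under test with the mocked logger...
  logger.warn('invalid task result'); // stand-in for the call made by the runner

  expect(logger.warn).toHaveBeenCalledWith(
    expect.stringContaining('invalid task result')
  );
});
```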

Contributor Author

Haha yeah, I thought the same, but chose just to get rid of the sinon stuff as it was making things difficult to change.

I'll see if it's worth cleaning up as part of this.

Member

Like I said, it seems like it would be fairly obvious if it did break, and if that's true, not sure it's worth much time fixing. Something for a backlog card somewhere, probably never to be fixed hehehe.

Member

@mistic mistic left a comment

The changes in the files owned by the operations team look good to me.

@elasticmachine

💔 Build Failed

@elasticmachine

💚 Build Succeeded

Contributor

@mikecote mikecote left a comment

Code LGTM. I haven't looked into x-pack/test/plugin_api_perf, but maybe that's something @bmcconaghy would be better at reviewing?

@gmmorris
Contributor Author

@elasticmachine merge upstream

@elasticmachine

💚 Build Succeeded

@elasticmachine

💚 Build Succeeded

@gmmorris
Contributor Author

gmmorris commented Nov 5, 2019

@elasticmachine merge upstream

@elasticmachine

💚 Build Succeeded

@tylersmalley
Contributor

A heads up, this is failing the release manager: https://groups.google.com/a/elastic.co/d/msg/build-release-manager/QAhLJh5OecE/JGRa2EAvAQAJ

@tylersmalley
Contributor

We believe this will be resolved by #50090
