
[Task manager] Parallelise task execution #48053

Merged

Conversation


@gmmorris gmmorris commented Oct 12, 2019

Summary

This PR includes three key changes:

  1. Run tasks as soon as they have been marked as running, rather than waiting for the whole batch to be marked
  2. Use a custom refresh setting of refresh: false where possible, in place of wait_for, in order to speed up Task Manager's internal workflow (a hypothetical sketch follows this list)
  3. Instrument Task Manager to expose activity/inactivity metrics in performance test runs
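
To illustrate change 2 (a hypothetical sketch only, not this PR's code; the index name, document id, and document shape are made up):

```ts
import { Client } from '@elastic/elasticsearch';

// Hedged illustration: marking a task as running without forcing a refresh.
async function markTaskAsRunning(client: Client, taskId: string) {
  await client.update({
    index: '.kibana_task_manager',
    id: taskId,
    body: { doc: { task: { status: 'running' } } },
    // Previously refresh: 'wait_for' blocked until the change was searchable;
    // refresh: false returns immediately and lets the next refresh pick it up.
    refresh: false,
  });
}
```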

closes #49628

Checklist

Use strikethroughs to remove checklist items you don't feel are applicable to this PR.

Note for Reviewers

The addition of Middleware.beforeMarkRunning is there just for performance testing these changes, and I plan on stripping it out before integrating into master.
So unless someone makes an argument for keeping this middleware, feel free to skip that part of the changes in your review and make your LGTM contingent on the removal of this addition.

For maintainers

@elasticmachine

💔 Build Failed

@elasticmachine

💔 Build Failed

@elasticmachine

💔 Build Failed

Member

@pmuellr pmuellr left a comment


Not sure my review is worth an LGTM/approval since I'm very unfamiliar with the TM code. Basically I just reviewed the changes, and went back to look at the full source a few times to help get myself familiar with things.

Probably the most important thing I noticed was a setTimeout() without a try/catch, which in theory could cause Kibana to exit with no diagnostic logged on the error (I've seen this in my own plugins :-) ).
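
To illustrate the failure mode (a generic sketch, not the actual Task Manager code; the function and logger shape here are made up):

```ts
// An error escaping a timer callback has no caller to catch it, so it surfaces
// as an uncaught exception / unhandled rejection and can take the Kibana
// process down without a useful log line. Wrapping the body guards against that.
function schedulePoll(poll: () => Promise<void>, logger: { error(msg: string): void }) {
  setTimeout(async () => {
    try {
      await poll();
    } catch (err) {
      logger.error(`Polling for work failed: ${err instanceof Error ? err.message : String(err)}`);
    }
  }, 1000);
}
```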

Made a bunch of other nit comments or things to discuss later, but in general LGTM.

x-pack/legacy/plugins/task_manager/task_poller.ts (outdated, resolved)
x-pack/legacy/plugins/task_manager/task_pool.ts (outdated, resolved)
}
return TaskPoolRunResult.RanOutOfCapacity;
Member

Was curious what happens with RanOutOfCapacity returns, and ... not much, besides some perf_hooks timing. And presumably useful for tests. Probably worthy of a logger debug message, with the number of leftover tasks.

Kinda relatedly, I wondered what happens to these tasks - I guess they get "claimed" but not run, so will eventually get recognized as claimable again later, perhaps via the claimOwnershipUntil property I saw above (30s). Seems like this sort of thing should be noted in the README, but it looks like the README is kinda outdated (and not changed in this PR).

Contributor Author

@gmmorris gmmorris Oct 29, 2019

Hmm, that's not actually true... and I made a similar mistake when I first read this code. It shows this code is still a little too side-effect-y and complex; I want to make it easier to read after the perf work is done.

Prior to this PR this just returned a boolean value and you'd have to figure out what that meant, so the main change here is that we use an enum, which I feel makes it easier to understand.

Anyway, to clarify what actually happens:
When RanOutOfCapacity is returned, it causes fill_pool to end the cycle and wait until the next scheduled interval.
If it were to return one of the other options, the while(true) loop would cycle, leading to another inlined poll for work, meaning less dead time.
This logic predates me, but my understanding is that the idea is to let Task Manager "rest" before polling for more work, as all the workers are busy and can't pick up work anyway.
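
For reference, a rough sketch of the cycle described above (RanOutOfCapacity is the real enum member from the diff; everything else here is simplified and approximated, not the actual fill_pool source):

```ts
interface ClaimedTask {
  id: string;
}

enum TaskPoolRunResult {
  RunningAllClaimedTasks = 'RunningAllClaimedTasks', // assumed name for the "kept up" case
  RanOutOfCapacity = 'RanOutOfCapacity',
}

async function fillPool(
  fetchAvailableTasks: () => Promise<ClaimedTask[]>,
  run: (tasks: ClaimedTask[]) => Promise<TaskPoolRunResult>
): Promise<void> {
  while (true) {
    const tasks = await fetchAvailableTasks();
    if (tasks.length === 0) {
      // Nothing claimable right now; wait for the next scheduled interval.
      return;
    }
    if ((await run(tasks)) === TaskPoolRunResult.RanOutOfCapacity) {
      // All workers are busy: end the cycle and "rest" until the next
      // scheduled poll, since more work couldn't be picked up anyway.
      return;
    }
    // Otherwise loop again immediately and claim more work (less dead time).
  }
}
```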

Member

Ah ... for some reason, I thought RanOutOfCapacity meant that it had ended up claiming a task, but no available worker to run it, so ... oh well, will get picked up later.

I think updating the README describing how this works would be helpful.

As I come to understand this stuff better (thx for the insights!), I have some wacky ideas - like, why wait for the full fill_pool interval if you find, while processing the tasks, that you have a lot of available workers? The assumption being that a fair number of tasks will complete before the interval runs again. The "notification" actions - slack, email, webhook, pagerDuty - will all be pretty quick (just an http request), but OTOH I would expect notification actions to actually be queued to run pretty infrequently (who wants 100 slack messages per second? 😄).

We should probably start running the new SIEM alerts/signals stuff (or are you already?), see what the perf_hooks tell us - #49451

Contributor Author

I haven't looked at SIEM at all, definitely worth trying.

Regarding the fill_pool, unless I misunderstood what you're asking: we shouldn't wait if we have available workers, as the loop will repeat. We only end the loop (and hence wait the interval) if we have no workers available.

x-pack/legacy/plugins/task_manager/task_pool.ts (outdated, resolved)
yarn.lock (outdated, resolved)
yarn.lock (outdated, resolved)
@@ -18,6 +18,25 @@
*/

module.exports = (_, options = {}) => {
const overrides = [];
Member

nit: I'm stoked to see usage of perf_hooks in the code, but not crazy about the implementation. Probably worth a discussion on how we might do this better. E.g., I don't think it should change the build bits - that seems fragile; perhaps it should be a Kibana config value. There are also a lot of literal strings in performance.mark() and performance.measure(). Both of those points lead me to believe we should wrap the perf_hooks bits in a class/interface, somehow.
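
One possible shape for such a wrapper (a sketch only; the class, method names, and config flag are invented - only Node's perf_hooks API is real):

```ts
import { performance } from 'perf_hooks';

// Hypothetical wrapper so Task Manager doesn't repeat literal strings in
// performance.mark()/performance.measure() calls, and so instrumentation can
// be toggled by a config value instead of by changing the build.
class TaskManagerPerf {
  constructor(private readonly enabled: boolean) {}

  markStart(name: string) {
    if (this.enabled) performance.mark(`${name}_start`);
  }

  markEndAndMeasure(name: string) {
    if (!this.enabled) return;
    performance.mark(`${name}_end`);
    performance.measure(name, `${name}_start`, `${name}_end`);
  }
}

// Usage sketch (config flag is hypothetical):
// const perf = new TaskManagerPerf(config.enablePerformanceTracking);
// perf.markStart('markTaskAsRunning');
// /* ...do the work... */
// perf.markEndAndMeasure('markTaskAsRunning');
```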

Contributor Author

There shouldn't actually be a need to remove them at all.
We're removing them just because we don't need them in prod and figured it would make people feel more comfortable about including them in TM for the perf test.

I can see the value in using a wrapper to keep TM's code a little neater... I'll revisit that after merging as I feel this branch has lived too long and merge conflicts are piling up.

Contributor

I agree we should find another solution for this - I really don't like the idea of modifying the build based on an environment variable, and this will need to change soon as we work towards hermetic builds.

@@ -655,9 +728,10 @@ describe('TaskManagerRunner', () => {
await runner.run();

if (shouldBeValid) {
sinon.assert.notCalled(logger.warn);
expect(logger.warn).not.toHaveBeenCalled();
Member

Checking logger.warn() for calls seems very fragile, but it also seems like it will be fairly obvious what's going on if it does end up breaking (because some other warning occurred in between, or whatever). And it's been like this forever, I guess :-)
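
One way to make this a bit less fragile (a sketch only; the warning text and logger shape are invented) is to assert on the specific warning rather than on bare call counts:

```ts
// Assert on the content of the warning rather than just whether logger.warn
// was ever called, so an unrelated warning doesn't break the test.
const logger = { debug: jest.fn(), warn: jest.fn(), error: jest.fn() };

test('warns when the task returns an invalid result', async () => {
  // ...run the code under test with the mocked logger...
  logger.warn('invalid task result'); // stand-in for the call made by the runner

  expect(logger.warn).toHaveBeenCalledWith(
    expect.stringContaining('invalid task result')
  );
});
```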

Contributor Author

Haha yeah, I thought the same, but chose just to get rid of the sinon stuff as it was making things difficult to change.

I'll see if it's worth cleaning up as part of this.

Member

Like I said, it seems like it would be fairly obvious if it did break, and if that's true, not sure it's worth much time fixing. Something for a backlog card somewhere, probably never to be fixed hehehe.

Member

@mistic mistic left a comment

The changes in the files owned by the operations team look good to me.

@elasticmachine

💔 Build Failed

@elasticmachine

💚 Build Succeeded

Contributor

@mikecote mikecote left a comment

Code LGTM. I haven't looked into x-pack/test/plugin_api_perf, but maybe that's something @bmcconaghy would be better at reviewing?

@gmmorris
Contributor Author

@elasticmachine merge upstream

@elasticmachine

💚 Build Succeeded

@elasticmachine

💚 Build Succeeded

@gmmorris
Contributor Author

gmmorris commented Nov 5, 2019

@elasticmachine merge upstream

@elasticmachine

💚 Build Succeeded

@tylersmalley
Contributor

A heads up, this is failing the release manager: https://groups.google.com/a/elastic.co/d/msg/build-release-manager/QAhLJh5OecE/JGRa2EAvAQAJ

@tylersmalley
Contributor

We believe this will be resolved by #50090
