[Task manager] Parallelise task execution #48053
Conversation
…specified to bulkUpdate
💔 Build Failed
…ty tests to improve coverage
💔 Build Failed
💔 Build Failed
Not sure my review is worth an LGTM/approval since I'm very unfamiliar with the TM code. Basically I just reviewed the changes, and went back to look at the full source a few times to help get myself familiar with things.
Probably the most important thing I noticed was a setTimeout() without a try/catch, which in theory could cause Kibana to exit with no diagnostic logged on the error (I've seen this in my own plugins :-) ).
Made a bunch of other nit comments or things to discuss later, but in general LGTM.
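To illustrate the failure mode being described: if an async callback passed to setTimeout rejects and nothing catches it, the rejection escapes the timer and can take the process down with little diagnostic output. A minimal sketch of the defensive pattern (the names `schedulePoll`, `pollFn`, and the `Logger` shape are illustrative, not Task Manager's actual API):

```typescript
// Hypothetical sketch: wrap the body of an async setTimeout callback in
// try/catch so a rejected promise is logged instead of escaping the timer
// as an unhandled rejection (which can terminate Node, and hence Kibana).
type Logger = { error: (msg: string) => void };

function schedulePoll(pollFn: () => Promise<void>, logger: Logger, delayMs: number) {
  setTimeout(async () => {
    try {
      await pollFn();
    } catch (err) {
      // Swallow and log instead of letting the rejection escape the timer.
      logger.error(`poll failed: ${(err as Error).message}`);
    }
  }, delayMs);
}
```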
  }
  return TaskPoolRunResult.RanOutOfCapacity;
Was curious what happens when RanOutOfCapacity returns, and ... not much, besides some perf_hooks timing. And presumably useful for tests. Probably worthy of a logger debug message, with the number of leftover tasks.
Kinda relatedly, I wondered what happens to these tasks - I guess they got "claimed" but not run, so will eventually get recognized as claimable things later, perhaps via the claimOwnershipUntil property I saw above (30s). Seems like this sort of thing should be noted in the README, but it looks like the README is kinda outdated (and not changed in this PR).
Hmm, that's not actually true... and I made a similar mistake when I first read this code. This shows this code is still a little too side-effecty and complex. I want to make it easier to read after the perf work is done.
Prior to this PR this was just returning a boolean value and you'd have to try to figure out what that meant, so the main change here is that we use an enum, which I feel makes it easier to understand.
Anyway, to clarify what actually happens:
When RanOutOfCapacity is returned, it causes fill_pool to end the cycle and wait until the next scheduled interval. If it were to return one of the other options, the while(true) would cycle, leading to another inlined poll for work, meaning less dead time.
This logic predates me, but my understanding of it is that the idea is to let Task Manager "rest" before polling for more work, as all the workers are busy and can't pick up work anyway.
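The control flow described above can be sketched as a small loop. This is an illustrative reconstruction, not the actual Task Manager source — the function and enum member names are assumptions based on the discussion:

```typescript
// Hypothetical sketch of the fill_pool cycle: keep claiming and running work
// until the pool reports RanOutOfCapacity (or nothing is claimable), at which
// point return and let the scheduled interval trigger the next cycle.
enum TaskPoolRunResult {
  RunningAllClaimedTasks = 'RunningAllClaimedTasks',
  RanOutOfCapacity = 'RanOutOfCapacity',
}

async function fillPool(
  fetchTasks: () => Promise<string[]>,
  run: (tasks: string[]) => Promise<TaskPoolRunResult>
): Promise<void> {
  while (true) {
    const tasks = await fetchTasks();
    if (tasks.length === 0) return; // nothing claimable; wait for the interval
    const result = await run(tasks);
    if (result === TaskPoolRunResult.RanOutOfCapacity) {
      // All workers busy: end the cycle so Task Manager "rests" until the
      // next scheduled interval instead of spinning.
      return;
    }
    // Otherwise loop immediately: another inlined poll for work, less dead time.
  }
}
```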
Ah ... for some reason, I thought RanOutOfCapacity meant that it had ended up claiming a task but had no available worker to run it, so ... oh well, it will get picked up later.
I think updating the README to describe how this works would be helpful.
As I come to understand this stuff better (thx for the insights!), I have some wacky ideas - like, why wait for the full fill_pool interval if you find, while processing the tasks, that you have a lot of available workers? The assumption being that a fair number of tasks will complete before the interval runs again. The "notification" actions - slack, email, webhook, pagerDuty - will all be pretty quick (just an http request), but OTOH, I would expect notification actions to actually be queued to run pretty infrequently (who wants 100 slack messages per second? 😄).
We should probably start running the new SIEM alerts/signals stuff (or are you already?), see what the perf_hooks tell us - #49451
I haven't looked at SIEM at all, definitely worth trying.
Regarding fill_pool, unless I misunderstood what you're asking, we shouldn't wait if we have available workers, as the loop will repeat. We only end the loop (and hence wait out the interval) if we have no workers available.
x-pack/test/plugin_api_perf/test_suites/task_manager/task_manager_perf_integration.ts
@@ -18,6 +18,25 @@
 */

module.exports = (_, options = {}) => {
  const overrides = [];
nit: I'm stoked to see usage of perf_hooks in the code, but not crazy about the implementation. Probably worth a discussion on how we might do this better. Eg, I don't think it should change the build bits, that seems fragile - perhaps it should be a Kibana config value; there are so many literal strings in performance.mark() and performance.measure(); both of the previous points lead me to believe we should wrap the perf_hooks bits in a class/interface, somehow.
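A minimal sketch of the wrapper being suggested — this is not Kibana code, just one way to keep the mark/measure name strings in one place and make the whole thing toggleable via config instead of stripped at build time (the `PerfTimer` class and its naming scheme are illustrative):

```typescript
// Hypothetical sketch: wrap Node's perf_hooks behind a small class so that
// mark/measure names are derived from one prefix instead of scattered string
// literals, and so instrumentation can be disabled by a config flag.
import { performance } from 'perf_hooks';

class PerfTimer {
  constructor(private readonly enabled: boolean, private readonly prefix: string) {}

  start(name: string) {
    if (this.enabled) performance.mark(`${this.prefix}:${name}:start`);
  }

  end(name: string) {
    if (!this.enabled) return;
    performance.mark(`${this.prefix}:${name}:end`);
    // Records a measure entry spanning the start/end marks.
    performance.measure(
      `${this.prefix}:${name}`,
      `${this.prefix}:${name}:start`,
      `${this.prefix}:${name}:end`
    );
  }
}
```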
There shouldn't actually be a need to remove them at all.
We're removing them just because we don't need them in prod and figured it would make people feel more comfortable about including them in TM for the perf test.
I can see the value in using a wrapper to keep TM's code a little neater... I'll revisit that after merging as I feel this branch has lived too long and merge conflicts are piling up.
I agree we should find another solution for this - I really don't like the idea of modifying the build based on an environment variable, and this will need to change soon as we work towards hermetic builds.
@@ -655,9 +728,10 @@ describe('TaskManagerRunner', () => {
  await runner.run();

  if (shouldBeValid) {
-   sinon.assert.notCalled(logger.warn);
+   expect(logger.warn).not.toHaveBeenCalled();
checking logger.warn() for calls seems very fragile, but also seems like it will be fairly obvious what's going on if it does end up breaking (because some other warning occurred in between, or whatever). And it's been like this forever I guess :-)
Haha yeah, I thought the same, but chose just to get rid of the sinon stuff as it was making things difficult to change.
I'll see if it's worth cleaning up as part of this.
Like I said, it seems like it would be fairly obvious if it did break, and if that's true, not sure it's worth much time fixing. Something for a backlog card somewhere, probably never to be fixed hehehe.
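If it ever did land on a backlog card, one way to make the assertion less fragile would be to match warnings by content rather than asserting that warn was never called at all. A sketch with plain closures standing in for the jest mocks used in the real tests (`createCapturingLogger` and `warningsMatching` are hypothetical helpers, not Kibana APIs):

```typescript
// Hypothetical sketch: capture warning messages and let tests assert on the
// subset relevant to the behaviour under test, so an unrelated warning added
// elsewhere doesn't break the assertion.
function createCapturingLogger() {
  const warnings: string[] = [];
  return {
    warn: (msg: string) => {
      warnings.push(msg);
    },
    // Return only the captured warnings whose text matches the pattern.
    warningsMatching: (pattern: RegExp) => warnings.filter((m) => pattern.test(m)),
  };
}
```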
The changes in the files owned by the operations team look good to me
💔 Build Failed
💚 Build Succeeded
Code LGTM, I haven't looked into x-pack/test/plugin_api_perf but maybe that's something @bmcconaghy may be better at reviewing?
@elasticmachine merge upstream
💚 Build Succeeded
💚 Build Succeeded
@elasticmachine merge upstream
💚 Build Succeeded
A heads up this is failing the release manager: https://groups.google.com/a/elastic.co/d/msg/build-release-manager/QAhLJh5OecE/JGRa2EAvAQAJ
We believe this will be resolved by #50090
Summary
This PR includes three key changes:
- Use refresh: false where possible, in place of wait_for, in order to speed up Task Manager's internal workflow

closes #49628
Checklist
Use strikethroughs to remove checklist items you don't feel are applicable to this PR.
- ~~This was checked for cross-browser compatibility, including a check against IE11~~
- ~~This was checked for keyboard-only and screenreader accessibility~~

Note for Reviewers
The addition of Middleware.beforeMarkRunning is there just for performance testing these changes, and I plan on stripping it out before integrating into Master.
So unless someone makes an argument for keeping this middleware, feel free to skip that part of the changes in your review and make your LGTM contingent on removal of this addition.