-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] add timeout for "stop" lifecycle #90432
Conversation
Pinging @elastic/kibana-core (Team:Core) |
const result = (await Promise.race([ | ||
this.plugins.get(pluginName)!.stop(), | ||
new Promise((resolve) => setTimeout(() => resolve({ delayed: true }), 30 * Sec)), | ||
])) as { delayed?: boolean }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern might cause some unexpected memory issues in node. We need to clear the timeout
after the promise has been resolved to avoid build up of zombie timeouts in the worst cases. In normal scenarioas we're just avoiding piling up of unnecessary timeouts.
export const PROMISE_TIMEOUT = {}; // we can use a Symbol here too
export async function raceTo<T>(promise: Promise<T>, ms: number): Promise<T | typeof PROMISE_TIMEOUT> {
let timeout: NodeJS.Timeout;
try {
const result = await Promise.race([
promise,
new Promise<typeof PROMISE_TIMEOUT>((resolve) => {
timeout = setTimeout(() => resolve(PROMISE_TIMEOUT), ms);
}),
]);
if (timeout!) clearTimeout(timeout);
return result;
} catch (err) {
if (timeout!) clearTimeout(timeout);
throw err;
}
}
const result = raceTo(this.plugins.get(pluginName)!.stop(), 30 * Sec);
if (result === PROMISE_TIMEOUT) {
this.log.warn(...);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern might cause some unexpected memory issues in node.
Is there an example of a study of how bad it affects memory consumption? I've never seen such a recommendation in nodejs docs. This code is short-living: The Core runs the function ~100 times as a part of the shutdown process. I'm okay to add the proposal, but it doesn't look like a critical functionality, but might be unnecessary micro-optimisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but might be unnecessary micro-optimisation.
It is more about a good coding practice too.
You're resolving the promise yet there is a trailing timeout that wants to keep the lexical scope of the wrapping function for no reason. As for the memory effect for this, in addition to preventing the function context to be marked for garbage collection in time, you are adding unncessary cycles to the 'message queue' before triggering the clearing of those timeouts in the "event loop".
In another other words; If all stop functions are resolved immediately and nodejs is ready to exit. You will be holding node from closing the process for an additional 30 seconds until the timeouts are unreffed and cleared from the message queue.
Is there an example of a study of how bad it affects memory consumption?
I dont have an article handy but this is how nodejs works. You can dig into how the timer works here however if you're interested:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A slight memory leak while exiting is probably fine, but it would be worth testing that we don't increase the exit delay unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A slight memory leak while exiting is probably fine
I'm slightly confused because this is changing a couple of lines in the implementation. Anyways you can test that it does delay the shutdown by calling this:
// node test.js
let timeout1;
let timeout2;
const result = Promise.race([
new Promise((resolve) => {
timeout1 = setTimeout(() => resolve('ok'), 100);
}),
new Promise((resolve) => {
timeout2 = setTimeout(() => resolve('ok'), 30000);
}),
]).then(() => {
clearTimeout(timeout1), clearTimeout(timeout2);
})
If you comment out the clearTimeout
line you'll get a delay of 30 seconds before the node process exits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will be holding node from closing the process for an additional 30 seconds until the timeouts are unreffed and cleared from the message queue.
That's a fair point - I'm going to refactor the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably just unref()
the timeout objects. That's what we used to do ALL THE TIME (in a previous job). Especially for shutdown logic. That will keep node from "staying around" until the timeouts "complete". I believe clearing the timeout will similarly "unref" it, but you can do the unref early, right after the timeout is created.
I don't think I'd worry so much about other resources (lexically accessed) still being held onto. Kibana will be coming down, so there's probably not much help in marking objects as inaccessible for the GC to reuse. Hopefully the GC is not allocating (much) more object space during this phase anyway!
As an optimization, it would be good to parallelize any not-related plugins so you can overlap some of these timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmuellr That's an option. It's not consistent with setup
and start
phases, though: lifecycles are executed in the order defined by the dependency graph. In theory, they can even expose stop
contracts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a code snippet for clearing the timeouts we create
|
||
const result = (await Promise.race([ | ||
this.plugins.get(pluginName)!.stop(), | ||
new Promise((resolve) => setTimeout(() => resolve({ delayed: true }), 30 * Sec)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since async stop isn't deprecated and we want to give plugins a chance to do a graceful shutdown (#83612) 10s feels too short.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. Do we want to change the Plugin
and AsyncPlugin
's stop
signature to reflect that returning a Promise is supported? atm it's stop?(): void;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pgayvallet I'd wait for an official decision on that matter in #83612
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the exit delay issue raised by @Bamieh this looks good to me.
|
||
const result = (await Promise.race([ | ||
this.plugins.get(pluginName)!.stop(), | ||
new Promise((resolve) => setTimeout(() => resolve({ delayed: true }), 30 * Sec)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since async stop isn't deprecated and we want to give plugins a chance to do a graceful shutdown (#83612) 10s feels too short.
const result = (await Promise.race([ | ||
this.plugins.get(pluginName)!.stop(), | ||
new Promise((resolve) => setTimeout(() => resolve({ delayed: true }), 30 * Sec)), | ||
])) as { delayed?: boolean }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A slight memory leak while exiting is probably fine, but it would be worth testing that we don't increase the exit delay unnecessarily.
💔 Build Failed
Failed CI StepsTest FailuresKibana Pipeline / general / Chrome X-Pack UI Functional Tests.x-pack/test/functional/apps/transform/feature_controls/transform_security·ts.transform feature controls security global all privileges (aka kibana_admin) should not render the "Stack" sectionStandard Out
Stack Trace
Kibana Pipeline / general / Chrome X-Pack UI Functional Tests.x-pack/test/functional/apps/transform/feature_controls/transform_security·ts.transform feature controls security global all privileges (aka kibana_admin) should not render the "Stack" sectionStandard Out
Stack Trace
Kibana Pipeline / general / X-Pack API Integration Tests.x-pack/test/api_integration/apis/ml/modules/setup_module·ts.apis Machine Learning modules module setup sets up module data for logs_ui_categories with prefix, startDatafeed true and estimateModelMemory trueStandard Out
Stack Trace
Metrics [docs]Async chunks
Page load bundle
History
To update your PR or re-run it, just comment with: |
Summary
While testing #25526, I found that the Kibana server stuck in the process of waiting for
logEvents
pluginstop
(somewhere within https://github.com/elastic/kibana/blob/4584a8b570402aa07832cf3e5b520e5d2cfa7166/x-pack/plugins/event_log/server/es/cluster_client_adapter.ts) whenstop
is caused by an exception duringstart
lifecycle. cc @pmuellrSo I added 30sec timeout for the
stop
lifecycle.Its difference, from the timeout for the
setup
andstart
lifecycles, is that on timeout it doesn't raise an exception but stops waiting and moves on to the next plugin.Plugin API Changes
Kibana plugin system has got a concept of asynchronous lifecycles for all the Kibana plugins https://www.elastic.co/guide/en/kibana/current/kibana-platform-plugin-api.html#plugin-lifecycles
We added a timeout (30 sec. by default) to make sure that
stop
lifecycle doesn't stop the shutdown process for the whole Kibana server. If a plugin doesn't completestop
lifecycle in 30 sec., Kibana moves on to the next plugin.