-
Notifications
You must be signed in to change notification settings - Fork 287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] [Resource Leak] Gracefully Close ZMQ Context upon kernel shutdown #548
Conversation
During kernel startup we create new context of ZMQ. However during kernel shutdown we are not shuting down ZMQ context This leads to leaks of ZMQ reaper sockets.
Ping @JohanMabille. |
Well done, @jalpan-randeri! I'll buy you a beverage of choice if we meet in person. |
Outstanding @jalpan-randeri - thank you! I ditto @blink1073's offer! This appears to resolve the leaks we're seeing in EG: jupyter-server/enterprise_gateway#762 (comment) |
The recent commit makes sense for where we want to be and |
@kevin-bates Sure, will you guide me on how to achieve this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job tracking down the issue! I think the ideal fix is to share the global Context rather than requiring separate Contexts per kernel manager, which this PR does. I thought we were already doing that, but apparently not.
The standard pattern for contexts in zmq applications is to use a single process-global Context object (facilitated with zmq.Context.instance()
as the default provider if passing around is unwieldy, but passing around is more explicit).
The fact that we are creating Contexts for each KernelManager is actually part of the problem - creating several too many threads and sockets for internal zmq operations per kernel, rather than using the same Context throughout the application except where explicitly required.
I think this issue might be resolved by a smaller change, switching the default context: def _context_default(self):
return zmq.Context() to default to the shared global context (more idiomatic use of zmq): def _context_default(self):
return zmq.Context.instance() (this default generator occurs in a few places) If this change still leaks FDs, that means there are sockets being left open that should be closed, not just Contexts. These should be identified and closed explicitly in stop_channels. |
Thanks @minrk.
You're correct. You had added a global context 9 years ago, that was switched just over a year ago. The change to a local context first took place in #437 to address multiprocessing aspects of kernels. It seems that reverting to a global context would jeopardize those efforts (there were 3 additional PRs related to this, although those appear to be mostly version-management related, back-ports, etc)? FWIW, I've reverted my EG env back to using a global ZMQ context and do not see the leaks occurring (implying we haven't lost track of the sockets). 👍 |
@minrk ! Glad to see you on more threads again :) Yeah as @kevin-bates pointed out we definitely don't want to go back to globals as they break all concurrency in higher abstractions, causing deadlocks against the global state. So it looks like we need to dig into zmq context cleanup more to figure out why we're not cleaning the FDs up under ideal exit conditions? |
Or did I misunderstand the latest comments? Is it that we still see FD leaks with this PR or that this PR resolves them similarly to switching back to a global context? |
We no longer see the leaks with this PR or if we were to switch back to the global context as Min suggested. My primary concern is compatibility with existing installations. After thinking further I think we need at least two releases - a backport to 5.3.x and a current release. I'd prefer we have a compatible fix in 6.1 as well, then have the incompatible fix in, I guess, 6.2? |
Looking at system level, without this fix we are leaking only ZMQbg/Reaper sockets, these seems to be coming from ZMQ context only. We create 5 sockets during kernel startup and we shutdown these 5 sockets at shutdown, Left behinds are only The ask was to make ZMQ.context global, which will lead to race conditions, we have exclusive 1:1 zmq context per kernel manager. This PR fixes it by destroying zmq context at kernel shutdown. I tested with running my script concurrently and did not observe any failures I am just stuck at making change backward compatible. |
I don't think we need to backport to 5.x unless we really want to (and shouldn't block this PR if we do) since Python 2 is fully unsupported now. The 6.x release has been fairly stable relative with only some carryover of 5.x issues that weren't resolved like this one. Do you have some code @kevin-bates that still relies on Python 2 for EG that would need this fix? What's the incompatible fix vs compatible fix referring to? I think this would be a 6.2 release since there's some minor contract changes and a cleanup change. Are we good with a merge for now and follow up with release strategy? |
I agree with moving forward and merging now, but I believe we need at least one backport.
We have users that are running EG relative to 5.3.4. I don't know if they're running Python 2 but when taking master to check this change out, they ran into another 6.x change (port caching) that breaks them and they shouldn't have to take a new EG release to have their leaks fixed. Enterprise users are slow-moving, so would really prefer an option for them via 5.3.5.
I may not be understanding python bindings correctly, but when moved into I believe the "compatible" change needs to either be the original commit (that lies outside There may be other applications besides EG affected by this and those applications shouldn't require their own patch releases to resolve their leaks, IMO. FWIW, I just pulled the candidate fix w/o changing EG and this manifests itself as a TypeError due to the positional argument change.
|
Ahh I see.
That just looks like the shutdown_kernel call was missed in the PR's changes: https://github.com/jupyter/jupyter_client/blob/master/jupyter_client/multikernelmanager.py#L183 We can try making the interface more backwards compatible before releasing. Should we just accept |
Alright, if you think this change is important for those users we can make a back-port for this after we get a 6.x release. |
Thank you Matthew - this change is critical for EG users since EG represents a long-running server who's sole purpose in life is to create and shutdown kernels and leaks like these add up quickly. Regarding the change itself, |
@jalpan-randeri Would you mind adding the backwards comparability for |
This diff applies the following changes:
and has no warnings with lab or classic in my tests (because contexts are never terminated) What do folks think? |
This won't be guaranteed to be called until after all references are cleaned, and even then it's not guaranteed to run if the process is exiting. So if someone has a local
That seems like a good idea to add. |
What's the best way forward? It does make sense to me to have explicit cleanup, not relying on del, but shutdown does not seem like the right place to close a context (it's quite reasonable to close sockets after shutdown, and we see that jupyterlab does exactly that, hence the errors reported above). We have already that 'explicit cleanup' is So maybe the right thing is to keep this as-is—manager terminates the context on shutdown when the context is not shared, and then ensure for contexts like the notebook server that the context is shared (cherry-pick 4cdbf54 onto this PR)? We'll get implicit cleanup on shutdown in single KernelManager use cases, one context by default in MultiKernel cases (notebook servers, most notably). |
I think that could be a working plan. I think we want to emphasize that client cleanup and if they don't we do a best effort to cleanup for them. That should cover current behavior for lab and classic unless I am mistaken, and we can ask that lab move to a more deliberate cleanup mode over time? Unless I missed anything? |
Should one of us apply the commit to this PR and get it so folks can test it out to confirm behavior before merge? |
just to confirm, we want to provide cleanup_resources as explicit call and not do it as part of shutdown method. something like this commit 4cdbf54 |
@jalpan-randeri yes, please cherry-pick that commit onto your PR. Then I think we can proceed with testing. @MSeal to be clear, I believe lab is doing nothing wrong - explicitly closing sockets after shutdown in general should be fine, so I don't think we need to ask them to change. If there is a change to make forcing the closure of sockets before shutdown, that would belong in the notebook server, collecting and closing all |
cherry-picking commit gave me following error Not sure if this is expected behavior
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just have the one comment regarding the warning message.
I tried this out using EnterpriseGateway and it works great - other than not seeing any warnings. I do not see leaks happening nor do I see issues stemming from JupyterLab closing the WebSocket after shutdown. Two thumbs up! 👍 👍
jupyter_client/manager.py
Outdated
def cleanup(self, connection_file=True): | ||
"""Clean up resources when the kernel is shut down""" | ||
warnings.warn("Method cleanup(connection_file=True) is deprecated, use cleanup_resources(restart=False).", | ||
DeprecationWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in this previous comment, I'm not seeing any warning messages produced in the server console/logs. I think we should adjust the class such that it produces at least one message. (Using FutureWarning
does just that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with a merge now. Will let kevin or min do the merge here though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Let's let @minrk confirm.
Looks great to me, thanks everyone! |
@kevin-bates if you want to do a backport in order to make sure more versions of the notebook server get it, backporting just the shared context for MultiKernelManager should be the smallest way to achieve that. |
Thanks @minrk. I'm kinda new to back-porting, but it sounds like I'd cherry-pick 2d5ba4b and 1ce1d97 to a 5.x branch. Since no branch for back-ports exists, would we just create 5.x from the 5.3.4 tag? or should that branch be named 5.3.x? If someone could set up the target branch, I'd be happy to back-port the appropriate commits and help get a 5.3.5 out. |
5.x branch is now at 5.3.4, if you want to make the PR backporting those two commits. I don't think we need A.B.x branches except where we are backporting past the latest minor revision for a given major revision (e.g. 5.2.x). |
I can help kick off a releases for these things. Shall I do the |
|
Thank you Matt! |
wow!. thank you Kevin, Matt, Min RK 😃 |
As part of kernel manager instance creation, we create new context of ZMQ.
This kernel manager are responsible for managing lifecycle of kernel process.
At the end of kernel process lifecycle we are just simply killing kernel process and not closing ZMQ context. This leaked ZMQ contexts holds sockets and end up exhausting system resources
This PR fixes this by gracefully shutting down ZMQ context during kernel shutdown process
Tests
Manual Tests Steps:
kernel_lifecycle.sh
Track socket usage using following cmd
Local Macbook Test Runs:
After 3 executions of kernel_lifecycle script, socket usage