Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BugFix] [Resource Leak] Gracefully Close ZMQ Context upon kernel shutdown #548

Merged
merged 14 commits into from
Jun 22, 2020

Conversation

jalpan-randeri
Copy link

@jalpan-randeri jalpan-randeri commented May 24, 2020

As part of kernel manager instance creation, we create new context of ZMQ.
This kernel manager are responsible for managing lifecycle of kernel process.
At the end of kernel process lifecycle we are just simply killing kernel process and not closing ZMQ context. This leaked ZMQ contexts holds sockets and end up exhausting system resources

This PR fixes this by gracefully shutting down ZMQ context during kernel shutdown process

Tests

  • Enhanced existing tests to track zmq context checks

Manual Tests Steps:

  1. start jupyter server
  2. run kernel_lifecycle.sh script 3 times
  3. record socket usage

kernel_lifecycle.sh

kernel_id=$(curl --silent --show-error  -X POST 'localhost:8888/api/kernels' | jq .id | tr -d '"')
echo $kernel_id
read -n 2 -p 'Stop?'
echo 'stopping '
curl -X DELETE "localhost:8888/api/kernels/$kernel_id"

Track socket usage using following cmd

macos:   lsof -p <pid of jupyter> | grep unix | wc -l
linux:   lsof -p <pid of jupyter> | grep event | wc -l

Local Macbook Test Runs:

After 3 executions of kernel_lifecycle script, socket usage

  • without fix
$ lsof -p 17207 | grep unix | wc -l 
      21
  • with fix
$ lsof -p 16076 | grep unix | wc -l 
       3

During kernel startup we create new context of ZMQ.
However during kernel shutdown we are not shuting down ZMQ context
This leads to leaks of ZMQ reaper sockets.
@jalpan-randeri jalpan-randeri marked this pull request as ready for review May 24, 2020 08:21
@jalpan-randeri
Copy link
Author

@SylvainCorlay
Copy link
Member

Ping @JohanMabille.

@blink1073
Copy link
Contributor

Well done, @jalpan-randeri! I'll buy you a beverage of choice if we meet in person.

@kevin-bates
Copy link
Member

Outstanding @jalpan-randeri - thank you! I ditto @blink1073's offer! This appears to resolve the leaks we're seeing in EG: jupyter-server/enterprise_gateway#762 (comment)

jupyter_client/manager.py Outdated Show resolved Hide resolved
jupyter_client/manager.py Outdated Show resolved Hide resolved
@kevin-bates
Copy link
Member

The recent commit makes sense for where we want to be and restart=False is common, but I think we need a b/c change for existing users of KerneManager subclasses that need this awesome fix. Any chance we could add the original commit (or compatible) into a 6.1.4 release? Then, I guess we'd need 6.2.0 for the parameter change.

@jalpan-randeri
Copy link
Author

@kevin-bates Sure, will you guide me on how to achieve this?
I found a tag for 6.1.4, but i am not sure how to cherry-pick original commit to this release.
Alternatively, I was thinking reusing connection_file parameter to cleanup ZMQ context, but then it made logic convoluted

Copy link
Member

@minrk minrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job tracking down the issue! I think the ideal fix is to share the global Context rather than requiring separate Contexts per kernel manager, which this PR does. I thought we were already doing that, but apparently not.

The standard pattern for contexts in zmq applications is to use a single process-global Context object (facilitated with zmq.Context.instance() as the default provider if passing around is unwieldy, but passing around is more explicit).

The fact that we are creating Contexts for each KernelManager is actually part of the problem - creating several too many threads and sockets for internal zmq operations per kernel, rather than using the same Context throughout the application except where explicitly required.

jupyter_client/manager.py Outdated Show resolved Hide resolved
@minrk
Copy link
Member

minrk commented May 26, 2020

I think this issue might be resolved by a smaller change, switching the default context:

    def _context_default(self):
        return zmq.Context()

to default to the shared global context (more idiomatic use of zmq):

    def _context_default(self):
        return zmq.Context.instance()

(this default generator occurs in a few places)

If this change still leaks FDs, that means there are sockets being left open that should be closed, not just Contexts. These should be identified and closed explicitly in stop_channels. destroy is a way to ensure there are no sockets still open, even if we lost track of them, but in the long run it's probably better to not lose track of those sockets in the first place.

@kevin-bates
Copy link
Member

Thanks @minrk.

I think the ideal fix is to share the global Context rather than requiring separate Contexts per kernel manager, which this PR does. I thought we were already doing that, but apparently not.

You're correct. You had added a global context 9 years ago, that was switched just over a year ago. The change to a local context first took place in #437 to address multiprocessing aspects of kernels. It seems that reverting to a global context would jeopardize those efforts (there were 3 additional PRs related to this, although those appear to be mostly version-management related, back-ports, etc)?

FWIW, I've reverted my EG env back to using a global ZMQ context and do not see the leaks occurring (implying we haven't lost track of the sockets). 👍

@MSeal
Copy link
Contributor

MSeal commented May 26, 2020

@minrk ! Glad to see you on more threads again :)

Yeah as @kevin-bates pointed out we definitely don't want to go back to globals as they break all concurrency in higher abstractions, causing deadlocks against the global state. So it looks like we need to dig into zmq context cleanup more to figure out why we're not cleaning the FDs up under ideal exit conditions?

@MSeal
Copy link
Contributor

MSeal commented May 26, 2020

Or did I misunderstand the latest comments? Is it that we still see FD leaks with this PR or that this PR resolves them similarly to switching back to a global context?

@kevin-bates
Copy link
Member

We no longer see the leaks with this PR or if we were to switch back to the global context as Min suggested.

My primary concern is compatibility with existing installations. After thinking further I think we need at least two releases - a backport to 5.3.x and a current release. I'd prefer we have a compatible fix in 6.1 as well, then have the incompatible fix in, I guess, 6.2?

@jalpan-randeri
Copy link
Author

Looking at system level, without this fix we are leaking only ZMQbg/Reaper sockets, these seems to be coming from ZMQ context only. We create 5 sockets during kernel startup and we shutdown these 5 sockets at shutdown, Left behinds are only ZMQbg/Reaperone, shutting down zmq.context cleans them up.

The ask was to make ZMQ.context global, which will lead to race conditions, we have exclusive 1:1 zmq context per kernel manager.

This PR fixes it by destroying zmq context at kernel shutdown. I tested with running my script concurrently and did not observe any failures

I am just stuck at making change backward compatible.

@MSeal
Copy link
Contributor

MSeal commented May 26, 2020

I don't think we need to backport to 5.x unless we really want to (and shouldn't block this PR if we do) since Python 2 is fully unsupported now. The 6.x release has been fairly stable relative with only some carryover of 5.x issues that weren't resolved like this one. Do you have some code @kevin-bates that still relies on Python 2 for EG that would need this fix?

What's the incompatible fix vs compatible fix referring to?

I think this would be a 6.2 release since there's some minor contract changes and a cleanup change.

Are we good with a merge for now and follow up with release strategy?

@kevin-bates
Copy link
Member

I agree with moving forward and merging now, but I believe we need at least one backport.

Do you have some code @kevin-bates that still relies on Python 2 for EG that would need this fix?

We have users that are running EG relative to 5.3.4. I don't know if they're running Python 2 but when taking master to check this change out, they ran into another 6.x change (port caching) that breaks them and they shouldn't have to take a new EG release to have their leaks fixed. Enterprise users are slow-moving, so would really prefer an option for them via 5.3.5.

What's the incompatible fix vs compatible fix referring to?

I may not be understanding python bindings correctly, but when moved into cleanup() it replaces the positional parameter and flips its default value from True to False. This is incompatible with subclasses since they will be calling the superclass relative to the old parameter value (that defaults to True) - not to mention I think there would be a positional argument mismatch. As a result, applications that subclass KernelManager and override cleanup() will break.

I believe the "compatible" change needs to either be the original commit (that lies outside cleanup()) or placed in cleanup(), but with the current parameter (and its default) retained.

There may be other applications besides EG affected by this and those applications shouldn't require their own patch releases to resolve their leaks, IMO.

FWIW, I just pulled the candidate fix w/o changing EG and this manifests itself as a TypeError due to the positional argument change.

[E 200526 12:47:55 web:1670] Uncaught exception DELETE /api/kernels/99367f67-c6f5-4029-8bbf-c9ad1d82afa1 (9.160.104.100)
    HTTPServerRequest(protocol='http', host='yarn-eg-node-1.fyre.ibm.com:8888', method='DELETE', uri='/api/kernels/99367f67-c6f5-4029-8bbf-c9ad1d82afa1', version='HTTP/1.1', remote_ip='9.160.104.100')
    Traceback (most recent call last):
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/web.py", line 1592, in _execute
        result = yield result
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
        yielded = next(result)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 66, in delete
        yield maybe_future(km.shutdown_kernel(kernel_id))
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/kernelmanager.py", line 302, in shutdown_kernel
        return super(MappingKernelManager, self).shutdown_kernel(kernel_id, now=now)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/jupyter_client/multikernelmanager.py", line 183, in shutdown_kernel
        km.shutdown_kernel(now=now, restart=restart)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/jupyter_client/manager.py", line 381, in shutdown_kernel
        self.cleanup(restart=restart)
    TypeError: cleanup() got an unexpected keyword argument 'restart'

@MSeal
Copy link
Contributor

MSeal commented May 26, 2020

Ahh I see.

/opt/anaconda2/envs/py3/lib/python3.6/site-packages/jupyter_client/multikernelmanager.py", line 183, in shutdown_kernel
km.shutdown_kernel(now=now, restart=restart)

That just looks like the shutdown_kernel call was missed in the PR's changes: https://github.com/jupyter/jupyter_client/blob/master/jupyter_client/multikernelmanager.py#L183

We can try making the interface more backwards compatible before releasing. Should we just accept connection_file as an argument mapped to not restart with a deprecation warning printed that this will be removed in... 7.0? Should make it completely compatible and clear to users what to change to in the future.

@MSeal
Copy link
Contributor

MSeal commented May 26, 2020

Enterprise users are slow-moving, so would really prefer an option for them via 5.3.5.

Alright, if you think this change is important for those users we can make a back-port for this after we get a 6.x release.

@kevin-bates
Copy link
Member

Thank you Matthew - this change is critical for EG users since EG represents a long-running server who's sole purpose in life is to create and shutdown kernels and leaks like these add up quickly.

Regarding the change itself, shutdown_kernel() is fine. It's all about the signature change on cleanup() that's the compatibility issue. I agree that leaving connection_file=True in place with a deprecation warning is the best way forward. I'm sorry that's the case but it's the correct thing to do.

@MSeal
Copy link
Contributor

MSeal commented May 26, 2020

@jalpan-randeri Would you mind adding the backwards comparability for connection_file with a warning to the PR? Or would you prefer we tackle that after merge?

@minrk
Copy link
Member

minrk commented Jun 8, 2020

This diff applies the following changes:

  • destroy context in __del__ instead of cleanup_resources (having a cleanup method makes sense to me, but doing this action in shutdown seems less clear)
  • share context across children by default in MultiKernelManager

and has no warnings with lab or classic in my tests (because contexts are never terminated)

What do folks think?

@MSeal
Copy link
Contributor

MSeal commented Jun 8, 2020

destroy context in del instead of cleanup_resources

This won't be guaranteed to be called until after all references are cleaned, and even then it's not guaranteed to run if the process is exiting. So if someone has a local kc = km.client snippet in their code the cleanup behavior would stop executing. Typically I've seen with IO cleanup, you explicitly call it when you're done with that IO and make the del method call as well just in case for this reason.

share context across children by default in MultiKernelManager

That seems like a good idea to add.

@minrk
Copy link
Member

minrk commented Jun 12, 2020

What's the best way forward? It does make sense to me to have explicit cleanup, not relying on del, but shutdown does not seem like the right place to close a context (it's quite reasonable to close sockets after shutdown, and we see that jupyterlab does exactly that, hence the errors reported above). We have already that 'explicit cleanup' is manager.context.term(), but a wrapper method that closes everything we might have opened is a good idea, too. Should we stop calling cleanup_resources() in shutdown and make it an explicit call for consumers of the KernelManager API?

So maybe the right thing is to keep this as-is—manager terminates the context on shutdown when the context is not shared, and then ensure for contexts like the notebook server that the context is shared (cherry-pick 4cdbf54 onto this PR)? We'll get implicit cleanup on shutdown in single KernelManager use cases, one context by default in MultiKernel cases (notebook servers, most notably).

@MSeal
Copy link
Contributor

MSeal commented Jun 14, 2020

So maybe the right thing is to keep this as-is—manager terminates the context on shutdown when the context is not shared, and then ensure for contexts like the notebook server that the context is shared (cherry-pick 4cdbf54 onto this PR)? We'll get implicit cleanup on shutdown in single KernelManager use cases, one context by default in MultiKernel cases (notebook servers, most notably).

I think that could be a working plan. I think we want to emphasize that client cleanup and if they don't we do a best effort to cleanup for them. That should cover current behavior for lab and classic unless I am mistaken, and we can ask that lab move to a more deliberate cleanup mode over time? Unless I missed anything?

@MSeal
Copy link
Contributor

MSeal commented Jun 14, 2020

Should one of us apply the commit to this PR and get it so folks can test it out to confirm behavior before merge?

@jalpan-randeri
Copy link
Author

just to confirm, we want to provide cleanup_resources as explicit call and not do it as part of shutdown method. something like this commit 4cdbf54

@minrk
Copy link
Member

minrk commented Jun 15, 2020

@jalpan-randeri yes, please cherry-pick that commit onto your PR. Then I think we can proceed with testing.

@MSeal to be clear, I believe lab is doing nothing wrong - explicitly closing sockets after shutdown in general should be fine, so I don't think we need to ask them to change. If there is a change to make forcing the closure of sockets before shutdown, that would belong in the notebook server, collecting and closing all WebSocketHandlers associated with a kernel as part of shutdown.

@jalpan-randeri
Copy link
Author

cherry-picking commit gave me following error

Not sure if this is expected behavior

AttributeError: 'super' object has no attribute '__del__'
Exception ignored in: <function MultiKernelManager.__del__ at 0x10a0d9290>
Traceback (most recent call last):
  File "/jupyter_client/jupyter_client/multikernelmanager.py", line 132, in __del__
    super().__del__()
AttributeError: 'super' object has no attribute '__del__'

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have the one comment regarding the warning message.

I tried this out using EnterpriseGateway and it works great - other than not seeing any warnings. I do not see leaks happening nor do I see issues stemming from JupyterLab closing the WebSocket after shutdown. Two thumbs up! 👍 👍

def cleanup(self, connection_file=True):
"""Clean up resources when the kernel is shut down"""
warnings.warn("Method cleanup(connection_file=True) is deprecated, use cleanup_resources(restart=False).",
DeprecationWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in this previous comment, I'm not seeing any warning messages produced in the server console/logs. I think we should adjust the class such that it produces at least one message. (Using FutureWarning does just that.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@MSeal MSeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with a merge now. Will let kevin or min do the merge here though.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Let's let @minrk confirm.

@jalpan-randeri jalpan-randeri requested a review from minrk June 20, 2020 00:20
@minrk minrk merged commit b5dfd16 into jupyter:master Jun 22, 2020
@minrk
Copy link
Member

minrk commented Jun 22, 2020

Looks great to me, thanks everyone!

@minrk
Copy link
Member

minrk commented Jun 22, 2020

@kevin-bates if you want to do a backport in order to make sure more versions of the notebook server get it, backporting just the shared context for MultiKernelManager should be the smallest way to achieve that.

@kevin-bates
Copy link
Member

Thanks @minrk. I'm kinda new to back-porting, but it sounds like I'd cherry-pick 2d5ba4b and 1ce1d97 to a 5.x branch. Since no branch for back-ports exists, would we just create 5.x from the 5.3.4 tag? or should that branch be named 5.3.x?

If someone could set up the target branch, I'd be happy to back-port the appropriate commits and help get a 5.3.5 out.

@minrk
Copy link
Member

minrk commented Jun 25, 2020

5.x branch is now at 5.3.4, if you want to make the PR backporting those two commits. I don't think we need A.B.x branches except where we are backporting past the latest minor revision for a given major revision (e.g. 5.2.x).

@MSeal
Copy link
Contributor

MSeal commented Jun 25, 2020

I can help kick off a releases for these things. Shall I do the 6.1.4 release now?

@MSeal
Copy link
Contributor

MSeal commented Jun 30, 2020

6.1.5 and 5.3.5 are both released with this change included. I messed up the 6.1.4 upload (had the wrong commits) and deleted that release right away.

@kevin-bates
Copy link
Member

Thank you Matt!

@jalpan-randeri
Copy link
Author

wow!. thank you Kevin, Matt, Min RK 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants