Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leaking file descriptors? #494

Open
msimacek opened this issue Dec 1, 2017 · 13 comments
Open

Leaking file descriptors? #494

msimacek opened this issue Dec 1, 2017 · 13 comments

Comments

@msimacek
Copy link
Contributor

msimacek commented Dec 1, 2017

In koschei we have a simple tail_messages loop. Recently it started dying (approximately every hour) with:

IOError: [Errno 24] Too many open files: '/etc/pki/fedmsg/ca.crt'

Looking at /proc/PID/fd when it's still running, it contains mostly the following repeated many times:

lrwx------. 1 koschei koschei 64 Nov 30 18:19 27 -> socket:[3118780725]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 270 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 271 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 272 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 273 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 274 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 275 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 276 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 277 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 278 -> anon_inode:[eventfd]
lrwx------. 1 koschei koschei 64 Nov 30 18:21 279 -> anon_inode:[eventfd]

CC @mizdebsk

@pypingou
Copy link
Member

pypingou commented Dec 1, 2017

Out of curiosity, which fedmsg version do you run?

@msimacek
Copy link
Contributor Author

msimacek commented Dec 1, 2017

fedmsg-0.19.1-1.fc26.noarch

@msimacek
Copy link
Contributor Author

msimacek commented Dec 1, 2017

That seems old, I though infra repos always backported the current version

@puiterwijk
Copy link
Contributor

@msimacek not if the normal stable repos (like, in this case f26) contain newer.

@jeremycline
Copy link
Member

Is this still present in the latest release?

@mizdebsk
Copy link
Member

mizdebsk commented Dec 8, 2017

Yes, it is still reproducible. In 2 hours it leaked several hundreds file descriptors.

[root@koschei-backend01 ~][PROD]# date
Fri Dec  8 10:54:04 UTC 2017
[root@koschei-backend01 ~][PROD]# ps -p 7093 -wo pid,lstart,cmd
  PID                  STARTED CMD
 7093 Fri Dec  8 08:33:28 2017 /usr/libexec/koschei/koschei-watcher -m koschei.backend.main watcher
[root@koschei-backend01 ~][PROD]# ls /proc/7093/fd | wc -l
629
[root@koschei-backend01 ~][PROD]# rpm -q fedmsg
fedmsg-1.0.1-4.fc26.noarch

@mizdebsk
Copy link
Member

mizdebsk commented Dec 8, 2017

Interestingly, fedmsg-tail-3 opens only 11 fds when ran locally, but it immediately has ~250 of them open when I run it from koschei-backend01.stg. fedmsg-tail-2 is much worse - 12 fds locally vs ~600 when ran from koschei-backend01 (prod). Could this be a problem with our fedmsg config (/etc/fedmsg.d/) in Fedora infra?

@jeremycline
Copy link
Member

Ahh, I realized #484 never made it into a release. It rewrites the CRL and CA loading. I'm guessing that will fix this problem (assuming what's leaking is the CRL/CA, which wouldn't surprise me). If it's easy for you to try the tip of the current development branch, that'd be great to confirm.

Interestingly, fedmsg-tail-3 opens only 11 fds when ran locally, but it immediately has ~250 of them open when I run it from koschei-backend01.stg.

I'm guessing if you counted up the "endpoints" in the various configurations it'd come out to 230 or something. zmq will open a TCP connection to each of those. I'm surprised there's such a big difference between Python 2 and 3, though. Off the top of my head I can't think of what would account for that.

@msimacek
Copy link
Contributor Author

msimacek commented Feb 9, 2018

We have upgraded our production to Fedora 27, Python 3 and fedmsg-1.1.0-1.fc27.noarch. The problem still occurs.

@jeremycline
Copy link
Member

Can you provide a minimal reproducer + config? Figuring out when this started would be helpful, as well. Does it occur using m2crypto for validation, or only for cryptography+pyopenssl? Only in Python 3?

@mizdebsk
Copy link
Member

Config: https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/fedmsg/base/templates

Reproducer:

#!/usr/bin/python3
import fedmsg
import os
for _, _, topic, msg in fedmsg.tail_messages():
    nopen = len(os.listdir('/proc/self/fd'))
    print("Received msg topic: {}, number open files: {}".format(topic, nopen))

Example output:

Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 281
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 282
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 284
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 284
Received msg topic: org.fedoraproject.stg.buildsys.task.state.change, number open files: 285

@mizdebsk
Copy link
Member

I left the reproducer script running and after some time it died with:

Traceback (most recent call last):
  File "/tmp/fm-reproducer-494.py", line 5, in <module>
    nopen = len(os.listdir('/proc/self/fd'))
OSError: [Errno 24] Too many open files: '/proc/self/fd'

@jeremycline
Copy link
Member

I was hoping for a config file rather than a set of ansible templates. I cannot reproduce this with the reproducer script and a minimal configuration that points to 'tcp://fedoraproject.org:9940', so having the exact configuration you're running with in a single file would be helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants