
Log filled with "Exception occurred while Subscriber handling stream: Already reading" #49147

Closed
furgerf opened this issue Aug 16, 2018 · 47 comments
Assignees
Labels
Bug (broken, incorrect, or confusing behavior), fixed-pls-verify (fix is linked, bug author to confirm fix), P3 (Priority 3), severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a workaround), ZD (the issue is related to a Zendesk customer support ticket)
Milestone

Comments

@furgerf

furgerf commented Aug 16, 2018

Description of Issue/Question

Hi, on a server running salt-minion and salt-master, the minion's log file quickly filled up today with the following message:

2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,764 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

Stopping the minion service didn't stop the process that was writing to the log file, and neither did SIGTERM; I only got rid of it with SIGKILL (PID 25008 here):

$ ps aux | grep [p]ython
root     24851  0.0  0.1 101600  6048 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     24854  0.0  0.2 548420 11264 ?        Sl   Aug10   0:09 /usr/bin/python /usr/bin/salt-minion
root     25005  0.0  0.1 101600  6412 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     25008 99.9  0.6 549964 27508 ?        Sl   Aug10 8403:05 /usr/bin/python /usr/bin/salt-minion

Setup

I don't know what's relevant here...

Steps to Reproduce Issue

Don't know either :/

Versions Report

Salt Version:
           Salt: 2017.7.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.6 (default, Nov 23 2017, 15:49:48)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 3.19.0-80-generic
         system: Linux
        version: Ubuntu 14.04 trusty

The minion, master, and common packages are all the same version.

@Ch3LL
Contributor

Ch3LL commented Aug 16, 2018

Was there possibly anything before these log messages appeared that might be more telling as to why this occurred?

Any other reason why this would have occurred? Were you running a particular command, state, etc.?

@Ch3LL Ch3LL added the info-needed (waiting for more info) label Aug 16, 2018
@Ch3LL Ch3LL added this to the Blocked milestone Aug 16, 2018
@furgerf
Author

furgerf commented Aug 16, 2018

As far as I know, no manual activity caused this. This looks like the beginning; there don't seem to be any earlier messages:

2018-08-10 13:01:57,661 [salt.log.setup   ][WARNING ][24854] Temporary logging is already configured
2018-08-10 13:01:57,720 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,721 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,723 [salt.log.setup   ][WARNING ][24854] Logfile logging already configured
2018-08-10 13:01:57,727 [salt.log.setup   ][ERROR   ][24854] An un-handled exception was caught by salt's global exception handler:
IOError: [Errno 5] Input/output error
Traceback (most recent call last):
  File "/usr/bin/salt-minion", line 26, in <module>
    salt_minion()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 204, in salt_minion
    minion.start()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 340, in start
    super(Minion, self).start()
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1032, in start
    self.prepare()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 299, in prepare
    self.shutdown(1)
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 397, in shutdown
    self.__class__.__name__, (exitmsg or '')).strip()))
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1045, in shutdown
    self.exit(exitcode, exitmsg)
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 268, in exit
    optparse.OptionParser.exit(self, status, msg)
  File "/usr/lib/python2.7/optparse.py", line 1573, in exit
    sys.stderr.write(msg)
IOError: [Errno 5] Input/output error
2018-08-10 13:02:01,413 [salt.log.setup   ][WARNING ][25008] Temporary logging is already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,475 [salt.log.setup   ][WARNING ][25008] Logfile logging already configured
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

The master's log file contains many stack traces because it's unable to create temp files, as well as messages like these:

2018-08-14 11:57:41,801 [salt.utils.gitfs ][WARNING ][13836] Update lock file is present for gitfs remote 'http://XXX', skipping. If this warning persists, it is possible that the update process was interrupted, but the lock could also have been manually set. Removing /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk or running 'salt-run cache.clear_git_lock gitfs type=update' will allow updates to continue for this remote.
2018-08-14 11:58:42,687 [salt.utils.gitfs ][WARNING ][13836] gitfs_global_lock is enabled and update lockfile /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk is present for gitfs remote 'http://XXX'.

Sorry, I could've added that right at the start, hope this helps!

@Ch3LL
Contributor

Ch3LL commented Aug 17, 2018

Ping @saltstack/team-transport: any ideas here? Or follow-up questions, since this might be difficult to replicate.

@Ch3LL Ch3LL added Pending-Discussion (the issue or pull request needs more discussion before it can be closed or merged) and removed info-needed (waiting for more info) labels Aug 17, 2018
@DmitryKuzmenko
Contributor

This is a race condition somewhere in the event bus client usage on the minion side. We use IPCMessageSubscriber to read the event bus. It's a singleton, so there can be two instances of the bus client sharing one subscriber, and since it's asynchronous, one tries to read while the other hasn't finished yet. I can't say more without a detailed analysis.
@furgerf is it easily reproducible on your side? Would it be possible for you to reproduce this on patched minion code to get better logging?
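For illustration only, here is a minimal standalone sketch of that collision (not Salt's actual IPC code), assuming Tornado 4.x as in the versions report above; the socket pair and buffer size are invented for the example:

import socket

from tornado import gen, ioloop
from tornado.iostream import IOStream


@gen.coroutine
def main():
    # Standalone illustration, not Salt code: two consumers share one stream.
    left, right = socket.socketpair()    # stand-in for the minion's IPC socket
    shared_stream = IOStream(right)      # the shared ("singleton") subscriber stream

    # The first consumer starts an async read; no data has arrived yet,
    # so the read stays pending.
    first_read = shared_stream.read_bytes(4096, partial=True)

    # The second consumer reads the same stream while the first read is
    # still pending -- Tornado rejects this with "Already reading".
    try:
        yield shared_stream.read_bytes(4096, partial=True)
    except Exception as exc:
        print("second reader failed: %s" % exc)

    left.send(b"payload")                # unblock the first reader
    data = yield first_read
    print("first reader got: %r" % data)


if __name__ == "__main__":
    ioloop.IOLoop.current().run_sync(main)

On Tornado 4.x the second read fails immediately with the same "Already reading" message seen in the minion log, which is consistent with two event clients driving one shared subscriber.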

@furgerf
Author

furgerf commented Aug 20, 2018

Hi @DmitryKuzmenko, no, unfortunately I have no idea how to reproduce it. From your description, it sounds like you might be able to trigger this more easily by creating a bunch of bus clients and having them read from the same subscriber, if that's an option in the relevant code.
Let me know if I can help with more info.

@DmitryKuzmenko
Contributor

@furgerf it's OK.
As for the problem, I think we could try to catch this by code review. There are not many places where that subscriber is used.
Did you do anything on that minion right before? I see a strange exception report in your log 4 seconds before the issue started. Did you restart the minion manually or via cron? Did the minion execute some state or scheduled task?

@Ch3LL I propose to mark this as a bug and put it in the backlog. If we get more reports here, we can increase the severity.

@Ch3LL Ch3LL added Bug (broken, incorrect, or confusing behavior), severity-low (4th level, cosmetic problems, workaround exists), P4 (Priority 4) and removed Pending-Discussion (the issue or pull request needs more discussion before it can be closed or merged) labels Aug 20, 2018
@Ch3LL Ch3LL modified the milestones: Blocked, Approved Aug 20, 2018
@furgerf
Author

furgerf commented Aug 21, 2018

@DmitryKuzmenko I've looked into the logs and see that there was some activity before the problem started: someone ran salt '*' cmd.run 'curl ifconfig.me'. There were no restarts, though.

Like I mentioned before, this server contains both a minion and a master; the master manages around 40 minions in total. About half of them are currently connected.

@DmitryKuzmenko
Contributor

@furgerf thank you for the info. I'll review the related code when I have time. If you reproduce it again, any new information would be much appreciated.

@viq
Contributor

viq commented Nov 9, 2018

Any news on this? I was just bitten by it on a couple of minions, e.g.:

# salt-call --versions-report
Salt Version:
           Salt: 2018.3.3
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.9 (default, Sep 25 2018, 20:42:16)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.0.5
 
System Versions:
           dist: debian 8.11 
         locale: UTF-8
        machine: x86_64
        release: 3.16.0-7-amd64
         system: Linux
        version: debian 8.11 

@rochen-joshua

We're seeing the same here in our setup; about 10 or so of our minions are hitting this and filling up drive space rapidly.

@rochen-joshua

FYI: we have about 300 or so minions, so it's not global, and reinstalling did not fix the issue.

@tomwalsh
Contributor

Started getting this as well. Only a few minions, but it fills up the drive rather quickly, causing other issues.

@Ch3LL
Contributor

Ch3LL commented Nov 26, 2018

@DmitryKuzmenko it seems more people are reporting this, so I will increase the severity. Any idea when you can get to this issue? I know it's currently on your backlog behind higher-priority issues.

@Ch3LL Ch3LL added severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a workaround), P3 (Priority 3) and removed severity-low (4th level, cosmetic problems, workaround exists), P4 (Priority 4) labels Nov 26, 2018
@DmitryKuzmenko
Contributor

Had some time today to work on this. I confirmed what I wrote here above. Tomorrow I'll provide a description and a solution.

@DmitryKuzmenko
Contributor

DmitryKuzmenko commented Nov 28, 2018

I have no code yet, but I promised to post an update, so here it is: the main problem is that IPCMessageSubscriber is a singleton. It's used by the event object, which can be instantiated anywhere just to wait for some event, handle it, and forget it. So we can have a number of event instances at once, all sharing one subscriber. If several instances run read_async simultaneously, they all call stream.read_bytes, which fails for every subsequent call while the first one isn't done yet. Moreover, to conform with the subscriber pattern, all the instances have to get the same data when something is received.

The simplest way to fix this is to make IPCMessageSubscriber a non-singleton. That requires making IPCClient a non-singleton and slightly rewriting IPCMessageClient to implement the singleton behavior itself (or making it a non-singleton too). This could have a performance impact as well.

Another idea is to split the singleton: a non-singleton handler class holds a strong reference to the singleton instance, and the singleton holds weak references to the handlers. This would let the singleton know who is waiting for event data and pass the data to all event objects.

Give me some time to think about the best solution. I'll provide one soon.
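As a rough sketch of the second idea (hypothetical class names, not the actual Salt change): a single reader object holds weak references to per-consumer handlers, each handler keeps a strong reference to the reader, and every payload read from the stream is fanned out to all live handlers:

# Hypothetical sketch of the weak-reference fan-out, not the actual Salt fix.
import weakref


class SingletonReader(object):
    # Owns the single read loop and weakly references its handlers.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(SingletonReader, cls).__new__(cls)
            cls._instance._handlers = weakref.WeakSet()
        return cls._instance

    def register(self, handler):
        # Weak reference: a handler that is no longer referenced elsewhere
        # disappears from the set automatically.
        self._handlers.add(handler)

    def dispatch(self, payload):
        # Called once per message read from the stream; every live handler
        # sees the same data, which matches the subscriber pattern.
        for handler in list(self._handlers):
            handler.handle(payload)


class EventHandler(object):
    # Non-singleton per-consumer object holding a strong reference to the reader.
    def __init__(self):
        self.reader = SingletonReader()   # strong ref keeps the reader alive
        self.received = []
        self.reader.register(self)

    def handle(self, payload):
        self.received.append(payload)


if __name__ == "__main__":
    a, b = EventHandler(), EventHandler()
    SingletonReader().dispatch(b"event-data")
    print(a.received, b.received)         # both handlers receive the same payload

Because the reader only holds weak references, event objects that are created, used once, and forgotten drop out of the dispatch set automatically instead of competing for the shared read.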

@sfozz

sfozz commented Feb 26, 2019

+1 here; we're seeing it on a selection of CentOS 7 systems as well. Interestingly, the most recent one was also running cPanel.

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Oct 30 2018, 23:45:53)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

@DejanKolar

Updating the kernel to the latest release didn't solve the problem.

@DmitryKuzmenko
Contributor

@DejanKolar the PR is not merged yet. I'm waiting for the CI system to be fixed so I can re-run the tests. I expect this to be done in a couple of days.

@KChandrashekhar KChandrashekhar modified the milestones: March 29th, Priority Apr 5, 2019
@gsagwan

gsagwan commented Apr 7, 2019

I am also facing the same issue after upgrading to 2018.3.3.

@sfishback

I observed the same issue after upgrading from salt-2018.3.0-1.el7 to salt-2019.2.0-1.el7 today. Our monitoring noticed /var was almost full. Doing a "systemctl stop salt-minion.service" did not work. I had to do a "kill -9" on two remaining orphaned salt-minion daemons to get the error messages in /var/log/salt/minion to stop.

@sfishback

Also, for what it's worth: this is RHEL 7 running kernel 3.10.0-957.el7.x86_64. kernel-3.10.0-957.10.1.el7.x86_64 is installed, but the system hasn't been rebooted yet to apply it.

@jimmyat

jimmyat commented May 1, 2019

I'm still seeing this issue on a large number of servers; any idea when the fix will be available?

@afletch
Contributor

afletch commented May 1, 2019

Perhaps @DmitryKuzmenko can update everyone on which release this will land in.

@sfozz

sfozz commented May 1, 2019

I've added the following on cPanel servers, which appears to have addressed this:

cpanel_local:
  file.directory:
    - name:     /etc/cpanel/local
    - makedirs: True

cpanel_ignored:
  file.managed:
    - name:     /etc/cpanel/local/ignore_outdated_services
    - contents:
      - salt-minion

@DmitryKuzmenko
Contributor

Currently the fix is in all branches (2018.3, 2019.2, and develop). It will be included in the upcoming releases.

@anitakrueger
Contributor

Could you reference the PR or commit, by any chance? I got bitten by this today and it'd be interesting to see what the fix looks like. Thanks!

@DmitryKuzmenko
Contributor

DmitryKuzmenko commented May 14, 2019

@anitakrueger it was a long road, finally done in #52445 and #52564.

@anitakrueger
Contributor

Awesome, much appreciated!

@elipsion

elipsion commented May 22, 2019

We also encountered this late last evening; roughly 2-4% of our 250 minions were affected.

Any idea when the fix will hit http://repo.saltstack.com/?

@DmitryKuzmenko
Contributor

@elipsion in June, I think.

@doesitblend doesitblend added the ZD (the issue is related to a Zendesk customer support ticket) label Jun 28, 2019
@doesitblend
Collaborator

ZD-3869

@garethgreenaway garethgreenaway added the fixed-pls-verify (fix is linked, bug author to confirm fix) label Jul 1, 2019
@perfecto25
Contributor

Had the same issue running Salt 2018.3.4 on CentOS 7.

It kept writing to the minion log every millisecond and caused one of my hosts to run out of disk space.

I think the reason was multiple salt processes running at the same time; you can see PID 1114 has a strange "1-20:07.." in the time column, maybe a zombie?

root@testbox /var# ps -ef | grep salt
root      1114     1 30 Oct11 ?        1-20:07:34 /usr/bin/python /usr/bin/salt-minion
root     10122  1114  0 Oct15 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     25791     1  7 11:18 ?        00:00:14 /usr/bin/python /usr/bin/salt-minion
root     26009     1  0 11:19 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     26165 24719 10 11:21 pts/0    00:00:00 grep --color=auto salt

Killing all salt processes and restarting the minion fixed the issue. Luckily this was a test box; if it were production and this caused a disk space issue, it would be a much bigger problem.

Not sure how to reproduce it again.

@nlsrchtr

@elipsion in June I think.

Since this ticket is still open, I assume this has not been released yet, right? Or could you tell us the version number to update to, please?

@DmitryKuzmenko
Contributor

The fix for this is in the codebase and, as far as I can see, was released with 2019.2.1.
