
Log filled with "Exception occurred while Subscriber handling stream: Already reading" #49147

Closed
furgerf opened this issue Aug 16, 2018 · 47 comments
Assignees
Labels
Bug (broken, incorrect, or confusing behavior), fixed-pls-verify (fix is linked, bug author to confirm fix), P3 (Priority 3), severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a workaround), ZD (the issue is related to a Zendesk customer support ticket)
Milestone

Comments

@furgerf

furgerf commented Aug 16, 2018

Description of Issue/Question

Hi, on a server running salt-minion and salt-master, the minion's log file quickly filled up today with the following message:

2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,764 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

Stopping the minion service didn't stop the process that was writing to the log file, and neither did SIGTERM; I only got rid of it with SIGKILL (PID 25008 here):

$ ps aux | grep [p]ython
root     24851  0.0  0.1 101600  6048 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     24854  0.0  0.2 548420 11264 ?        Sl   Aug10   0:09 /usr/bin/python /usr/bin/salt-minion
root     25005  0.0  0.1 101600  6412 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     25008 99.9  0.6 549964 27508 ?        Sl   Aug10 8403:05 /usr/bin/python /usr/bin/salt-minion

Setup

I don't know what's relevant here...

Steps to Reproduce Issue

Don't know either :/

Versions Report

Salt Version:
           Salt: 2017.7.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.6 (default, Nov 23 2017, 15:49:48)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 3.19.0-80-generic
         system: Linux
        version: Ubuntu 14.04 trusty

The minion, master, and common packages are all the same version.

@Ch3LL
Contributor

Ch3LL commented Aug 16, 2018

Was there possibly anything before these log messages appeared that might be more telling as to why this occurred?

Any other reason why this would have occurred? Were you running a particular command, state, etc.?

@Ch3LL Ch3LL added the info-needed (waiting for more info) label Aug 16, 2018
@Ch3LL Ch3LL added this to the Blocked milestone Aug 16, 2018
@furgerf
Author

furgerf commented Aug 16, 2018

As far as I know, no manual activity caused this. This looks like the beginning; there don't seem to be any earlier messages:

2018-08-10 13:01:57,661 [salt.log.setup   ][WARNING ][24854] Temporary logging is already configured
2018-08-10 13:01:57,720 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,721 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,723 [salt.log.setup   ][WARNING ][24854] Logfile logging already configured
2018-08-10 13:01:57,727 [salt.log.setup   ][ERROR   ][24854] An un-handled exception was caught by salt's global exception handler:
IOError: [Errno 5] Input/output error
Traceback (most recent call last):
  File "/usr/bin/salt-minion", line 26, in <module>
    salt_minion()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 204, in salt_minion
    minion.start()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 340, in start
    super(Minion, self).start()
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1032, in start
    self.prepare()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 299, in prepare
    self.shutdown(1)
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 397, in shutdown
    self.__class__.__name__, (exitmsg or '')).strip()))
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1045, in shutdown
    self.exit(exitcode, exitmsg)
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 268, in exit
    optparse.OptionParser.exit(self, status, msg)
  File "/usr/lib/python2.7/optparse.py", line 1573, in exit
    sys.stderr.write(msg)
IOError: [Errno 5] Input/output error
2018-08-10 13:02:01,413 [salt.log.setup   ][WARNING ][25008] Temporary logging is already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,475 [salt.log.setup   ][WARNING ][25008] Logfile logging already configured
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

The master's log file contains many stack traces because it's unable to create temp files, as well as messages like these:

2018-08-14 11:57:41,801 [salt.utils.gitfs ][WARNING ][13836] Update lock file is present for gitfs remote 'http://XXX', skipping. If this warning persists, it is possible that the update process was interrupted, but the lock could also have been manually set. Removing /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk or running 'salt-run cache.clear_git_lock gitfs type=update' will allow updates to continue for this remote.
2018-08-14 11:58:42,687 [salt.utils.gitfs ][WARNING ][13836] gitfs_global_lock is enabled and update lockfile /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk is present for gitfs remote 'http://XXX'.

Sorry, I could've added that right at the start, hope this helps!

@Ch3LL
Contributor

Ch3LL commented Aug 17, 2018

Ping @saltstack/team-transport: any ideas here? Or follow-up questions, since this might be difficult to replicate.

@Ch3LL Ch3LL added Pending-Discussion (the issue or pull request needs more discussion before it can be closed or merged) and removed info-needed (waiting for more info) labels Aug 17, 2018
@DmitryKuzmenko
Contributor

This is a race condition somewhere in the event bus client usage on the minion side. We use IPCMessageSubscriber to read the event bus. It's a singleton, so there can be two instances of the bus client sharing one subscriber, and since it's asynchronous, one tries to read while the other hasn't finished yet. I can't say more without a detailed analysis.
@furgerf is it easily reproducible on your side? Would it be possible for you to reproduce this on patched minion code to get better logging?
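For illustration only, here is a minimal standalone sketch of that collision (not Salt's actual IPC code), assuming Tornado 4.x as in the versions report above; the socket pair and buffer size are invented for the example:

import socket

from tornado import gen, ioloop
from tornado.iostream import IOStream


@gen.coroutine
def main():
    # Standalone illustration, not Salt code: two consumers share one stream.
    left, right = socket.socketpair()    # stand-in for the minion's IPC socket
    shared_stream = IOStream(right)      # the shared ("singleton") subscriber stream

    # The first consumer starts an async read; no data has arrived yet,
    # so the read stays pending.
    first_read = shared_stream.read_bytes(4096, partial=True)

    # The second consumer reads the same stream while the first read is
    # still pending -- Tornado rejects this with "Already reading".
    try:
        yield shared_stream.read_bytes(4096, partial=True)
    except Exception as exc:
        print("second reader failed: %s" % exc)

    left.send(b"payload")                # unblock the first reader
    data = yield first_read
    print("first reader got: %r" % data)


if __name__ == "__main__":
    ioloop.IOLoop.current().run_sync(main)

On Tornado 4.x the second read fails immediately with the same "Already reading" message seen in the minion log, which is consistent with two event clients driving one shared subscriber.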

@furgerf
Author

furgerf commented Aug 20, 2018

Hi @DmitryKuzmenko, no, unfortunately I have no idea how to reproduce it. From your description, it sounds like you might be able to trigger this more easily by creating a bunch of bus clients and having them read from the same subscriber, if that's an option in the relevant code.
Let me know if I can help with more info.

@DmitryKuzmenko
Contributor

@furgerf it's OK.
As for the problem, I think we could try to catch this by code review. There are not many places where that subscriber is used.
Did you do anything on that minion right before? I see a strange exception report in your log 4 seconds before the issue started. Did you restart the minion manually or via cron? Did the minion execute some state or scheduled task?

@Ch3LL I propose to mark this as a bug and put it in the backlog. If we get more reports here, we can increase the severity.

@Ch3LL Ch3LL added Bug (broken, incorrect, or confusing behavior), severity-low (4th level, cosmetic problems, workaround exists), P4 (Priority 4) and removed Pending-Discussion (the issue or pull request needs more discussion before it can be closed or merged) labels Aug 20, 2018
@Ch3LL Ch3LL modified the milestones: Blocked, Approved Aug 20, 2018
@furgerf
Author

furgerf commented Aug 21, 2018

@DmitryKuzmenko I've looked into the logs and see that there was some activity before the problem started: someone ran salt '*' cmd.run 'curl ifconfig.me'. There were no restarts, though.

Like I mentioned before, this server contains both a minion and a master; the master manages around 40 minions in total. About half of them are currently connected.

@DmitryKuzmenko
Contributor

@furgerf thank you for the info. I'll review the related code when I have time. If you reproduce it again, any new information would be much appreciated.

@viq
Contributor

viq commented Nov 9, 2018

Any news on this? I was just bitten by it on a couple of minions, e.g.:

# salt-call --versions-report
Salt Version:
           Salt: 2018.3.3
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.9 (default, Sep 25 2018, 20:42:16)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.0.5
 
System Versions:
           dist: debian 8.11 
         locale: UTF-8
        machine: x86_64
        release: 3.16.0-7-amd64
         system: Linux
        version: debian 8.11 

@rochen-joshua

We're seeing the same here in our setup; about 10 or so of our minions are hitting this and filling up drive space rapidly.

@rochen-joshua

FYI: we have about 300 or so minions, so it's not global, and reinstalling did not fix the issue.

@tomwalsh
Contributor

Started getting this as well. Only a few minions, but it fills up the drive rather quickly, causing other issues.

@Ch3LL
Contributor

Ch3LL commented Nov 26, 2018

@DmitryKuzmenko it seems more people are reporting this, so I will increase the severity. Any idea when you can get to this issue? I know it's currently on your backlog behind higher-priority issues.

@Ch3LL Ch3LL added severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a workaround), P3 (Priority 3) and removed severity-low (4th level, cosmetic problems, workaround exists), P4 (Priority 4) labels Nov 26, 2018
@DmitryKuzmenko
Contributor

Had some time today to work on this. I confirmed what I wrote here above. Tomorrow I'll provide a description and a solution.

@DmitryKuzmenko
Contributor

DmitryKuzmenko commented Nov 28, 2018

I have no code yet, but I promised to post an update, so here it is: the main problem is that IPCMessageSubscriber is a singleton. It's used by the event object, which can be instantiated anywhere just to wait for some event, handle it, and forget it. So we can have a number of event instances at once, all sharing one subscriber. If several instances run read_async simultaneously, they all call stream.read_bytes, which fails for every subsequent call while the first one isn't done yet. Moreover, to conform with the subscriber pattern, all the instances have to get the same data when something is received.

The simplest way to fix this is to make IPCMessageSubscriber a non-singleton. That requires making IPCClient a non-singleton and slightly rewriting IPCMessageClient to implement the singleton behavior itself (or making it a non-singleton too). This could have a performance impact as well.

Another idea is to split the singleton: a non-singleton handler class holds a strong reference to the singleton instance, and the singleton holds weak references to the handlers. This would let the singleton know who is waiting for event data and pass the data to all event objects.

Give me some time to think about the best solution. I'll provide one soon.
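As a rough sketch of the second idea (hypothetical class names, not the actual Salt change): a single reader object holds weak references to per-consumer handlers, each handler keeps a strong reference to the reader, and every payload read from the stream is fanned out to all live handlers:

# Hypothetical sketch of the weak-reference fan-out, not the actual Salt fix.
import weakref


class SingletonReader(object):
    # Owns the single read loop and weakly references its handlers.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(SingletonReader, cls).__new__(cls)
            cls._instance._handlers = weakref.WeakSet()
        return cls._instance

    def register(self, handler):
        # Weak reference: a handler that is no longer referenced elsewhere
        # disappears from the set automatically.
        self._handlers.add(handler)

    def dispatch(self, payload):
        # Called once per message read from the stream; every live handler
        # sees the same data, which matches the subscriber pattern.
        for handler in list(self._handlers):
            handler.handle(payload)


class EventHandler(object):
    # Non-singleton per-consumer object holding a strong reference to the reader.
    def __init__(self):
        self.reader = SingletonReader()   # strong ref keeps the reader alive
        self.received = []
        self.reader.register(self)

    def handle(self, payload):
        self.received.append(payload)


if __name__ == "__main__":
    a, b = EventHandler(), EventHandler()
    SingletonReader().dispatch(b"event-data")
    print(a.received, b.received)         # both handlers receive the same payload

Because the reader only holds weak references, event objects that are created, used once, and forgotten drop out of the dispatch set automatically instead of competing for the shared read.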

@sfozz

sfozz commented Feb 26, 2019

+1 here; we're seeing it on a selection of CentOS 7 systems as well. Interestingly, the most recent one was also running cPanel.

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Oct 30 2018, 23:45:53)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

@DejanKolar

Updating the kernel to the latest release didn't solve the problem.

@DmitryKuzmenko
Contributor

@DejanKolar the PR is not merged yet. I'm waiting for the CI system to be fixed so I can re-run the tests. I expect this to be done in a couple of days.

@KChandrashekhar KChandrashekhar modified the milestones: March 29th, Priority Apr 5, 2019
@gsagwan

gsagwan commented Apr 7, 2019

I am also facing the same issue after upgrading to 2018.3.3.

@sfishback

I observed the same issue after upgrading from salt-2018.3.0-1.el7 to salt-2019.2.0-1.el7 today. Our monitoring noticed /var was almost full. Doing a "systemctl stop salt-minion.service" did not work. I had to do a "kill -9" on two remaining orphaned salt-minion daemons to get the error messages in /var/log/salt/minion to stop.

@sfishback

Also, for what it's worth: this is RHEL 7 running kernel 3.10.0-957.el7.x86_64. kernel-3.10.0-957.10.1.el7.x86_64 is installed, but the system hasn't been rebooted yet to apply it.

@jimmyat

jimmyat commented May 1, 2019

I'm still seeing this issue on a large number of servers; any idea when the fix will be available?

@afletch
Contributor

afletch commented May 1, 2019

Perhaps @DmitryKuzmenko can update everyone on which release this will land in.

@sfozz

sfozz commented May 1, 2019

I've added the following on cPanel servers, which appears to have addressed this:

cpanel_local:
  file.directory:
    - name:     /etc/cpanel/local
    - makedirs: True

cpanel_ignored:
  file.managed:
    - name:     /etc/cpanel/local/ignore_outdated_services
    - contents:
      - salt-minion

@DmitryKuzmenko
Contributor

Currently the fix is in all branches (2018.3, 2019.2, and develop). It will be included in the upcoming releases.

@anitakrueger
Contributor

Could you reference the PR or commit, by any chance? I got bitten by this today and it'd be interesting to see what the fix looks like. Thanks!

@DmitryKuzmenko
Contributor

DmitryKuzmenko commented May 14, 2019

@anitakrueger it was a long road, finally done in #52445 and #52564.

@anitakrueger
Contributor

Awesome, much appreciated!

@elipsion

elipsion commented May 22, 2019

We also encountered this late last evening; roughly 2-4% of our 250 minions were affected.

Any idea when the fix will hit http://repo.saltstack.com/?

@DmitryKuzmenko
Contributor

@elipsion in June, I think.

@doesitblend doesitblend added the ZD (the issue is related to a Zendesk customer support ticket) label Jun 28, 2019
@doesitblend
Collaborator

ZD-3869

@garethgreenaway garethgreenaway added the fixed-pls-verify (fix is linked, bug author to confirm fix) label Jul 1, 2019
@perfecto25
Contributor

Had the same issue running Salt 2018.3.4 on CentOS 7.

It kept writing to the minion log every millisecond and caused one of my hosts to run out of disk space.

I think the reason was multiple salt processes running at the same time; you can see PID 1114 has a strange "1-20:07.." in the time column, maybe a zombie?

root@testbox /var# ps -ef | grep salt
root      1114     1 30 Oct11 ?        1-20:07:34 /usr/bin/python /usr/bin/salt-minion
root     10122  1114  0 Oct15 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     25791     1  7 11:18 ?        00:00:14 /usr/bin/python /usr/bin/salt-minion
root     26009     1  0 11:19 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     26165 24719 10 11:21 pts/0    00:00:00 grep --color=auto salt

Killing all salt processes and restarting the minion fixed the issue. Luckily this was a test box; if it were production and this caused a disk space issue, it would be a much bigger problem.

Not sure how to reproduce it again.

@nlsrchtr

@elipsion in June I think.

Since this ticket is still open, I assume this has not been released yet, right? Or could you tell us the version number to update to, please?

@DmitryKuzmenko
Contributor

The fix for this is in the codebase and, as far as I can see, was released with 2019.2.1.
