Log filled with "Exception occurred while Subscriber handling stream: Already reading" #49147
Comments
Was there possibly anything before these log messages appeared that might be more telling as to why this occurred? Any other reason why this would have occurred? Were you running a particular command, state, etc.? |
As far as I know, no manual activity caused this. This looks like the beginning; there don't seem to be any earlier messages:
The master's log file contains many stacktraces because it's unable to create temp files, as well as messages like these:
Sorry, I could've added that right at the start, hope this helps! |
Ping @saltstack/team-transport, any ideas here? Or follow-up questions, since this might be difficult to replicate. |
This is a race condition somewhere in the event bus client usage on the minion side. We use the IPCMessageSubscriber to read the event bus. It's a singleton, so there could be 2 instances of the bus client that use one subscriber, and since it's asynchronous, one tries to read while another one hasn't finished yet. I can't say more without a detailed analysis. |
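To make the race easier to picture, here is a minimal, hypothetical sketch (plain asyncio, made-up class name, not Salt's actual Tornado-based IPCMessageSubscriber code) of two clients sharing one singleton reader, where the second concurrent read fails because the first one hasn't completed:

import asyncio

class SharedSubscriber:
    """Stand-in for a singleton subscriber that allows only one read at a time."""
    def __init__(self):
        self._reading = False

    async def read(self):
        if self._reading:
            # mirrors the "Already reading" failure seen in the minion log
            raise RuntimeError("Already reading")
        self._reading = True
        try:
            await asyncio.sleep(0.1)  # pretend to wait for data on the IPC stream
            return "event"
        finally:
            self._reading = False

async def main():
    sub = SharedSubscriber()  # the shared singleton
    results = await asyncio.gather(sub.read(), sub.read(), return_exceptions=True)
    print(results)            # one "event", one RuntimeError("Already reading")

asyncio.run(main())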
Hi @DmitryKuzmenko, no, unfortunately I have no idea how to reproduce it. From your description, it sounds like you might be able to trigger this more easily by creating a bunch of bus clients and having them read from the same subscriber, if that's an option in the relevant code. |
@furgerf it's OK. @Ch3LL I propose to mark it as a bug and put it in the backlog. If we get more reports here, we can increase the severity. |
@DmitryKuzmenko I've looked into the logs and see that there was some activity before the problem started; someone ran a command first. Like I mentioned before, this server contains both a minion and a master; the master manages around 40 minions in total. About half of them are currently connected. |
@furgerf thank you for the info. I'll review the related code when I have time. If you reproduce it again, any new information is very much appreciated. |
Any news on this? I was just bitten by it on a couple of minions, e.g.
|
Getting the same here in our setup, about 10 or so of our minions are getting this and filling up drive space rapidly. |
FYI - We have about 300 or so minions, so it's not global, and reinstalling did not fix the issue. |
Started getting this as well. Only a few minions, but it fills up the drive rather quickly causing other issues. |
@DmitryKuzmenko seems more people are reporting this, so I will increase the severity. Any idea as to when you can get to this issue? I know it's currently on your backlog due to higher-priority issues. |
Had some time today to work on this. I confirmed what I wrote above. Tomorrow I'll provide the description and solution. |
I have no code yet, but I promised to post an update, so here it is: the main problem is that the subscriber is a singleton shared by multiple event bus clients, so their reads collide. The simplest way to fix this is to make the … One more idea of mine is to split the singleton into a non-singleton handler class strongly referencing the singleton instance, with the singleton weakly referencing the handlers. This will allow the singleton to know who is waiting for the event data and pass data to all event objects. Give me some time to think about the best solution. I'll provide one soon. |
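A rough sketch of that idea (my own illustration of the description above, not the actual patch; the class names are made up): per-client handler objects hold a strong reference to the shared singleton, while the singleton keeps only weak references to the handlers and fans incoming data out to every handler that is still alive:

import asyncio
import weakref

class SingletonSubscriber:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._handlers = weakref.WeakSet()  # dead handlers drop out automatically
        return cls._instance

    def register(self, handler):
        self._handlers.add(handler)

    def dispatch(self, data):
        # deliver the same event to every live handler instead of
        # letting them compete for a single read
        for handler in list(self._handlers):
            handler.feed(data)

class Handler:
    """Non-singleton per-client object; strongly references the singleton."""
    def __init__(self):
        self.subscriber = SingletonSubscriber()
        self.queue = asyncio.Queue()
        self.subscriber.register(self)

    def feed(self, data):
        self.queue.put_nowait(data)

    async def read(self):
        return await self.queue.get()

async def main():
    a, b = Handler(), Handler()
    SingletonSubscriber().dispatch("event")  # one incoming message...
    print(await a.read(), await b.read())    # ...delivered to both clients

asyncio.run(main())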
+1 here, we're seeing it on a selection of CentOS7 systems as well. Interestingly the most recent was also running cpanel.
|
Updating kernel to latest release didn't solve the problem. |
@DejanKolar the PR is not merged yet. I'm waiting for the CI system fix to re-run the tests. I expect this to be done in a couple of days. |
I am also facing the same issue after upgrading to 2018.3.3 |
I observed the same issue after upgrading from salt-2018.3.0-1.el7 to salt-2019.2.0-1.el7 today. Our monitoring noticed /var was almost full. Doing a "systemctl stop salt-minion.service" did not work. I had to do a "kill -9" on two remaining orphaned salt-minion daemons to get the error messages in /var/log/salt/minion to stop. |
Also, for what it's worth... |
I'm still seeing this issue on a large number of servers. Any idea when the fix will be available? |
Perhaps @DmitryKuzmenko can update everyone on which release this will get into |
I've added the following on cPanel servers, which appears to have addressed this:
cpanel_local:
  file.directory:
    - name: /etc/cpanel/local
    - makedirs: True

cpanel_ignored:
  file.managed:
    - name: /etc/cpanel/local/ignore_outdated_services
    - contents:
      - salt-minion |
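If I read that state right, it just ensures /etc/cpanel/local exists and writes an ignore_outdated_services file listing salt-minion; presumably that keeps cPanel's outdated-services handling from restarting the minion. Assuming it's saved as a state file (e.g. a hypothetical cpanel.sls), it can be applied with the usual state.apply on the affected minions.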
Currently the fix is in all branches (2018.3, 2019.2, and develop). It will be released in the upcoming builds. |
Could you reference the PR or commit by any chance? Got bitten by this today and it'd be interesting what the fix looks like. Thanks! |
@anitakrueger it was a long road, but it's finally done with #52445 and #52564 |
Awesome, much appreciated! |
We also encountered this late last evening; roughly 2-4% of our 250 minions were affected. Any idea when the fix will hit http://repo.saltstack.com/? |
@elipsion in June I think. |
ZD-3869 |
Had the same issue running salt 2018.3.4 on CentOS 7. It kept writing to the minion log every millisecond and caused one of my hosts to run out of disk space. I think the reason was multiple salt processes running at the same time; you can see PID 1114 has a strange "1-20:07.." in the timespan column, maybe a zombie?
Killing all salt processes and restarting the minion fixed the issue. Luckily this was a test box; if it had been production and this had caused a disk space issue, it would have been a much bigger problem. Not sure how to reproduce it again. |
Since this ticket is still "open", I assume the fix is not released yet, right? Or could you tell us which version to update to, please? |
As far as I can see, the fix for this is in the codebase and was released with 2019.2.1. |
Description of Issue/Question
Hi, on a server running salt-minion and salt-master, the minion's logfile was filled today (quite quickly) with the following message:
Stopping the minion service didn't stop the process from writing to the logfile, nor did SIGTERM; I only got rid of it with SIGKILL (PID 25008 here):
Setup
I don't know what's relevant here...
Steps to Reproduce Issue
Don't know either :/
Versions Report
Same version of minion, master, and common.