Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] 3007 master 4505 publish port failed to respond to salt commands unless restart #66282

Open
5 tasks
tjyang opened this issue Mar 27, 2024 · 21 comments
Open
5 tasks
Assignees
Labels
Bug broken, incorrect, or confusing behavior needs-triage

Comments

@tjyang
Copy link
Contributor

tjyang commented Mar 27, 2024

Description

3007 master not responding salt commands unless restarting the master. But it does answer salt-call ran from minion
The timeout issue does not have log in /var/log/salt/master file with log file level set to debug.
Currently trying the trace level.

log_level_logfile: trace

Setup
(Please provide relevant configs and/or SLS files (be sure to remove sensitive info. There is no general set-up of Salt.)

Please be as specific as possible and give set-up details.

  • [ x] on-prem machine
  • [ x] VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • classic packaging
  • [ x] onedir packaging
  • [ x] used bootstrap to install

Steps to Reproduce the behavior
$sudo salt minion-on-saltmaster test.ping

Expected behavior
salt-master should answer the test.ping command quickly. In stead,it timeout.

Screenshots

  • cmd line
[me@salt01 ~]$ sudo salt salt01  test.ping
salt01:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240328074306181473


-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 0
# of minions that did not return: 1
# of minions with errors: 0
-------------------------------------------
ERROR: Minions returned with non-zero exit code
[me@salt01 ~]$

  • and this is trace log in /etc/salt/master when a minion not responding.
2024-03-28 02:36:29,940 [salt.loader.lazy ][DEBUG   ] The functions from module 'no_return' are being loaded by dir() on the loaded module
2024-03-28 02:36:29,940 [salt.utils.lazy  ][DEBUG   ] LazyLoaded no_return.output
2024-03-28 02:36:29,941 [salt.output      ][TRACE   ] data = {'salt01': 'Minion did not return. [No response]\nThe minions may not have all finished running and any remaining minions will return upon completion. 
To look up the return data for this job later, run the following command:\n\nsalt-run jobs.lookup_jid 20240328073609814061'}
  • salt-master status
Mar 27 13:09:15 salt01 salt-master[2162920]: [ERROR   ] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None
Mar 27 14:00:04 salt01 salt-master[2163029]: [ERROR   ] Requested method not exposed: _ext_nodes
Mar 27 15:00:04 isalt01 salt-master[2163274]: [ERROR   ] Requested method not exposed: _ext_nodes
<snipped>
Mar 28 03:00:04 salt01 salt-master[2163147]: [ERROR   ] Requested method not exposed: _ext_nodes
Mar 28 04:00:07 salt01 salt-master[2163282]: [ERROR   ] Requested method not exposed: _ext_nodes

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3007.0

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: unknown
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: rocky 8.9 Green Obsidian
        locale: utf-8
       machine: x86_64
       release: 4.18.0-513.18.1.el8_9.x86_64
        system: Linux
       version: Rocky Linux 8.9 Green Obsidian

Additional context
Add any other context about the problem here.

@tjyang tjyang added Bug broken, incorrect, or confusing behavior needs-triage labels Mar 27, 2024
@tjyang tjyang changed the title [BUG] 3007 master not responding salt commands unless restart [BUG] 3007 master not responding to salt commands unless restart Mar 27, 2024
@tjyang
Copy link
Contributor Author

tjyang commented Mar 28, 2024

@KiyoIchikawa, do you still have this issue ?

@tjyang
Copy link
Contributor Author

tjyang commented Mar 28, 2024

  • So push out command doesn't work unless rebooted while answering minion call is fine.
    Following are two logs showing the master vs client action.
/var/log/salt/master log with trace level enabled: after running sudo salt salt01 test.ping command it just time out on itsef , the master
2024-03-28 08:33:40,794 [salt.utils.event ][TRACE   ] _get_event() waited 0 seconds and received nothing
2024-03-28 08:33:40,804 [salt.utils.event ][TRACE   ] Get event. tag: (salt/job|syndic/.*)/20240328133320851221
<snipped>
2024-03-28 08:33:40,886 [salt.utils.event ][TRACE   ] Get event. tag: salt/job/20240328133325868511
2024-03-28 08:33:40,886 [salt.utils.event ][TRACE   ] _get_event() waited 0 seconds and received nothing
2024-03-28 08:33:40,887 [salt.client      ][DEBUG   ] return event: {'salt01': {'failed': True}}
2024-03-28 08:33:40,892 [salt.loader.lazy ][DEBUG   ] The functions from module 'localfs' are being loaded by dir() on the loaded module
2024-03-28 08:33:40,893 [salt.utils.lazy  ][DEBUG   ] LazyLoaded localfs.init_kwargs
2024-03-28 08:33:40,980 [salt.loader.lazy ][DEBUG   ] The functions from module 'localfs' are being loaded by dir() on the loaded module
2024-03-28 08:33:40,980 [salt.utils.lazy  ][DEBUG   ] LazyLoaded localfs.init_kwargs
2024-03-28 08:33:40,982 [salt.loader.lazy ][DEBUG   ] The functions from module 'no_return' are being loaded by dir() on the loaded module
2024-03-28 08:33:40,982 [salt.utils.lazy  ][DEBUG   ] LazyLoaded no_return.output
2024-03-28 08:33:40,982 [salt.output      ][TRACE   ] data = {'salt01': 'Minion did not return. [No response]\nThe minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:\n\nsalt-run jobs.lookup_jid 20240328133320851221'}
2024-03-28 08:33:40,982 [salt.transport.tcp][DEBUG   ] tcp stream to  closed, unable to recv
2024-03-28 08:33:41,973 [salt.utils.process][TRACE   ] Process manager iteration
2024-03-28 08:33:43,754 [salt.utils.process][TRACE   ] Process manager iteration
2024-03-28 08:33:51,983 [salt.utils.process][TRACE   ] Process manager iteration
2024-03-28 08:33:53,765 [salt.utils.process][TRACE   ] Process manager iteratio

/var/log/salt/master log with trace level enabled: after running sudo salt-call test.ping command. salt master is answering querying. ``` 2024-03-28 08:43:33,533 [salt.crypt ][DEBUG ] salt.crypt.get_rsa_key: Loading private key 2024-03-28 08:43:33,533 [salt.crypt ][DEBUG ] salt.crypt.sign_message: Signing message. 2024-03-28 08:43:33,533 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'salt/auth\n\n��resultãact�accept�id�salt01�pub�\x01�-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9xxxxx-masked--Z7HIgb+x+dyYNDe\nb3RFpYx7QFxzHmShDutLMyjfAR26npS8fsN1+nHcoXTkHC6kE2OjC0yc/rp5Crr6\nIQIDAQAB\n-----END PUBLIC KEY-----�_stamp�2024-03-28T13:43:33.532394' 2024-03-28 08:43:33,533 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload 2024-03-28 08:43:33,533 [salt.master ][TRACE ] Ignore tag salt/auth 2024-03-28 08:43:33,581 [salt.master ][TRACE ] AES payload received with command _pillar 2024-03-28 08:43:33,581 [salt.pillar ][DEBUG ] Determining pillar cache 2024-03-28 08:43:33,600 [salt.loader.lazy ][DEBUG ] The functions from module 'roots' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,600 [salt.utils.lazy ][DEBUG ] LazyLoaded roots.envs 2024-03-28 08:43:33,602 [salt.loader.lazy ][DEBUG ] The functions from module 's3fs' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,603 [salt.utils.lazy ][DEBUG ] Could not LazyLoad roots.init: 'roots.init' is not available. 2024-03-28 08:43:33,623 [salt.loader.lazy ][DEBUG ] The functions from module 'jinja' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,623 [salt.utils.lazy ][DEBUG ] LazyLoaded jinja.render 2024-03-28 08:43:33,624 [salt.loader.lazy ][DEBUG ] The functions from module 'yaml' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,624 [salt.utils.lazy ][DEBUG ] LazyLoaded yaml.render 2024-03-28 08:43:33,631 [salt.master ][TRACE ] Master function call _pillar took 0.05068349838256836 seconds 2024-03-28 08:43:33,632 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'minion/refresh/salt01\n\n��Minion data cache refresh�salt01�_stamp�2024-03-28T13:43:33.630948' 2024-03-28 08:43:33,632 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload 2024-03-28 08:43:33,632 [salt.master ][TRACE ] Ignore tag minion/refresh/salt01 2024-03-28 08:43:33,785 [salt.master ][TRACE ] AES payload received with command _return 2024-03-28 08:43:33,787 [salt.loaded.int.returner.local_cache][DEBUG ] Adding minions for job 20240328134333786284: ['salt01'] 2024-03-28 08:43:33,788 [salt.utils.job ][INFO ] Got return from salt01 for job 20240328134333786284 2024-03-28 08:43:33,789 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'salt/job/20240328134333786284/ret/salt01\n\n��cmd�_return�id�salt01�jid�20240328134333786284�returnçretcode\x00�fun�test.ping�fun_args��arg��tgt_type�glob�tgt�salt01�_stamp�2024-03-28T13:43:33.788313' 2024-03-28 08:43:33,789 [salt.loaded.int.returner.local_cache][DEBUG ] Reading minion list from /var/cache/salt/master/jobs/1b/c46d64efe0c85f68fb92d7c9137b70db8237f991f39268f4018c423a3aa009/.minions.p 2024-03-28 08:43:33,789 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload 2024-03-28 08:43:33,789 [salt.master ][TRACE ] Ignore tag salt/job/20240328134333786284/ret/salt01 2024-03-28 08:43:33,789 [salt.master ][TRACE ] Master function call _return took 0.003786802291870117 seconds 2024-03-28 08:43:34,247 [salt.utils.process][TRACE ] Process manager iteration
</details>

@KiyoIchikawa
Copy link

We tested on Windows minions and Linux minions. The Windows minions seem to be working better, but they both seem to be getting the following error(s) while applying a state.test.

Linux Minion (OEL 8)

2024-03-28 13:32:25,967 [salt.utils.extmods:155 ][ERROR   ][1375064] Failed to sync modules module: File client timed out after 60 seconds

Windows (Windows Server 2016 Datacenter)

2024-03-28 12:50:34,269 [salt.loaded.int.beacons.adb:23  ][ERROR   ][14288] Unable to load adb beacon: adb is missing.
2024-03-28 12:50:34,285 [salt.loaded.int.beacons.aix_account:23  ][ERROR   ][14288] Unable to load aix_account beacon: Only available on AIX systems.
2024-03-28 12:50:34,332 [salt.loaded.int.beacons.avahi_announce:62  ][ERROR   ][14288] Unable to load avahi_announce beacon: The 'python-avahi' dependency is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.bonjour_announce:31  ][ERROR   ][14288] Unable to load bonjour_announce beacon: pybonjour library is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.btmp:134 ][ERROR   ][14288] Unable to load btmp beacon: /var/log/btmp does not exist.
2024-03-28 12:50:34,519 [salt.loaded.int.beacons.glxinfo:23  ][ERROR   ][14288] Unable to load glxinfo beacon: glxinfo is missing.
2024-03-28 12:50:37,862 [salt.loaded.int.beacons.haproxy:24  ][ERROR   ][14288] Unable to load haproxy beacon: haproxy.get_sessions is missing.
2024-03-28 12:50:37,877 [salt.loaded.int.beacons.inotify:48  ][ERROR   ][14288] Unable to load inotify beacon: pyinotify library is missing.
2024-03-28 12:50:37,904 [salt.loaded.int.beacons.journald:26  ][ERROR   ][14288] Unable to load journald beacon: systemd library is missing.
2024-03-28 12:50:37,921 [salt.loaded.int.beacons.load:20  ][ERROR   ][14288] Unable to load load beacon: Not available for Windows systems.
2024-03-28 12:50:37,931 [salt.loaded.int.beacons.log_beacon:37  ][ERROR   ][14288] Unable to load log beacon: Not available for Windows systems or when regex library is missing.
2024-03-28 12:50:38,016 [salt.loaded.int.beacons.network_settings:86  ][ERROR   ][14288] Unable to load network_settings beacon: pyroute2 library is missing
2024-03-28 12:50:38,079 [salt.loaded.int.beacons.sensehat:26  ][ERROR   ][14288] Unable to load sensehat beacon: sensehat.get_pressure is missing.
2024-03-28 12:50:38,094 [salt.loaded.int.beacons.sh:24  ][ERROR   ][14288] Unable to load sh beacon: strace is missing.
2024-03-28 12:50:38,141 [salt.loaded.int.beacons.smartos_imgadm:42  ][ERROR   ][14288] Unable to load imgadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,173 [salt.loaded.int.beacons.smartos_vmadm:42  ][ERROR   ][14288] Unable to load vmadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,204 [salt.loaded.int.beacons.telegram_bot_msg:30  ][ERROR   ][14288] Unable to load telegram_bot_msg beacon: telegram library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.twilio_txt_msg:31  ][ERROR   ][14288] Unable to load twilio_txt_msg beacon: twilio library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.watchdog:77  ][ERROR   ][14288] Unable to load watchdog beacon: watchdog library is missing.
2024-03-28 12:50:38,251 [salt.loaded.int.beacons.wtmp:163 ][ERROR   ][14288] Unable to load wtmp beacon: /var/log/wtmp does not exist.
2024-03-28 12:51:37,367 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync states module: File client timed out after 60 seconds
2024-03-28 12:53:55,975 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync output module: message authentication failed
2024-03-28 12:55:05,277 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync proxy module: File client timed out after 60 seconds
2024-03-28 12:56:14,469 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync serializers module: File client timed out after 60 seconds

The Linux minion keeps getting timeouts, I have not increased the log-level on the master yet.

@tjyang
Copy link
Contributor Author

tjyang commented Mar 28, 2024

@KiyoIchikawa , thanks for the update. your case above doesn't fit the issue I am facing here. When salt-master needed restart, it can't even issue command itself(sudo salt saltmaster01 test.ping" . And yet "saltmaster01$sudo salt-call test.ping" works.

@tjyang tjyang changed the title [BUG] 3007 master not responding to salt commands unless restart [BUG] 3007 master 4505 publish port failed to respond to salt commands unless restart Apr 5, 2024
@tjyang
Copy link
Contributor Author

tjyang commented Apr 5, 2024

Here is more info when salt commands can't be sent out.

  • port 4505 failed the simple ncat port check.
[me@salt01 ~]$ date ;nc  -zv  salt01 4505;nc  -zv  salt01 4506
Fri Apr  5 11:09:23 CDT 2024
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 100.65.184.160:4506.
Ncat: 0 bytes sent, 0 bytes received in 0.02 seconds.
[me@salt01 ~]$
  • result of netstat -antulp |egrep 4505 salt01 salt-master
# ip address masked
tcp     1001      0 0.0.0.0:4505            0.0.0.0:*               LISTEN      2443253/salt-master
tcp        0      0 1.6.1.0:4505     1.4.43.33:40224      ESTABLISHED 2443253/salt-master
tcp        0      0 1.6.1.0:4505     1.4.116.28:38806     ESTABLISHED 2443253/salt-master
<snipped>
tcp       11      0 1.6.1.0:4505     1.4.81.5:60514       CLOSE_WAIT  -
tcp       11      0 1.6.1.0:4505     1.4.81.5:40492       CLOSE_WAIT  -
tcp        0      0 1.6.1.0:4505     1.4.44.3:52474       ESTABLISHED 2443253/salt-master
tcp        0      0 1.6.1.0:4505     1.5.176.67:43538     ESTABLISHED 2443253/salt-master
tcp       11      0 1.6.1.0:4505     1.4.81.5:49436       CLOSE_WAIT  -
tcp        0      0 1.6.1.0:4505     1.4.226.19:36712     ESTABLISHED 2443253/salt-master
tcp        0      0 1.6.1.0:4505     1.5.178.42:35596     ESTABLISHED 2443253/salt-master
tcp        0      0 1.6.1.0:4505     1.4.11.76:48254      ESTABLISHED 2443253/salt-master
tcp        0      0 1.6.1.0:4505     1.4.29.8:40660       ESTABLISHED 2443253/salt-master

@compi-tom
Copy link

compi-tom commented Apr 12, 2024

Small comment, we're facing the exact same issue since upgrading from 3006.7.
At some point the master stops working and cannot send anything to the minions.

Restarting it doesn't fix anything as it cannot bind to its port:

[salt.transport.tcp:1405][ERROR ][1609] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None

EDIT:

  • OS: Ubuntu 22.04
  • Machine type: baremetal
  • Bootstrap used: initially yes, but upgrade from 3006.x to 3007

@tjyang
Copy link
Contributor Author

tjyang commented Apr 12, 2024

@compi-tom,

  • I was in you current state running on RockyLinux 8 for past few weeks. For now I would suggest restart salt-master when zeromq pub 4505 resourced depleted or put a cronjob to restart daily.
  • I was on a VMWare VM with luxury of taking snapshots for rolling back.
  • I tried many things and ended up with an instance that can't even send out test.ping from salt-master to itself after a fresh reboot.
  • In the end, I had to clone another running 3007.0( a two node multi-masters setup) that is more stable that only has this issue occasionally.

@compi-tom
Copy link

We're already running a 3 nodes multi-masters.
Restarting the master service doesn't solve the issue as it cannot bind to the 4505 port ==> reboot.

I'm considering downgrading to 3006.7

@tjyang
Copy link
Contributor Author

tjyang commented Apr 15, 2024

I think I tried downgrade to 3006.7 once and it didn't help this issue so I decided use 3007.0 again. if downloading to older version approach is adopted the minion need to be downgraded also.
Interesting thing is that good 3007 salt-master(one that got cloned) now has same port 4505 malfunction issue. a salt-master stop/start bring up the port 4505.
They are many setting in /etc/salt/master config file. it is time for me to study and understand the meaning zeromq settings.

@tjyang
Copy link
Contributor Author

tjyang commented Apr 16, 2024

I am following this PR: #66335 , not really know the details. but this issue looks like a publish port issue that 66335 trying to fix.

@tjyang
Copy link
Contributor Author

tjyang commented Apr 24, 2024

  • EventPublisher process use 1.6% of 16G, this is the time port 4505 failed.
[me@salt01 srv]$ egrep '286129|MEM' salt-2024-04-24-06\:56.txt
    PID  STARTED     TIME %MEM   RSS    SZ    VSZ  SIZE
 286129 01:00:02 00:00:01  1.6 273528 185245 740980 273156
286129 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
[me@salt01 srv]$
  • so bad that even port 4505 not responding to port poking by ncat.
[me@salt01 srv]$ nc -zv ilclsalt01 4505
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
[me@salt01 srv]$

@tjyang
Copy link
Contributor Author

tjyang commented Apr 25, 2024

  • This is my effort using makefile to automate the tracking of EventPublish memory usage when it stop working.
    ie , putting "make memory" into hourly cronjob to record the memory usage for salt-master sub-process.
[me@salt01 srv]$ egrep 'DATEH|memory|pgrep' Makefile
DATEH           = $(shell date +"%Y-%m-%d-%R")
memory:
        ps  -o pid,start,time,%mem,rss,sz,vsz,size --sort -rss -p $$(pgrep -f salt-master) | head > salt-$(DATEH).txt
        pgrep -af salt-master |grep -v grep >> salt-$(DATEH).txt
[me@salt01 srv]$
  • sample of memory usage on a 16G setting when EventPublisher stopped working when it reached 1.6% of memory.
[me@salt01 srv]$ egrep '1\.6|MEM' salt-2024-04-25-06\:35.txt
    PID  STARTED     TIME %MEM   RSS    SZ    VSZ  SIZE
 896442 01:00:02 00:00:01  1.6 263128 182305 729220 261396  /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
[me@salt01 srv]$

  • This is memory footprints after salt-master got restarted after 2 hours.
[me@salt01 srv]$ cat salt-2024-04-25-09\:06.txt
    PID  STARTED     TIME %MEM   RSS    SZ    VSZ  SIZE
1083486 06:39:33 00:00:01  0.8 131836 277801 1111204 213052
1083479 06:39:33 00:00:02  0.7 118252 165078 660312 128200
1083489 06:39:33 00:00:17  0.5 82504 212527 850108 124024
1083480 06:39:33 00:00:01  0.5 81732 137755 551020 83196
1083501 06:39:33 00:00:19  0.5 81552 212266 849064 122900
1083490 06:39:33 00:00:17  0.5 81484 212239 848956 122908
1083503 06:39:33 00:00:19  0.5 81184 211934 847736 121652
1083505 06:39:33 00:00:18  0.5 81148 212192 848768 122640
1083492 06:39:33 00:00:18  0.5 80784 211915 847660 121340
1083287 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master MainProcess
1083479 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master PubServerChannel._publish_daemon
1083480 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
1083483 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventMonitor
1083485 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer ReqServer_ProcessManager
1083486 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue
1083488 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-0
1083489 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-1
1083490 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-2
1083492 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-3
1083501 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-4
1083503 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-5
1083504 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-6
1083505 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-7
1125202 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master FileserverUpdate
1125292 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Maintenance
[me@salt01 srv]$

@tjyang
Copy link
Contributor Author

tjyang commented Apr 28, 2024

my 2nd salt-master failed to test.ping itself, restart salt-master doesn't work anymore.
I had to reboot the RL8 OS. this definitely smell like kernel network resource depletion.

  • OS restart log below.
[me@salt02 ~]$ sudo salt salt02  test.ping
salt02:
    True


-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 1
# of minions that did not return: 0
# of minions with errors: 0
-------------------------------------------
[me@salt02 ~]$ uptime
 06:11:19 up 4 min,  2 users,  load average: 0.66, 0.29, 0.12
[me@salt02 ~]$

[me@salt02 ~]$ sudo egrep  "^pub_hwm|^ipc_write_buffer|^salt_event_pub_hwm|^event_publisher_pub_hwm" /etc/salt/master
pub_hwm: 10000
ipc_write_buffer: 'dynamic'
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000
[me@salt02 ~]$ 

@tjyang
Copy link
Contributor Author

tjyang commented May 5, 2024

After downgrade from 3007 to 3006.8 version, my salt-master servers, the 4505 port is now stable and able to send out salt commands.

@compi-tom
Copy link

compi-tom commented May 10, 2024

After downgrade from 3007 to 3006.8 version, my salt-master servers, the 4505 port is now stable and able to send out salt commands.

I've done the same as this error was preventing our app to automatically add VMs to cope with the load. Also, I didn't want to set a daily restart to "bypass" the issue.

@tjyang
Copy link
Contributor Author

tjyang commented May 10, 2024

@compi-tom, Thanks for the confirmation. Hopeful when 3007 turned into LTS from STS. This issue can be acked by core team and resolved.

@amalaguti
Copy link

Any update on this issue ? I had two servers that were working as multimasters nodes thru multiple 3006 versions, just upgraded to 3007.1 and I get this problem in one of the servers only

[DEBUG   ] PubClient connecting to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG   ] PubClient conencted to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG   ] Subscriber at  connected
[DEBUG   ] tcp stream to  closed, unable to recv

@JonasKrausch
Copy link

We also had to downgrade our master to 3006.8 to resolve this issue.
3007.0 and 3007.1 did not work and we had to restart salt-master multiple times every day.

No logs are written with Loglevel WARNING. Master just refuses to connect to any minion (even itself). Salt-Call works without issues.

@gotsmth
Copy link

gotsmth commented Aug 27, 2024

Hello,

Today i made update from latest 3005 to 3007.1 and i got a lot of messages in master log file:
`2024-08-25 10:10:10,192 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:10,192 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected

2024-08-25 10:10:10,193 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:10,193 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected

2024-08-25 10:10:15,199 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:15,199 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected

2024-08-25 10:10:15,200 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:15,201 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
`

It seems 3007.1 is not stable and we cannot use it because communication with minions does not work (we have about 2500 minions connected to the master)
After rollback to latest 3006 all works okay.

Just for information that problem still exists.

Any fix around?

@afletch
Copy link
Contributor

afletch commented Aug 27, 2024

See also #66288 and potentially #66715
There are real issues on 3007+ with minion communications and very little in terms of response to all these.

@tjyang
Copy link
Contributor Author

tjyang commented Nov 23, 2024

#65265

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior needs-triage
Projects
None yet
Development

No branches or pull requests

8 participants