Salt Tornado API: salt.exceptions.SaltReqTimeoutError: Message timed out #53147

Open
zbarr opened this issue May 21, 2019 · 23 comments
Labels: Bug (broken, incorrect, or confusing behavior), Core (relates to code central or existential to Salt), Grains, Salt-API, severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a workaround)

zbarr commented May 21, 2019

Description of Issue/Question

We just started using the Salt API a few months ago and we have not been able to fix this issue. Currently, we are using Pepper to make the requests. I have created an "async" wrapper around it which runs a local_async command, polls with the jobs.exit_success runner for 30 seconds to a minute, then runs lookup_jid to get the results. Currently, the only thing we use this for is to retrieve all of the grains from the minions and store them in a database (just running grains.items in non-batch mode). The output below does not contain the Salt API logs from the successful local_async command and the successful exit_success runners; it only contains the request that eventually fails.
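
For context, a minimal sketch of that wrapper flow (the URL, credentials, and polling interval are illustrative assumptions, not the exact code; it uses Pepper's generic low-state interface):

# Illustrative sketch only, not the exact wrapper described above.
import time
from pepper import Pepper

api = Pepper('https://saltmaster.example.com:8000')  # assumed API URL
api.login('apiuser', 'secret', 'pam')                # placeholder credentials

# 1. Publish the job asynchronously and capture the jid.
pub = api.low([{'client': 'local_async', 'tgt': '*', 'fun': 'grains.items'}])
jid = pub['return'][0]['jid']

# 2. Poll the jobs.exit_success runner for roughly 30-60 seconds.
for _ in range(12):
    time.sleep(5)
    done = api.low([{'client': 'runner', 'fun': 'jobs.exit_success', 'jid': jid}])
    status = done['return'][0]
    if status and all(status.values()):
        break

# 3. Fetch the cached results of the job from the job cache.
results = api.low([{'client': 'runner', 'fun': 'jobs.lookup_jid', 'jid': jid}])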

When using the tornado API, about 2/3 of my "jobs" fail with the following message (trace logging on):

2019-05-16 06:08:28,354 [salt.utils.lazy  :104 ][DEBUG   ][148120] LazyLoaded jobs.exit_success
2019-05-16 06:08:28,354 [salt.transport.zeromq:136 ][DEBUG   ][148120] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'ch0pslop022_master', 'tcp://127.0.0.1:4506', 'clear')
2019-05-16 06:08:28,354 [salt.transport.zeromq:202 ][DEBUG   ][148120] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
2019-05-16 06:08:28,354 [salt.transport.zeromq:1084][DEBUG   ][148120] Trying to connect to: tcp://127.0.0.1:4506
2019-05-16 06:09:28,391 [salt.transport.zeromq:1139][DEBUG   ][148120] SaltReqTimeoutError, retrying. (1/3)
2019-05-16 06:10:28,451 [salt.transport.zeromq:1139][DEBUG   ][148120] SaltReqTimeoutError, retrying. (2/3)
2019-05-16 06:11:28,509 [salt.transport.zeromq:1139][DEBUG   ][148120] SaltReqTimeoutError, retrying. (3/3)
2019-05-16 06:12:28,569 [salt.transport.zeromq:1109][DEBUG   ][148120] Re-init ZMQ socket: Message timed out
2019-05-16 06:12:28,570 [salt.transport.zeromq:1084][DEBUG   ][148120] Trying to connect to: tcp://127.0.0.1:4506
2019-05-16 06:12:28,570 [salt.loaded.int.netapi.rest_tornado.saltnado:926 ][ERROR   ][148120] Unexpected exception while handling request:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/netapi/rest_tornado/saltnado.py", line 919, in disbatch
    chunk_ret = yield getattr(self, '_disbatch_{0}'.format(low['client']))(low)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python3/dist-packages/salt/netapi/rest_tornado/saltnado.py", line 1114, in _disbatch_runner
    pub_data = self.saltclients['runner'](chunk)
  File "/usr/lib/python3/dist-packages/salt/runner.py", line 126, in cmd_async
    return mixins.AsyncClientMixin.cmd_async(self, reformatted_low)
  File "/usr/lib/python3/dist-packages/salt/client/mixins.py", line 505, in cmd_async
    return self.master_call(**low)
  File "/usr/lib/python3/dist-packages/salt/client/mixins.py", line 140, in master_call
    ret = channel.send(load)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 64, in wrap
    ret = self._block_future(ret)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 74, in _block_future
    return future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 331, in send
    ret = yield self._uncrypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 320, in _uncrypted_transfer
    tries=tries,
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)
We have a three-master multi-master setup. The master this is running on is a 16-core, 128 GB RAM physical host (plenty of hardware).

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)
See the description of the issue. It may be complicated to reproduce, but I can hand over more code if needed.

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

[zbarr@saltmaster03: ~]$ salt --versions-report
Salt Version:
           Salt: 2018.3.4

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.5.0
       dateutil: 2.4.2
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.2 (default, Nov 23 2017, 16:37:01)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 17.1.2
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 16.04 xenial
         locale: UTF-8
        machine: x86_64
        release: 4.4.0-119-generic
         system: Linux
        version: Ubuntu 16.04 xenial

Please advise. Thanks!

twangboy added the Bug, Core, Grains, severity-medium, and Salt-API labels on May 21, 2019
twangboy added this to the Approved milestone on May 21, 2019
zbarr commented May 21, 2019

For further detail, we have just over 500 minions in our infrastructure.

zbarr commented May 23, 2019

This is ongoing. I've increased the worker threads from 64 to 128, and I've changed the API wrapper to use lookup_jid instead of exit_success to lighten the load on the Salt minions. I also recently received a longer and more detailed error:

2019-05-23 01:12:38,375 [tornado.access   :1908][INFO    ][14789] 200 POST / (192.168.70.99) 1306.37ms
2019-05-23 01:41:34,270 [salt.loaded.int.netapi.rest_tornado.saltnado:926 ][ERROR   ][14789] Unexpected exception while handling request:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/client/__init__.py", line 1964, in pub_async
    payload = yield channel.send(payload_kwargs, timeout=timeout)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 331, in send
    ret = yield self._uncrypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 320, in _uncrypted_transfer
    tries=tries,
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/client/__init__.py", line 416, in run_job_async
    **kwargs)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/client/__init__.py", line 1967, in pub_async
    'Salt request timed out. The master is not responding. You '
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/netapi/rest_tornado/saltnado.py", line 919, in disbatch
    chunk_ret = yield getattr(self, '_disbatch_{0}'.format(low['client']))(low)
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/netapi/rest_tornado/saltnado.py", line 1104, in _disbatch_local_async
    pub_data = yield self.saltclients['local_async'](*f_call.get('args', ()), **f_call.get('kwargs', {}))
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3/dist-packages/salt/client/__init__.py", line 428, in run_job_async
    raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

twangboy (Contributor) commented:

@dwoz Any ideas here?

zbarr commented May 28, 2019

@twangboy @dwoz any help would be appreciated. We have tried pretty much everything we can think of. At this point I am questioning whether it is a load issue, although I don't see any reason this would be too heavy a load for the hardware we have.

mattp- commented Jun 20, 2019

@zbarr what config are you running? Raising sock_pool_size can mitigate this, among other things, and the default of 1 is likely what you're running into.

zbarr commented Jun 20, 2019

Trying that now. Based on the documentation, this may be exactly what I need:

sock_pool_size
Default: 1

To avoid blocking while writing data to a socket, Salt supports a socket pool. For example, a job with a large list of target hosts can block for a long period while writing. The option is used by the ZMQ and TCP transports; the other transport methods do not need a socket pool by definition. Most Salt tools, including the CLI, are fine with a single socket in the pool. On the other hand, it is highly recommended to set the socket pool size larger than 1 for other Salt applications, especially the Salt API, which must write data to sockets concurrently.

sock_pool_size: 15

However, if this works, I am pretty disappointed that it isn't mentioned in the rest_tornado documentation. Given how unreliable my setup was, this should be called out as a requirement for running the Salt API.

harishkapri commented Jan 3, 2020

Hi, recently we too have been experiencing similar issues in our production single-master environment. I have exhausted all the resources I could find and no resolution is documented anywhere.
Any help or guidance is appreciated, as this is our production environment.
Steps performed to address the issue:

  1. Rebooted the salt master
  2. Restarted the salt master service using service salt-master stop && service salt-master start
  3. Confirmed there is no significant load on the salt-master
  4. Tried reinstalling the same version of the salt master, i.e. 2015.8.7
  5. Also cleared the salt-master cache, to no avail.

salt-master server settings

cat /etc/salt/master
# Specify the location of the daemon process ID file
pidfile: /var/run/salt/salt-master.pid

# The level of messages to send to the console.
# One of 'garbage', 'trace', 'debug', info', 'warning', 'error', 'critical'.
#log_level: warning

# The level of messages to send to the log file.
# One of 'garbage', 'trace', 'debug', info', 'warning', 'error', 'critical'.
log_level_logfile: info

file_roots:
  base:
    - /root/devops/salt
    - /root/devops/lib
    - /root/devops/configs
    - /root/ssl
    # Below here are configs not stored in git
    - /root/environment/configs

pillar_roots:
  base:
    - /root/devops/pillar

    # Below here are configs not stored in git
    - /root/environment/pillar

timeout: 120
ping_on_rotate: True
show_timeout: True
gather_job_timeout: 5
worker_threads: 164
sock_pool_size: 15
root@prod-salt:~# salt --versions-report
Salt Version:
           Salt: 2015.8.7

Dependency Versions:
         Jinja2: 2.10.3
       M2Crypto: 0.21.1
           Mako: Not Installed
         PyYAML: 3.13
          PyZMQ: 14.5.0
         Python: 2.7.3 (default, Jun 22 2015, 19:33:41)
           RAET: Not Installed
        Tornado: 4.3
            ZMQ: 4.0.5
           cffi: 0.8.6
       cherrypy: Not Installed
       dateutil: 2.7.5
          gitdb: Not Installed
      gitpython: 0.1.6
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.1.10
   mysql-python: 1.2.3
      pycparser: 2.19
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 12.04 precise
        machine: x86_64
        release: 3.2.0-77-virtual
         system: Ubuntu 12.04 precise

- Error message from the salt-master

2019-12-26 07:03:26,480 [salt.transport.mixins.auth][INFO    ][3884] Authentication accepted from prd-0773f92cc92e9b4d2
2019-12-26 07:03:28,652 [salt.utils.job   ][INFO    ][2187] Got return from prd-0773f92cc92e9b4d2 for job 20191226070328648332
2019-12-26 07:03:29,725 [salt.transport.mixins.auth][INFO    ][3887] Authentication request from prd-0773f92cc92e9b4d2
2019-12-26 07:03:29,729 [salt.transport.mixins.auth][INFO    ][3887] Authentication accepted from prd-0773f92cc92e9b4d2
2019-12-26 07:03:36,566 [salt.master      ][ERROR   ][3906] Error in function _mine:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/master.py", line 1379, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.7/dist-packages/salt/master.py", line 1007, in _mine
    load = self.__verify_load(load, ('id', 'data', 'tok'))
  File "/usr/lib/python2.7/dist-packages/salt/master.py", line 924, in __verify_load
    if not self.__verify_minion(load['id'], load['tok']):
  File "/usr/lib/python2.7/dist-packages/salt/master.py", line 824, in __verify_minion
    with salt.utils.fopen(pub_path, 'r') as fp_:
  File "/usr/lib/python2.7/dist-packages/salt/utils/__init__.py", line 1204, in fopen
    fhandle = open(*args, **kwargs)
IOError: [Errno 2] No such file or directory: '/etc/salt/pki/master/minions/prd-d-g-081dc5ca9cc6352c4'

- Error message from the salt client node itself

Jan 02 18:09:44 [ERROR   ] An un-handled exception was caught by salt's global exception handler:
    Jan 02 18:09:44 SaltReqTimeoutError: Message timed out
    Jan 02 18:09:44 Traceback (most recent call last):
    Jan 02 18:09:44   File "/usr/bin/salt-call", line 11, in <module>
    Jan 02 18:09:44     salt_call()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 338, in salt_call
    Jan 02 18:09:44     client.run()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/cli/call.py", line 53, in run
    Jan 02 18:09:44     caller.run()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/cli/caller.py", line 134, in run
    Jan 02 18:09:44     ret = self.call()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/cli/caller.py", line 197, in call
    Jan 02 18:09:44     ret['return'] = func(*args, **kwargs)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 887, in sls
    Jan 02 18:09:44     high_, errors = st_.render_highstate({saltenv: mods})
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2981, in render_highstate
    Jan 02 18:09:44     sls, saltenv, mods, matches)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2651, in render_state
    Jan 02 18:09:44     state_data = self.client.get_state(sls, saltenv)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 460, in get_state
    Jan 02 18:09:44     dest = self.cache_file(path, saltenv, cachedir=cachedir)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 165, in cache_file
    Jan 02 18:09:44     return self.get_url(path, '', True, saltenv, cachedir=cachedir)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 560, in get_url
    Jan 02 18:09:44     url, dest, makedirs, saltenv, cachedir=cachedir)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 1046, in get_file
    Jan 02 18:09:44     hash_server = self.hash_file(path, saltenv)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 1298, in hash_file
    Jan 02 18:09:44     return self.channel.send(load)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/utils/async.py", line 74, in wrap
    Jan 02 18:09:44     ret = self._block_future(ret)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/utils/async.py", line 84, in _block_future
    Jan 02 18:09:44     return future.result()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:44     raise_exc_info(self._exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    Jan 02 18:09:44     yielded = self.gen.throw(*exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 232, in send
    Jan 02 18:09:44     ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    Jan 02 18:09:44     value = future.result()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:44     raise_exc_info(self._exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    Jan 02 18:09:44     yielded = self.gen.throw(*exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 200, in _crypted_transfer
    Jan 02 18:09:44     ret = yield _do_transfer()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    Jan 02 18:09:44     value = future.result()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:44     raise_exc_info(self._exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    Jan 02 18:09:44     yielded = self.gen.throw(*exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 186, in _do_transfer
    Jan 02 18:09:44     tries=tries,
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    Jan 02 18:09:44     value = future.result()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:44     raise_exc_info(self._exc_info)
    Jan 02 18:09:44   File "<string>", line 3, in raise_exc_info
    Jan 02 18:09:44 SaltReqTimeoutError: Message timed out
    Jan 02 18:09:44 Traceback (most recent call last):
    Jan 02 18:09:44   File "/usr/bin/salt-call", line 11, in <module>
    Jan 02 18:09:44     salt_call()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 338, in salt_call
    Jan 02 18:09:44     client.run()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/cli/call.py", line 53, in run
    Jan 02 18:09:44     caller.run()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/cli/caller.py", line 134, in run
    Jan 02 18:09:44     ret = self.call()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/cli/caller.py", line 197, in call
    Jan 02 18:09:44     ret['return'] = func(*args, **kwargs)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 887, in sls
    Jan 02 18:09:44     high_, errors = st_.render_highstate({saltenv: mods})
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2981, in render_highstate
    Jan 02 18:09:44     sls, saltenv, mods, matches)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2651, in render_state
    Jan 02 18:09:44     state_data = self.client.get_state(sls, saltenv)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 460, in get_state
    Jan 02 18:09:44     dest = self.cache_file(path, saltenv, cachedir=cachedir)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 165, in cache_file
    Jan 02 18:09:44     return self.get_url(path, '', True, saltenv, cachedir=cachedir)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 560, in get_url
    Jan 02 18:09:44     url, dest, makedirs, saltenv, cachedir=cachedir)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 1046, in get_file
    Jan 02 18:09:44     hash_server = self.hash_file(path, saltenv)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/fileclient.py", line 1298, in hash_file
    Jan 02 18:09:44     return self.channel.send(load)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/utils/async.py", line 74, in wrap
    Jan 02 18:09:44     ret = self._block_future(ret)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/utils/async.py", line 84, in _block_future
    Jan 02 18:09:44     return future.result()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:44     raise_exc_info(self._exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    Jan 02 18:09:44     yielded = self.gen.throw(*exc_info)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 232, in send
    Jan 02 18:09:44     ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    Jan 02 18:09:44     value = future.result()
    Jan 02 18:09:44   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:51     raise_exc_info(self._exc_info)
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    Jan 02 18:09:51     yielded = self.gen.throw(*exc_info)
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 200, in _crypted_transfer
    Jan 02 18:09:51     ret = yield _do_transfer()
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    Jan 02 18:09:51     value = future.result()
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:51     raise_exc_info(self._exc_info)
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    Jan 02 18:09:51     yielded = self.gen.throw(*exc_info)
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 186, in _do_transfer
    Jan 02 18:09:51     tries=tries,
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    Jan 02 18:09:51     value = future.result()
    Jan 02 18:09:51   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    Jan 02 18:09:51     raise_exc_info(self._exc_info)
    Jan 02 18:09:51   File "<string>", line 3, in raise_exc_info
    Jan 02 18:09:51 salt.exceptions.SaltReqTimeoutError: Message timed out

zbarr commented Jan 3, 2020

@harishkapri I think you may have a different issue than I had. It appears to me that your client is raising a timeout error because the master can't authenticate that minion.

mattp- commented Jan 16, 2020

@harishkapri that is both a very, very large worker_threads and also a very old version of Salt; I'd try changing both.
I got an email from @BugMing about an issue regarding grains refresh on a schedule receiving timeouts, but for some reason I can't see it here anymore; maybe it was deleted. Either way, you should use the minion data cache for that purpose, not a scheduled job that stores grains to a database, and increase worker_threads in general for handling large-scale jobs like that.
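
For illustration, a minimal sketch of pulling grains from the minion data cache on the master instead of re-running grains.items against every minion (this assumes the cache.grains runner and a populated minion data cache; the config path shown is the default):

# Illustrative sketch: read grains from the master-side minion data cache,
# so nothing is published to the minions at all.
import salt.config
import salt.runner

opts = salt.config.master_config('/etc/salt/master')
runner = salt.runner.RunnerClient(opts)

cached = runner.cmd('cache.grains', kwarg={'tgt': '*'})
for minion_id, grains in cached.items():
    print(minion_id, grains.get('os'))

The same data should be reachable through the API with a {'client': 'runner', 'fun': 'cache.grains', 'tgt': '*'} low chunk, or from the CLI with salt-run cache.grains tgt='*'.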

mattp- commented Jan 16, 2020

Ah, that's you @zbarr. Use the minion data cache, not what you're trying to do now.

mattp- commented Jan 16, 2020

@liu624862560 you'd want to increase your worker_threads, and probably batch the requests so that it doesn't all happen at once. But again, you can use the minion data cache to do what you're doing automatically instead of scheduling it yourself.
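
For illustration, a minimal batching sketch (run on the master; the 10% batch size and the '*' target are placeholder assumptions):

# Illustrative sketch: publish grains.items to rolling batches of minions
# instead of all of them at once, so the request/return load is spread out.
import salt.client

local = salt.client.LocalClient()
for batch_ret in local.cmd_batch('*', 'grains.items', batch='10%'):
    # Each iteration yields the returns from one batch of minions.
    print(batch_ret)

The CLI equivalent is salt --batch-size 10% '*' grains.items.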

harishkapri commented Jan 16, 2020

@mattp- I will go ahead and increase worker_threads, restart the salt-master service, and verify. I was under the impression that worker_threads always depended on the salt master version, so even if we increase it, I believe there is a limit imposed by the specific salt master version. Please correct me if I am wrong.
worker_threads: 262

mattp- commented Jan 16, 2020 via email

harishkapri commented Jan 17, 2020

We are running the salt master on an AWS EC2 instance of type m4.2xlarge.
The reason we have set it so high is that the salt master keeps timing out during our production deployments, and when that happens most of the nodes fail to bootstrap and have to be re-deployed.
Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased.
When the timeouts happen, we clear the cache and restart the salt master service to recover, then proceed with the deployment in batches, which hampers our single-click deployments since Salt is the bottleneck.


root@prd-sjc-salt-master:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2300.034
BogoMIPS:              4603.81
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7

EvanKuhn-Affirm commented:

@zbarr I'm coming to this a little late, but to confirm, did setting the SOCK_POOL_SIZE larger fix your error? We are seeing the same types of errors and I suspect this config option is what we need to change.

zbarr commented Jun 8, 2020

My sock_pool_size is currently set to 64 and my worker_threads are set to 64 as well. We still have these timeout issues occasionally, but for one of my salt masters, restarting the service mitigates the issue for at least a week. As we are still on a 2018.3.3 master, we'll be pushing toward an upgrade soon anyway; we're also seeing some of the memory leak issues that are documented here, so we're due for a refresh.

saltysaltsalty commented:

Is there anyone on salt 3000 that is experiencing this problem? I'm currently on salt 2019.2.5 and seeing these sporadic errors.

TeddyAndrieux added a commit to scality/metalk8s that referenced this issue Feb 17, 2021

From time to time the salt-master gets overloaded because it receives too many queries, for example during an upgrade, and on a somewhat slow environment some salt states may time out and make the upgrade fail.
To avoid that kind of issue, bump the `sock_pool_size` on the salt master (from 1 to 15) to avoid blocking while waiting for zeromq communications, and also bump the `worker_threads` on the salt master (from 5 to 10) to avoid failures when there is too much communication with the salt master (e.g. because of upgrade + storage operator).

Sees: saltstack/salt#53147

TeddyAndrieux added further commits with the same message to scality/metalk8s that referenced this issue on Feb 17, 2021 and Apr 8, 2021 (including cherry-picks of commit 21d679a).
zbarr commented Apr 13, 2021

@saltysaltsalty I am still seeing this issue on 3002.2.

raddessi (Contributor) commented:

Still seeing this on 3004

nick-stocks commented:

+1 on 3004. With only 3-5 minions too

Oddly commented Dec 12, 2022

Same here, minion version 3000 (SLES12.5 - 3000-71.2) and Salt master 3004.2.

jbartak commented Jun 14, 2023

Same on version 3005.1 :(
Minions: 550

Master server: 16 CPU, 16 GB RAM.

darkpixel (Contributor) commented:

Still occurring on 3006.1 master with 3006.1 minions.

It's an 8-core machine with 16 GB RAM.
We have about 500 minions and the load average is typically around 1.2.
worker_threads is set to 12 and sock_pool_size is set to 15.
