Salt Tornado API: salt.exceptions.SaltReqTimeoutError: Message timed out #53147
Comments
For further detail, we have just over 500 minions in our infrastructure.
This is ongoing. I've increased the worker threads from 64 to 128, and I've changed the API wrapper to use a lookup_jid instead of exit_success to lighten the load on the salt minions. I also recently received a longer and more detailed error:
@dwoz Any ideas here?
@zbarr what config are you running? sock_pool_size can mitigate this, among other things, and is likely what you're seeing.
Trying that now. Based on the documentation, this may be exactly what I need:

> SOCK_POOL_SIZE
> To avoid blocking while writing data to a socket, we support a socket pool for Salt applications. For example, a job with a large target host list can cause a long blocking wait. The option is used by the ZMQ and TCP transports; the other transport methods don't need the socket pool by definition. Most Salt tools, including the CLI, are fine with a single bucket of socket pool. On the other hand, it is highly recommended to set the size of the socket pool larger than 1 for other Salt applications, especially the Salt API, which must write data to sockets concurrently.
>
> sock_pool_size: 15

However, if this works, I am pretty disappointed that it wasn't mentioned in the documentation for rest_tornado. Based on how unreliable my setup was, this should be a requirement to run the salt API.
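For reference, a minimal sketch of how that might look in the master config, using the 15-bucket value from the documentation snippet above (the right number is something to tune per environment):

```yaml
# /etc/salt/master -- sketch only; tune the value for your environment
# Use a pool of sockets so Salt API writes don't block on a single socket.
sock_pool_size: 15
```

A salt-master restart is needed for the change to take effect.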
Hi, we have recently been experiencing similar issues in our production single salt master environment. I have exhausted all the resources I could find and no resolution is documented anywhere.
salt-master server settings
- Error message from the salt-master
- Error message from the salt client node itself
@harishkapri I think you may have a different issue than I had. It appears to me that your client is raising a timeout error because the master can't authenticate that minion.
@harishkapri that is both a very large worker_threads value and a very old version of Salt. I'd try changing both.
Ah, that's you @zbarr. Use the minion data cache, not what you're trying to do now.
@liu624862560 you'd want to increase your worker_threads, and probably batch the requests so it doesn't all happen at once. But again, you can use the minion data cache to do what you're doing automatically instead of scheduling it yourself.
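As an illustration of the minion data cache suggestion, here is a rough sketch of reading cached grains through the API with Pepper instead of executing grains.items on every minion. The URL, credentials, and eauth backend are placeholders, and it assumes salt-api is running and the API user is allowed to call runner functions:

```python
# Sketch only: placeholder URL/credentials; assumes salt-api is reachable and
# external auth permits the runner client for this user.
from pepper import Pepper

api = Pepper("https://salt-master.example.com:8000")   # placeholder URL
api.login("api_user", "api_password", "pam")           # placeholder credentials/eauth

# cache.grains reads grains from the master-side minion data cache,
# so no job is published to the minions at all.
cached = api.low([{"client": "runner", "fun": "cache.grains", "tgt": "*"}])
print(cached["return"][0])
```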
@mattp- I will go ahead and increase the worker_threads, restart the salt-master service, and verify. I was under the impression that worker_threads was always dependent on the salt master version, so even if we increase it, I believe there is a limit imposed by the specific salt master version. Please correct me if I am wrong.
Harish, I'd recommend you actually reduce worker_threads to some factor of your CPU cores (even 1x). 250 threads is waaaay too high unless you have some monster server.
> On Thu, Jan 16, 2020 at 3:10 PM, Harish Kapri wrote:
> worker_threads: 262
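To make that sizing advice concrete, a sketch assuming a 16-core master like the ones mentioned later in this thread:

```yaml
# /etc/salt/master -- sketch only; assumes a 16-core master
# Roughly 1x the CPU core count, rather than hundreds of workers.
worker_threads: 16
```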
We are running the salt master on an AWS EC2 instance (type m4.2xlarge).
@zbarr I'm coming to this a little late, but to confirm: did setting sock_pool_size to a larger value fix your error? We are seeing the same types of errors and I suspect this config option is what we need to change.
My sock_pool_size is currently set to 64 and my worker_threads are set to 64 as well. We still have these timeouts occasionally, but for one of my salt masters, restarting the service mitigates the issue for at least a week. As we are still on the 2018.3.3 master, we'll be pushing towards an upgrade soon. We're also seeing some memory leak issues that are documented here, so we're due for a refresh anyway.
Is there anyone on salt 3000 that is experiencing this problem? I'm currently on salt 2019.2.5 and seeing these sporadic errors.
From time to time the salt-master gets overloaded because it receives too many queries; for example, during an upgrade, if one environment is a bit slow, some salt states may time out and make the upgrade fail. To avoid that kind of issue, bump `sock_pool_size` on the salt master (from 1 to 15) to avoid blocking waits on ZeroMQ communication, and also bump `worker_threads` on the salt master (from 5 to 10) to avoid failures when there is too much communication with the salt master (e.g. because of an upgrade plus the storage operator). See: saltstack/salt#53147
@saltysaltsalty I am still seeing this issue on 3002.2.
Still seeing this on 3004.
+1 on 3004, with only 3-5 minions too.
Same here, minion version 3000 (SLES12.5 - 3000-71.2) and Salt master 3004.2. |
Same here, on a master server with 16 CPU cores and 16 GB RAM.
Still occurring. It's an 8-core machine with 16 GB RAM.
Description of Issue/Question
We just started using the salt API a few months ago and we cannot fix this issue. Currently, we are using Pepper to make the requests. I have created an "async" wrapper around it which runs a local_async command, polls with the jobs.exit_success runner for 30 seconds to a minute, then does a lookup_jid to get the results (sketched below). Currently, the only thing we use this for is to retrieve all of the grains from the minions and store them in a database (just running grains.items in non-batch mode). The output below does not contain the salt API logs from the successful local_async command and successful exit_success runners; it only contains the one that eventually fails.
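For context, a rough sketch of what such a wrapper might look like with Pepper; the URL, credentials, and poll interval are placeholders, not the reporter's actual code:

```python
# Sketch of the described "async" wrapper (placeholder URL/credentials):
# start a job with local_async, poll the jobs.exit_success runner,
# then fetch results with jobs.lookup_jid.
import time
from pepper import Pepper

api = Pepper("https://salt-master.example.com:8000")  # placeholder URL
api.login("api_user", "api_password", "pam")           # placeholder credentials/eauth

# Kick off grains.items asynchronously on all minions.
job = api.low([{"client": "local_async", "tgt": "*", "fun": "grains.items"}])
jid = job["return"][0]["jid"]

# Poll until the job reports success on all targeted minions, or give up (~60s).
for _ in range(12):
    status = api.low([{"client": "runner", "fun": "jobs.exit_success", "jid": jid}])
    if all(status["return"][0].values()):
        break
    time.sleep(5)

# Retrieve the accumulated results for the job.
results = api.low([{"client": "runner", "fun": "jobs.lookup_jid", "jid": jid}])
print(results["return"][0])
```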
When using the tornado API, about 2/3 of my "jobs" fail with the following message (trace logging on):
Setup
(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)
We have a 3-master multi-master setup. The master this is running on is a 16-core, 128 GB RAM physical host (plenty of hardware).
Steps to Reproduce Issue
(Include debug logs if possible and relevant.)
See description of issue. May be complicated to reproduce but I can hand over more code if needed.
Versions Report
(Provided by running `salt --versions-report`. Please also mention any differences in master/minion versions.)

Please advise. Thanks!