master auth delay #49586
Comments
This is not typical performance, but of course Salt's gitfs performance is tied to the performance of the git server itself. Do you have large objects in your repo? There are pretty well-known performance problems when serving large objects out of a git repo. |
Hi @cachedout, thanks for the answer. How do you define "large objects"? We do have some old WAR files; the whole repo is about 2 GB. Is that too big? |
Setting trace instead of debug might show more information if there is inactivity and debug is not showing anything, just FYI. Looking forward to the logs :) |
Hi, do you have any clues about what causes such a big difference between gitfs and non-gitfs? |
Thanks for the logs. I'm not seeing anything that jumps out at me, so I'm wondering if it is the WAR files / 2 GB size of the repo. ping @terminalmage do you know if that sort of size could slow things down? The original question was "We do have some old WAR files; the whole repo is about 2 GB. Is that too big?" |
Unfortunately, that doesn't really tell us anything about fileclient requests; for that you'd need some logging from the minion. Trace logging is overkill, debug is fine. Please add the following to the minion config:
Next, log in to the minion and run the salt-call again. |
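(Both the config snippet and the exact command above were lost in extraction; a minimal reconstruction, assuming Salt's standard log_level_logfile/log_fmt_logfile options and a debug-level state run, might look like this:)

    # /etc/salt/minion -- debug logging to the minion log file, with timestamps
    # (the exact format string below is an assumption, not the original one)
    log_level_logfile: debug
    log_fmt_logfile: '%(asctime)s,%(msecs)03d [%(name)-17s][%(levelname)-8s] %(message)s'

    # then, on the minion, repeat the run; with no arguments state.apply runs the highstate
    salt-call state.apply -l debug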
Here we go :
|
I'm sorry, I mistyped the setting above. Could you repeat the above after setting the corrected option, so that the log includes timestamps? |
Yep, I saw that, but I guessed you were looking for something specific in that log. Anyway, here is the time-stamped log:
I can see that 90% of the required time is spent waiting for the connection to the master's port 4506... The ping to the master is not bad:
A telnet to this port is instantaneous... Any suggestions? |
It looks like this may have nothing to do with the fileserver. Can you try a couple of quick timed runs from the minion (see the sketch below) and compare? |
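(The tail of that question was cut off; this is a sketch of the kind of quick comparison presumably being asked for, with test.ping assumed as a stand-in for whatever function the original comment named:)

    # compare a round trip through the master with a purely local run
    time salt-call test.ping -l debug
    time salt-call --local test.ping -l debug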
Hi, there are definitely not the same delays involved here, see:
And :
|
OK, this does seem to be specific to fileclient requests. And it doesn't seem to be a connection latency issue either, since the connection itself comes up instantly. Can you please up the verbosity in the master log? You can do this by setting the master's log file level to debug (sketched below). The master log file should already include timestamps; its default log format matches the one I asked you to set for the minion. |
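(A minimal sketch of the master-side change being requested, assuming the standard log_level_logfile option; restart the salt-master service after changing it:)

    # /etc/salt/master -- raise the log file verbosity to debug
    log_level_logfile: debug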
Here we go : |
OK, according to the log, the job was published to the minion at 19:46:04, but the minion didn't even start running states until 19:47:05, almost exactly one minute later. Once started, they took just over 1 second to run (19:47:05.288697 to 19:47:06.426968), which suggests that fileclient requests are being fulfilled quickly and are not what is causing the delay. Can you bump up the loglevel on the minion, repeat the same test, and then check where that minute goes? I know I had you try a plain state run before, but this time please also run the no-op test state (a sketch follows after this comment).
This will use one of our test states which doesn't actually do anything (and is used simply to test the state compiler). In the past we have had issues with certain grains causing a timeout and delaying jobs. Are you using any custom-written grains, particularly ones which make outbound network connections? |
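(The exact invocation was lost in extraction; this is a sketch of a no-op state run of the kind described, using state.single with the test.nop state. The name argument is an arbitrary placeholder.)

    # exercise the state compiler without doing any real work
    salt-call state.single test.nop name=timing_check -l debug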
Wow, that's a lot of log to handle and I am not sure where to search for a job starting (either on master or minion), do you mind checking it? I really don't want to mess up the diagnosis with a wrong analysis... Let's first start with the very same state.apply: Now the test state you asked for: Now regarding your last question: nope, no house-written grains here. I am defining custom grains for some minions, but they contain only strings or ints; they are totally simple and defined inside the /etc/salt/grains file on the minions... there are no custom grains at all on the server on which the above tests were run. Hope you'll get the info you needed, thanks again for your time. |
OK, the delay appears to come after the minion auths to the master, but it's unclear if the auth is what is hanging. When we see delays like this, they are usually related to grains in some way. In the past we have had to remove grains that made outbound connections, as any latency in connecting to those hosts will delay the job while the grains are unable to finish. If you could run a salt-call using trace logging (a sketch follows), that would be the next step. |
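(The end of that request was cut off; judging from the follow-up, "the trace logging shows that we did complete the auth", it was presumably a salt-call at trace log level, along these lines:)

    # repeat the same run with trace-level logging on the console
    salt-call state.apply -l trace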
Here we go (sorry for the delay) : Thank you @terminalmage |
@cavepopo OK, it's definitely not the grains, and the trace logging shows that we did complete the auth. Given that we loaded the minion functions right after the delay ended, the next place I would like to look is when the minion requests Pillar data as it runs states. Would you be able to find salt/state.py on the minion (likely somewhere under your Python site-packages directory) and wrap the pillar compilation with debug log lines, so that it looks like this:

    log.debug('Gathering pillar data for state run')
    if initial_pillar and not self._pillar_override:
        self.opts['pillar'] = initial_pillar
    else:
        # Compile pillar data
        self.opts['pillar'] = self._gather_pillar()
        # Reapply overrides on top of compiled pillar
        if self._pillar_override:
            self.opts['pillar'] = salt.utils.dictupdate.merge(
                self.opts['pillar'],
                self._pillar_override,
                self.opts.get('pillar_source_merging_strategy', 'smart'),
                self.opts.get('renderer', 'yaml'),
                self.opts.get('pillar_merge_lists', False))
    log.debug('Finished gathering pillar data for state run')

Once you do this, if you can repeat that salt-call with debug-level logging this time, it will help tell us whether the delay happens while waiting for pillar data from the Master. Thanks for continuing to bear with me as we troubleshoot; I unfortunately can't reproduce this behavior on my end. |
Here is the minion log (full log: https://gist.github.com/cavepopo/6c9c2e67a556655a4621222bc061b64d):
Comparing this one to its master counterpart shows that the whole minute of delay (between 11:10:29 and 11:10:31) is spent running gitfs-related operations, mostly:
|
The master stuff is just the update thread fetching; that has nothing to do with pillar. What that debug logging shows, though, is that the pillar compilation inside the state run is not what is causing the delay, since it happens after the delay. OK, more debug logging, then (you can remove the earlier additions from salt/state.py). |
Here we go :
|
@cavepopo We are getting closer now. Since we know roughly where the delay is happening, the next step is in the gist I created. NOTE: Github is in the midst of a major outage right now, so the gist is currently showing a 404. It should show up once they recover from the outage. |
Bumping this comment thread just in case no email notification went out, since my last comment to @cavepopo was submitted during their (Github's) outage yesterday. |
Here :
|
Sorry for the delay, I was on vacation. I will look at this soon. |
No worries ! Hope you enjoyed your vacation :) |
And if you switch to gitpython, do you still see the issue? I'm wondering if these memory leaks are the cause of the performance slowdown. |
The memory leaks are certainly having an impact on performance but it isn't clear if they are the cause of the performance issues with auth. It looks like we are using gitpython now, as using pygit made everything much worse. |
Can you report back on whether or not you are still seeing the auth issues after a while of using gitpython? |
We do see minion auth delays with gitpython. |
And is performance better or not? |
After restarting the masters it initially seems improved, but then it degrades until we have to restart them again. |
Interesting, I thought this was specific to pygit2. Can you share a sanitized version of your configurations? I'll try to replicate. |
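(For illustration, a sanitized master config of the general shape under discussion, gitfs for states plus a git ext_pillar; the remote URLs are hypothetical placeholders, not the reporter's actual repositories:)

    # /etc/salt/master
    fileserver_backend:
      - gitfs
    gitfs_provider: pygit2
    gitfs_remotes:
      - https://git.example.com/salt/states.git
    ext_pillar:
      - git:
        - master https://git.example.com/salt/pillar.git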
I'm also having a problem that this thread seems to fit. I've got version 3000 of the master/minion with Python 2. The master is RHEL7 and I'm only running against a single Windows minion. Gitfs is running pygit2. I've changed a variety of configuration settings on the master trying to help, but I've set everything back to a basic install, except for the gitfs entries needed to make it work. I haven't touched any configuration on the minion. |
I think this is the "git file handler equivalent" of a bug I found:
Maybe you can test this by checking out / exporting the git repository locally (or putting some "dummy" data in that folder) instead of using the URL, to see if the response time changes. As far as I remember, even in debug mode I couldn't see in the logs that salt-master was scanning the whole folder structure (maybe I spotted it with lsof | grep /path/to/mountpoint), but mounting/unmounting those folders showed the change clearly. |
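(One way to run the comparison suggested above: clone the same repository onto the master's local disk and serve it through the plain roots backend instead of gitfs. The URL and paths here are hypothetical placeholders.)

    # clone the states repo to local disk
    git clone https://git.example.com/salt/states.git /srv/salt

    # /etc/salt/master -- serve from the local checkout for the test
    fileserver_backend:
      - roots
    file_roots:
      base:
        - /srv/salt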
I've somewhat already done this. We added the entire Visual Studio Build Tools network layout (like 15 GB and no telling how many files) to the local filesystem on the master and didn't see any noticeable performance difference. And I'm the same way with the logs, even in trace I'm not seeing anything that gives me an idea of what's going on. |
In my case it was a Gluster shared filesystem mounted into one subfolder, so it seems to commonly be a non-local-filesystem problem... |
@RobHayes512 sounds similar to some of my own issues in #55419. I am not sure, though, that this is the same problem that @cavepopo has documented in THIS issue. His seems different somehow and I don't have the skill to find the difference. But something else I've noticed is that, with the way the file list cache locking appears to work, multiple master worker threads can be independently rebuilding the file list cache, so you still wind up with a degree of thundering herd against your master. Bottom line: if you're using gitfs+pygit2, switch to gitfs+gitpython (see the sketch below). If that makes your state applies faster, then your problem is #55419. If not, it could be this issue or something else entirely. |
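(The provider switch recommended above is a one-line change in the master config; gitfs_provider is a standard option, and the salt-master service needs a restart for it to take effect:)

    # /etc/salt/master
    gitfs_provider: gitpython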
I tried gitpython yesterday, but we’re using something authentication-wise that isn’t available in gitpython. |
Reviewed in Grooming 2021-JUN-21, and we believe this PR may address the leaks: #60386. We would like someone here to try this PR as a fix. |
@sagetherage #60386 will address leaks but won't resolve the performance issues referenced in this issue. |
@dwoz yes, only the leaks. We left this as still in grooming since the original report is on a now-unsupported version of Salt; the next steps would be to also attempt to address performance. Not closing at this time. |
Description of Issue/Question
Using gitfs (pygit2-based) for states and external pillar, we are experiencing a major performance drawback: the ratio between locally stored states/pillar and gitfs-based storage is about 5x (in time to complete a setup). Local storage takes about 7-8 minutes, while the gitfs-based master takes about 45 minutes to achieve the exact same setup. :(
For now I am just wondering (without any real doubt, though) whether I am the only one experiencing this. I am quite sure I am, but I wanted to be sure first. Then I will add debug logs and the like to this issue to help track down the root cause.
So: is anyone else experiencing this kind of performance using gitfs for states and external pillar?
Steps to Reproduce Issue
(Include debug logs if possible and relevant.)
Versions Report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
master / minion on the latest (2018.3.2), as follows: