
Fix race condition in Salt loader. #45782

Closed
wants to merge 1 commit

Conversation

Contributor

@smarsching smarsching commented Jan 30, 2018

What does this PR do?

This PR is a new version of #39948, so the description has been copied from there. The only difference from the original PR is that this PR is targeted at the oxygen branch instead of the develop branch.

There was a race condition in the Salt loader when injecting global values (e.g. `__pillar__` or `__salt__`) into modules. One effect of this race condition was that in a setup with multiple threads, some threads might see pillar data intended for other threads, or the pillar data seen by a thread might even change spuriously.

There have been earlier attempts to fix this problem (#27937, #29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules.

These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance. When there is more than one loader, there is more than one thread-local variable and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on #39670.

This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: the value injected into the modules is now a proxy that internally uses a thread-local variable to decide which object it points to. This means that when loading a module again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change.

This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessarily know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach used by the two earlier patches.
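
For readers who want to see the idea in code: below is a minimal sketch of such a proxy, assuming a simplified API. The real implementation lives in salt/utils/thread_local_proxy.py and is considerably more involved (fallbacks for threads that have not set a reference yet, many more special methods, etc.); names and details here are illustrative only.

```python
import threading


class ThreadLocalProxy(object):
    '''Simplified sketch: forwards all access to a per-thread target object.'''

    def __init__(self, obj):
        object.__setattr__(self, '_thread_local', threading.local())
        ThreadLocalProxy.set_reference(self, obj)

    @staticmethod
    def get_reference(proxy):
        # Each thread sees only the object that was set for it.
        thread_local = object.__getattribute__(proxy, '_thread_local')
        return thread_local.reference

    @staticmethod
    def set_reference(proxy, obj):
        thread_local = object.__getattribute__(proxy, '_thread_local')
        thread_local.reference = obj

    def __getattribute__(self, name):
        # Delegate attribute access to the current thread's target object.
        return getattr(ThreadLocalProxy.get_reference(self), name)

    def __getitem__(self, key):
        return ThreadLocalProxy.get_reference(self)[key]
```

The loader injects one such proxy as, say, `__pillar__`. When a module is loaded again through a different loader, only `set_reference` is called in the calling thread, so every other thread keeps seeing the data it set itself.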

What issues does this PR fix or reference?

This PR fixes problems that surfaced when developing the parallel runners feature (#39670), but is also related to problems reported earlier (#23373). The problems described in #29028 might also be related, though this needs further confirmation.

Previous Behavior

Changes to pillar data in one thread might have influenced pillar data in a different thread. Whether or when this problem actually appeared was unpredictable (due to the nature of a race condition).

New Behavior

Changes to pillar data in one thread will never affect the pillar data in a different thread.
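
As a quick illustration of this guarantee, here is a small usage example built on the simplified proxy sketch above (illustrative only, not Salt code):

```python
import threading

# 'pillar' stands in for the injected __pillar__ proxy from the sketch above.
pillar = ThreadLocalProxy({'role': 'initial'})


def worker(name):
    # Re-pointing the proxy only affects the current thread.
    ThreadLocalProxy.set_reference(pillar, {'role': name})
    assert pillar['role'] == name


threads = [threading.Thread(target=worker, args=('minion-%d' % i,))
           for i in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# The main thread still sees the value it set itself.
assert pillar['role'] == 'initial'
```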

Tests written?

No

Regression Potential

The change to the loader code itself is rather small, but it affects a central component, so if there is a problem, it might affect nearly all components.

@smarsching smarsching requested a review from a team as a code owner January 30, 2018 15:55
Contributor

@DmitryKuzmenko DmitryKuzmenko left a comment


Awesome work!
Just one minor change in doc.

Proxy object that can reference different values depending on the current
thread of execution.

..versionadded:: Nitrogen
Contributor

Oxygen?

Contributor Author

You are right. This simply shows how old the origins of this PR are. 😃 I fixed it.

@gtmanfred
Contributor

Does this PR need to be pointed at oxygen.rc1?

@rallytime
Contributor

@gtmanfred No, when I talked to mike about this, we decided to have this go into oxygen for the .1 release since we're hardening the branch right now.

@gtmanfred
Contributor

cool 👍 sounds good

@gtmanfred
Contributor

@smarsching would you mind rebasing this? It looks like it should be pretty simple; the conflicting change just added encoding to the msgpack stuff in salt.utils.cloud.

@smarsching smarsching force-pushed the thread-safe-loader-2 branch from c95c3eb to 2b5cbd7 Compare February 3, 2018 09:55
@smarsching
Contributor Author

@gtmanfred I rebased it onto the newest version of the oxygen branch and used the chance to squash the two commits, so there is now a single, clean commit.

@cachedout
Contributor

Thus far I am OK with this but I want to sit down with @thatch45 and discuss this line-by-line before I approve this. These are substantial changes but overall, I think it looks pretty good.

@@ -97,14 +97,17 @@
try:
# Attempt to import msgpack
import msgpack
import salt.utils.msgpack
Collaborator

Why import if we're not using it?

@@ -456,7 +459,7 @@ def _make_packet(self, label, timestamp, data):
packet = (tag, timestamp, data)
if self.verbose:
print(packet)
return msgpack.packb(packet)
return salt.utils.msgpack.packb(packet, _msgpack_module=msgpack)
Collaborator

Oh, we are using it here...
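
For context, the new salt.utils.msgpack helper introduced by this PR wraps the msgpack functions and strips proxies before handing data to the C serializer; the `_msgpack_module` keyword lets the caller pass in its own msgpack import. A simplified sketch of that shape, based on the call sites and tracebacks quoted in this thread rather than the exact module source:

```python
# salt/utils/msgpack.py -- simplified sketch, not the exact module source.
from __future__ import absolute_import

import salt.utils.thread_local_proxy


def packb(o, _msgpack_module=None, **kwargs):
    '''
    Wrap msgpack.packb(): unwrap any ThreadLocalProxy instances first,
    because the msgpack C extension cannot serialize proxy objects.
    '''
    o = salt.utils.thread_local_proxy.ThreadLocalProxy.unproxy_recursive(o)
    return _msgpack_module.packb(o, **kwargs)
```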

@rallytime rallytime changed the base branch from oxygen to 2018.3 February 23, 2018 13:54
dincamihai pushed a commit to openSUSE/salt-packaging that referenced this pull request Mar 2, 2018
dincamihai pushed a commit to openSUSE/salt-packaging that referenced this pull request Mar 2, 2018
@mbologna
Contributor

mbologna commented Mar 2, 2018

While setting up a salt-minion with this patch applied we encountered the following errors:

 (remote-exec): File "/usr/lib/python2.7/site-packages/salt/utils/thread_local_proxy.py", line 319, in __nonzero__
 (remote-exec): return bool(reference)
 (remote-exec): File "/usr/lib/python2.7/site-packages/salt/utils/thread_local_proxy.py", line 319, in __nonzero__
 (remote-exec): return bool(reference)
 (remote-exec): File "/usr/lib/python2.7/site-packages/salt/utils/thread_local_proxy.py", line 319, in __nonzero__
 (remote-exec): return bool(reference)
 (remote-exec): File "/usr/lib/python2.7/site-packages/salt/utils/thread_local_proxy.py", line 318, in __nonzero__
 (remote-exec): reference = ThreadLocalProxy.get_reference(self)
 (remote-exec): File "/usr/lib/python2.7/site-packages/salt/utils/thread_local_proxy.py", line 69, in get_reference
 (remote-exec): thread_local = object.__getattribute__(proxy, '_thread_local')
 (remote-exec): RuntimeError: maximum recursion depth exceeded while calling a Python object

@smarsching
Contributor Author

@mbologna Good catch! I am not sure why this problem did not surface during my tests (maybe because I tested them with an older Salt version).

I have pushed two further commits. The first one fixes the issue that you found, and the second one fixes a related issue. I am not sure whether the second one has any practical significance (it is quite possible that having such a recursive data structure would lead to the same problem in the serialization code), but I think it was still worth fixing.

I also included unit tests for both issues in order to ensure that they are actually fixed.

Could you please test it again with the revised patch?
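
For readers following along: the recursion happens when a proxy's reference is itself a proxy (for example a chain of proxies that eventually points back at itself), so methods such as `__nonzero__` keep delegating to another proxy forever. One possible guard, continuing the simplified sketch from earlier and not necessarily what the pushed commits actually do:

```python
def unwrap(obj):
    '''Follow proxy references until a plain (non-proxy) value is reached.'''
    seen = set()
    while isinstance(obj, ThreadLocalProxy):
        if id(obj) in seen:
            # A reference cycle is exactly what exhausted the recursion
            # limit in the traceback above.
            raise ValueError('cyclic ThreadLocalProxy reference')
        seen.add(id(obj))
        obj = ThreadLocalProxy.get_reference(obj)
    return obj
```

Calling such a helper inside `set_reference()` (or before delegating in methods like `__nonzero__`) keeps a proxy from ever delegating to another proxy.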

@rallytime
Contributor

@smarsching Ah, this needs a rebase. Can you fix that? Looks like there might be a lint error, too, but the build has timed out.

Once this is rebased and errors are fixed, we can see if @cachedout and @thatch45 can go through this one last time. :)

@smarsching smarsching force-pushed the thread-safe-loader-2 branch 2 times, most recently from e041d82 to 18b8162 Compare March 9, 2018 10:33
@smarsching
Contributor Author

@rallytime I rebased the branch and fixed the lint problems.

@rallytime
Contributor

re-run ubuntu

@rallytime
Contributor

@smarsching Looks like something isn't quite right here. I am seeing this stacktrace right when the test daemons start up:

20:10:49  * Starting salt-master ... 20:10:48,977 [salt.log.setup                           :1165][ERROR   ] An un-handled exception was caught by salt's global exception handler:
20:10:49 AttributeError: 'list' object has no attribute 'iteritems'
20:10:49 Traceback (most recent call last):
20:10:49   File "/tmp/salt-tests-tmpdir/scripts/cli_run.py", line 27, in <module>
20:10:49     salt_run()
20:10:49   File "/testing/salt/scripts.py", line 412, in salt_run
20:10:49     client.run()
20:10:49   File "/testing/salt/cli/run.py", line 40, in run
20:10:49     ret = runner.run()
20:10:49   File "/testing/salt/runner.py", line 271, in run
20:10:49     daemonize=False)
20:10:49   File "/testing/salt/client/mixins.py", line 485, in _proc_function
20:10:49     return self.low(fun, low, full_return=False)
20:10:49   File "/testing/salt/client/mixins.py", line 261, in low
20:10:49     return self._low(fun, low, print_event=print_event, full_return=full_return)
20:10:49   File "/testing/salt/client/mixins.py", line 418, in _low
20:10:49     mminion=self.mminion,
20:10:49   File "/testing/salt/utils/job.py", line 107, in store_job
20:10:49     mminion.returners[savefstr](load['jid'], load)
20:10:49   File "/testing/salt/returners/local_cache.py", line 215, in save_load
20:10:49     serial.dump(clear_load, wfh)
20:10:49   File "/testing/salt/payload.py", line 263, in dump
20:10:49     fn_.write(self.dumps(msg))
20:10:49   File "/testing/salt/payload.py", line 225, in dumps
20:10:49     _msgpack_module=msgpack)
20:10:49   File "/testing/salt/utils/msgpack.py", line 48, in packb
20:10:49     o = salt.utils.thread_local_proxy.ThreadLocalProxy.unproxy_recursive(o)
20:10:49   File "/testing/salt/utils/thread_local_proxy.py", line 180, in unproxy_recursive
20:10:49     return ThreadLocalProxy._unproxy_recursive(obj)
20:10:49   File "/testing/salt/utils/thread_local_proxy.py", line 215, in _unproxy_recursive
20:10:49     or ThreadLocalProxy._needs_unproxy_recursive(obj)):
20:10:49   File "/testing/salt/utils/thread_local_proxy.py", line 290, in _needs_unproxy_recursive
20:10:49     for key, value in six.iteritems(obj))
20:10:49   File "/testing/salt/utils/thread_local_proxy.py", line 290, in <genexpr>
20:10:49     for key, value in six.iteritems(obj))
20:10:49   File "/testing/salt/utils/thread_local_proxy.py", line 290, in _needs_unproxy_recursive
20:10:49     for key, value in six.iteritems(obj))
20:10:49   File "/testing/salt/utils/thread_local_proxy.py", line 290, in <genexpr>
20:10:49     for key, value in six.iteritems(obj))
20:10:49   File "/testing/salt/utils/thread_local_proxy.py", line 298, in _needs_unproxy_recursive
20:10:49     for key, value in six.iteritems(obj))
20:10:49   File "/testing/salt/ext/six.py", line 606, in iteritems
20:10:49     return d.iteritems(**kw)
20:10:49 AttributeError: 'list' object has no attribute 'iteritems'

@smarsching smarsching force-pushed the thread-safe-loader-2 branch 2 times, most recently from cf4614d to b573b14 Compare March 9, 2018 21:02
@smarsching
Contributor Author

@rallytime Thank you very much. This was a bug that I introduced when fixing the problem reported by @mbologna. I added a unit test for detecting the bug and fixed it. Let's see whether the test suite runs successfully now.
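
The underlying issue is that recursive unwrapping has to branch on the container type: dict-style iteration (`six.iteritems`) only works on mappings, while lists and tuples need plain iteration. A simplified sketch of that shape, again using the hypothetical proxy sketch above rather than the actual `unproxy_recursive` code:

```python
from salt.ext import six  # Salt bundles its own copy of six


def unproxy_recursive(obj):
    # Unwrap the object itself first.
    while isinstance(obj, ThreadLocalProxy):
        obj = ThreadLocalProxy.get_reference(obj)
    if isinstance(obj, dict):
        # Only mappings support iteritems(); calling six.iteritems() on a
        # list is what raised the AttributeError in the log above.
        return dict((unproxy_recursive(key), unproxy_recursive(value))
                    for key, value in six.iteritems(obj))
    if isinstance(obj, (list, tuple, set)):
        return type(obj)(unproxy_recursive(item) for item in obj)
    return obj
```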

@smarsching smarsching force-pushed the thread-safe-loader-2 branch from b573b14 to 0da5333 Compare March 9, 2018 23:56
salt/loader.py Outdated
# We use a double-checked locking scheme in order to avoid taking the lock
# when a proxy object has already been injected.
# In most programming languages, double-checked locking is considered
# unsafe when used without explicit memory barries because one might read


barriers

Contributor Author

Thanks, I fixed it.
It's good to see that someone actually reads the comments. 😃
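
For readers unfamiliar with the pattern mentioned in that comment, here is a generic sketch of double-checked locking as it could look when injecting the proxy into a module's globals. All names here (`inject_proxy`, `_inject_lock`) are hypothetical and this is not the actual loader code; note that in CPython the unlocked first read has stronger guarantees (thanks to the GIL) than it would in most other languages.

```python
import threading

_inject_lock = threading.Lock()  # hypothetical module-level lock


def inject_proxy(mod_globals, name, value):
    # First, unlocked check: in the common case the proxy already exists,
    # so we avoid taking the lock at all.
    proxy = mod_globals.get(name)
    if not isinstance(proxy, ThreadLocalProxy):
        with _inject_lock:
            # Second, locked check: another thread may have injected the
            # proxy between our first check and acquiring the lock.
            proxy = mod_globals.get(name)
            if not isinstance(proxy, ThreadLocalProxy):
                mod_globals[name] = ThreadLocalProxy(value)
                return mod_globals[name]
    # The proxy already exists: just re-point it for the current thread.
    ThreadLocalProxy.set_reference(proxy, value)
    return proxy
```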

There was a race condition in the salt loader when injecting global
values (e.g. "__pillar__" or "__salt__") into modules. One effect of
this race condition was that in a setup with multiple threads, some
threads may see pillar data intended for other threads or the pillar
data seen by a thread might even change spuriously.

There have been earlier attempts to fix this problem (saltstack#27937, saltstack#29397).
These patches tried to fix the problem by storing the dictionary that
keeps the relevant data in a thread-local variable and referencing this
thread-local variable from the variables that are injected into the
modules.

These patches did not fix the problem completely because they only
work when a module is loaded through a single loader instance.
When there is more than one loader, there is more than one
thread-local variable and the variable injected into a module is
changed to point to another thread-local variable when the module is
loaded again. Thus, the problem resurfaced while working on saltstack#39670.

This patch attempts to solve the problem from a slightly different
angle, complementing the earlier patches: The value injected into the
modules now is a proxy that internally uses a thread-local variable to
decide to which object it points. This means that when loading a module
again through a different loader (possibly passing different pillar
data), the data is actually only changed in the thread in which the
loader is used. Other threads are not affected by such a change.

This means that it will work correctly in the current situation where
loaders are possibly created by many different modules and these
modules do not necessarily know in which context they are executed. Thus
it is much more flexible and reliable than the more explicit approach
used by the two earlier patches.

Unfortunately, the standard JSON and Msgpack serialization code cannot
handle proxied objects, so they have to be unwrapped before passing them
to that code.

The salt.utils.json module has been modified to take care of
unwrapping objects that are proxied using the ThreadLocalProxy.

The salt.utils.msgpack module has been added and basically provides the
same functions as the salt.utils.json module, but for msgpack. Like the
json module, it takes care of unwrapping proxies.
@smarsching smarsching force-pushed the thread-safe-loader-2 branch from 0da5333 to cca87c6 Compare March 10, 2018 10:42
@smarsching
Contributor Author

@rallytime The Jenkins logs look better now, but there are still some errors (that seem to be related to timeouts). You probably know better whether this is expected or not.

@rallytime
Contributor

re-run py

@rallytime
Contributor

rallytime commented Mar 12, 2018

@smarsching The test processes are starting just fine now, but I am seeing the following stacktrace throughout the test suite. Can you take another look?

14:36:29  * Starting salt-master ... 
                                                                                                                                                                
 * Starting salt-master ... STARTED!
14:36:41  * Starting salt-minion ... 14:36:41,675 [salt.master                                :1056][ERROR   ] Received malformed command {'opts': {'publisher_acl_blacklist': {}}}
14:36:41 14:36:41,676 [salt.transport.zeromq                      :687 ][ERROR   ] Some exception handling a payload from minion
14:36:41 Traceback (most recent call last):
14:36:41   File "/testing/salt/transport/zeromq.py", line 683, in handle_message
14:36:41     ret, req_opts = yield self.payload_handler(payload)
14:36:41 ValueError: need more than 0 values to unpack
14:36:41 14:36:41,676 [tornado.general                            :445 ][ERROR   ] Uncaught exception, closing connection.
14:36:41 Traceback (most recent call last):
14:36:41   File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 437, in _handle_events
14:36:41     self._handle_send()
14:36:41   File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_send
14:36:41     status = self.socket.send_multipart(msg, **kwargs)
14:36:41   File "/usr/lib/python2.7/dist-packages/zmq/sugar/socket.py", line 238, in send_multipart
14:36:41     return self.send(msg_parts[-1], flags, copy=copy, track=track)
14:36:41   File "socket.pyx", line 565, in zmq.backend.cython.socket.Socket.send (zmq/backend/cython/socket.c:5263)
14:36:41   File "socket.pyx", line 605, in zmq.backend.cython.socket.Socket.send (zmq/backend/cython/socket.c:4981)
14:36:41 TypeError: unicode not allowed, use send_string
14:36:41 14:36:41,677 [tornado.application                        :612 ][ERROR   ] Exception in callback None
14:36:41 Traceback (most recent call last):
14:36:41 
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 866, in start
14:36:41     handler_func(fd_obj, events)
14:36:41   File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
14:36:41     return fn(*args, **kwargs)
14:36:41   File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 437, in _handle_events
14:36:41     self._handle_send()
14:36:41   File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_send
14:36:41     status = self.socket.send_multipart(msg, **kwargs)
14:36:41   File "/usr/lib/python2.7/dist-packages/zmq/sugar/socket.py", line 238, in send_multipart
14:36:41     return self.send(msg_parts[-1], flags, copy=copy, track=track)
14:36:41   File "socket.pyx", line 565, in zmq.backend.cython.socket.Socket.send (zmq/backend/cython/socket.c:5263)
14:36:41   File "socket.pyx", line 605, in zmq.backend.cython.socket.Socket.send (zmq/backend/cython/socket.c:4981)
14:36:41 TypeError: unicode not allowed, use send_string
14:36:41 

@smarsching
Contributor Author

@rallytime: After working on this problem for almost a day, I am still puzzled. I have verified that the problem is indeed caused by my patch, but I cannot find the reason why it happens. Interestingly, it only appears when running the tests with --transport=zeromq.

The problems when starting the Minion seem to be related to the message

TypeError: Frame 0 ('Some exception handling minion ...) does not support the buffer interface.

in the master.log file, but I suspect this message is a consequence of the failure rather than a hint to its cause.

I suspect that the root cause of the problem is that the ZeroMQ serialization code somehow sees a proxy object and does not know how to handle it. But instead of throwing a proper exception right there, the code does something strange which then results in the problems at a later point in time. Debugging the whole thing is not made any easier by the fact that most of the logic is asynchronous, so it is hard to track the (real) call paths.

I clearly need more time to understand the whole ZeroMQ transport logic, but at the moment I do not have that time due to some other urgent projects. I hope that I will be able to investigate these problems in a few weeks (maybe earlier if I can find some spare time).

@rallytime
Contributor

@DmitryKuzmenko We'd like to get this in for a future 2018.3.x release, and @smarsching has spent quite a bit of time trying to get it finished up. Would you be willing to take a look at this and help?

@mattp- You might be interested in looking at this as well.

@mattp-
Contributor

mattp- commented May 22, 2018

@rallytime Sure, I can take a stab at seeing if I can figure something out, though I'll admit this isn't my wheelhouse.
On Slack @cachedout said this PR might be fixed a different way with #46641; do you have any insight there, Mike or Dmitry?

@rallytime
Contributor

@mattp- That's totally fine, I just thought you might be interested in this one. That's a good point about #46641.

@smarsching Thoughts on that as well?

@smarsching
Contributor Author

First of all, sorry you have not heard anything from me about this PR for such a long time. I simply have not found the time to dig into the problems with ZeroMQ yet, and at the moment it looks like I am not going to find the time for at least another two or three weeks.

Now more to the point: @rallytime I do not think that #46641 addresses the same issue. I only took a short glance at the changes introduced by that PR, so I might have missed something, but it looks to me like it mainly addresses a race condition in the loader code itself (that can occur due to the GIL being released in certain situations). It does not address the issue that the loader may inject objects (like the pillar tree) that should be thread local, but effectively are not.

From an architectural point of view, my PR is a bit of a hack that tries to work around the fact that some of the objects injected into modules are basically global variables that represent a context. Obviously, this is not a very good design, and when starting from scratch one would certainly try to avoid something like this, but removing it now would mean making changes to virtually every module of Salt (including custom modules which are not even available as part of the Salt distribution itself), so introducing the proxy object looks like a good trade-off.

In fact, as far as (pure) Python code is concerned, the proxy objects behave (almost) like the original objects, so it does not really matter to the code using them. Unfortunately, this is not true for Python modules that use C code, because (depending on what the C code does) the concrete type actually does matter to the code.

So far I have (at least I believe so) ensured that proxies are "unpacked" before passing them to the serialization code of the JSON and MsgPack libraries. However, the same problem seems to apply to the serialization code of ZeroMQ. Unfortunately, I have not yet figured out where the ZeroMQ code gets the proxied objects. Part of the problem is that I am not very familiar with either Tornado or ZeroMQ, so I do not fully understand the data flow yet. Maybe someone with a better understanding of these libraries has a better chance of figuring out how the proxied objects end up in the ZeroMQ serialization layer.
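
A tiny illustration of that last point, using a made-up stand-in class rather than the real ThreadLocalProxy: C-level serializers such as msgpack dispatch on the concrete type, so an object that behaves like a string in pure Python code is still rejected (or mishandled) by extension code.

```python
# Illustrative only; FakeProxy is a made-up stand-in, not Salt code.
import msgpack


class FakeProxy(object):
    def __init__(self, value):
        self._value = value

    def __str__(self):
        return str(self._value)


print(msgpack.packb('plain string'))  # works: a real str object

try:
    msgpack.packb(FakeProxy('plain string'))
except TypeError as exc:
    # The C extension does not recognize the concrete type FakeProxy,
    # no matter how string-like the object behaves in Python code.
    print(exc)
```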

@isbm
Contributor

isbm commented Jun 26, 2018

@smarsching is there any hope you are going to look into this? You've done lots of work here, but it seems like the PR is now stale. The problem is still here, and either others will have to discard it all and start over themselves, or you just finish it. The World is waiting for you, hero! 😉

What do you say?

@mattp- mattp- mentioned this pull request Jul 16, 2018
@damon-atkins
Contributor

Go Go Jenkins!

@dubb-b

dubb-b commented Oct 31, 2018

re-run docs

@smarsching
Contributor Author

This PR has been replaced by #50655.

@smarsching smarsching closed this Jan 25, 2019
@smarsching smarsching deleted the thread-safe-loader-2 branch December 9, 2022 23:16