Fix race condition in Salt loader. #45782
Conversation
Awesome work!
Just one minor change in the docs.
salt/utils/thread_local_proxy.py
Outdated
Proxy object that can reference different values depending on the current
thread of execution.

..versionadded:: Nitrogen
Oxygen?
You are right. This simply shows how old the origins of this PR are. 😃 I fixed it.
Does this PR need to be pointed at oxygen.rc1?
@gtmanfred No, when I talked to mike about this, we decided to have this go into oxygen for the .1 release since we're hardening the branch right now.
cool 👍 sounds good
@smarsching would you mind rebasing this? It looks like it should be pretty simple; it was just adding encoding to the msgpack stuff in salt.utils.cloud.
Force-pushed from c95c3eb to 2b5cbd7
@gtmanfred I rebased it to the newest version of the
Thus far I am OK with this but I want to sit down with @thatch45 and discuss this line-by-line before I approve this. These are substantial changes but overall, I think it looks pretty good.
@@ -97,14 +97,17 @@
 try:
     # Attempt to import msgpack
     import msgpack
+    import salt.utils.msgpack
Why import if we're not using it?
@@ -456,7 +459,7 @@ def _make_packet(self, label, timestamp, data):
     packet = (tag, timestamp, data)
     if self.verbose:
         print(packet)
-    return msgpack.packb(packet)
+    return salt.utils.msgpack.packb(packet, _msgpack_module=msgpack)
Oh, we are using it here...
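For readers following along, here is a rough, hypothetical sketch of what such a wrapper could look like. Only the `packb(..., _msgpack_module=...)` signature and the `salt.utils.thread_local_proxy` module name come from this PR; the helper names (`_unwrap`, `get_reference`) are illustrative assumptions, not the PR's actual API.

```python
import msgpack

from salt.utils.thread_local_proxy import ThreadLocalProxy  # module added by this PR


def _unwrap(value):
    # Recursively replace proxies (including proxies nested inside containers)
    # with the plain objects they reference in the current thread, because the
    # C-based msgpack serializer only understands concrete built-in types.
    if isinstance(value, ThreadLocalProxy):
        return _unwrap(ThreadLocalProxy.get_reference(value))  # hypothetical helper
    if isinstance(value, dict):
        return {_unwrap(k): _unwrap(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(_unwrap(v) for v in value)
    return value


def packb(o, _msgpack_module=msgpack, **kwargs):
    # Same call signature as msgpack.packb, plus the _msgpack_module override
    # seen in the diff above, so callers (and tests) can inject the module.
    return _msgpack_module.packb(_unwrap(o), **kwargs)
```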
While setting up a salt-minion with this patch applied, we encountered the following errors:
@mbologna Good catch! I am not sure why this problem did not surface during my tests (maybe because I tested them with an older Salt version). I have pushed two further commits. The first one fixes the issue that you found; the second one fixes a related issue. I am not sure whether the second one has any practical significance (it is quite possible that having such a recursive data structure would lead to the same problem in the serialization code), but I think it was still worth fixing. I also included unit tests for both issues in order to ensure that they are actually fixed. Could you please test it again with the revised patch?
@smarsching Ah, this needs a rebase. Can you fix that? Looks like there might be a lint error, too, but the build has timed out. Once this is rebased and errors are fixed, we can see if @cachedout and @thatch45 can go through this one last time. :)
Force-pushed from e041d82 to 18b8162
@rallytime I rebased the branch and fixed the lint problems.
re-run ubuntu
@smarsching Looks like something isn't quite right here. I am seeing this stacktrace right when the test daemons start up:
Force-pushed from cf4614d to b573b14
@rallytime Thank you very much. This was a bug that I introduced when fixing the problem reported by @mbologna. I added a unit test for detecting the bug and fixed it. Let's see whether the test suite runs successfully now.
Force-pushed from b573b14 to 0da5333
salt/loader.py
Outdated
# We use a double-checked locking scheme in order to avoid taking the lock
# when a proxy object has already been injected.
# In most programming languages, double-checked locking is considered
# unsafe when used without explicit memory barries because one might read
barriers
Thanks, I fixed it.
It's good to see that someone actually reads the comments. 😃
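For readers unfamiliar with the pattern under discussion, the sketch below is a generic, self-contained illustration of double-checked locking, not the loader's actual code. In CPython the unsynchronized first read is safe because of the GIL, which is what the comment in the diff argues.

```python
import threading

_lock = threading.Lock()
_injected = None  # written once, then only read


def get_injected(create):
    global _injected
    # First check without the lock: once the value exists it never changes,
    # so the common case avoids taking the lock entirely.
    value = _injected
    if value is None:
        with _lock:
            # Second check under the lock: another thread may have created
            # the value between our first check and acquiring the lock.
            if _injected is None:
                _injected = create()
            value = _injected
    return value
```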
There was a race condition in the salt loader when injecting global values (e.g. "__pillar__" or "__salt__") into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads or the pillar data seen by a thread might even change spuriously.

There have been earlier attempts to fix this problem (saltstack#27937, saltstack#29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules. These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance. When there is more than one loader, there is more than one thread-local variable and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on saltstack#39670.

This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: the value injected into the modules now is a proxy that internally uses a thread-local variable to decide which object it points to. This means that when loading a module again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change. This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessarily know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach used by the two earlier patches.

Unfortunately, the standard JSON and msgpack serialization code cannot handle proxied objects, so they have to be unwrapped before passing them to that code. The salt.utils.json module has been modified to take care of unwrapping objects that are proxied using the ThreadLocalProxy. The salt.utils.msgpack module has been added and basically provides the same functions as the salt.utils.json module, but for msgpack. Like the json module, it takes care of unwrapping proxies.
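To make the commit message more concrete, here is a minimal, hypothetical sketch of the proxy idea. The real salt/utils/thread_local_proxy.py covers far more of Python's special-method surface; the method names below are illustrative only.

```python
import threading


class ThreadLocalProxy(object):
    """Forwards attribute and item access to a per-thread reference, so the
    same injected name (e.g. __pillar__) can point at different objects in
    different threads."""

    def __init__(self, obj):
        self._tls = threading.local()
        self._tls.ref = obj  # only set for the creating thread

    def set_reference(self, obj):
        # Re-pointing the proxy only changes what the *calling* thread sees;
        # other threads must call set_reference() themselves before reading.
        self._tls.ref = obj

    def __getattr__(self, name):
        # Only called when normal lookup fails, so '_tls' itself is safe.
        return getattr(self._tls.ref, name)

    def __getitem__(self, key):
        return self._tls.ref[key]
```

A loader re-injecting pillar data would then call set_reference() on the existing proxy in its own thread, leaving the view of every other thread untouched.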
Force-pushed from 0da5333 to cca87c6
@rallytime The Jenkins logs look better now, but there are still some errors (that seem to be related to timeouts). You probably know better whether this is expected or not.
re-run py
@smarsching The test processes are starting just fine now, but now I am seeing the following stacktrace through the test suite. Can you take another look?
@rallytime: After working on this problem for almost a day, I am still puzzled. I have verified that the problem is indeed caused by my patch, but I cannot find the reason why it happens. Interestingly, it only appears when running the tests with

The problems when starting the Minion seem to be related to the message

in the

I suspect that the root cause of the problem is that the ZeroMQ serialization code somehow sees a proxy object and does not know how to handle it. But instead of throwing a proper exception right there, the code does something strange which then results in the problems at a later point in time. Debugging the whole thing is not made any easier by the fact that most of the logic is asynchronous, so it is hard to track the (real) call paths.

I clearly need more time to understand the whole ZeroMQ transport logic, but at the moment I do not have that time due to some other urgent projects. I hope that I will be able to investigate these problems in a few weeks (maybe earlier if I can find some spare time).
@DmitryKuzmenko We'd like to get this in for a future 2018.3.x release, and @smarsching has spent quite a bit of time trying to get this finished up. @DmitryKuzmenko would you be willing to take a look at this and help? @mattp- You might be interested in looking at this as well.
@rallytime sure, I can take a stab at seeing if I can figure something out, though I'll admit this isn't my wheelhouse.
@mattp- That's totally fine, I just thought you might be interested in this one. That's a good point about #46641. @smarsching Thoughts on that as well?
First of all, sorry you have not heard anything from me about this PR for such a long time. I simply have not found the time to dig into the problems with ZeroMQ yet, and at the moment it looks like I am not going to find the time for at least another two or three weeks.

Now more to the point: @rallytime I do not think that #46641 addresses the same issue. I only took a short glance at the changes introduced by that PR, so I might have missed something, but it looks to me like it mainly addresses a race condition in the loader code itself (which can occur due to the GIL being released in certain situations). It does not address the issue that the loader may inject objects (like the pillar tree) that should be thread local, but effectively are not.

From an architectural point of view, my PR is a bit of a hack that tries to work around the fact that some of the objects injected into modules are basically global variables that represent a context. Obviously, this is not a very good design and, when starting from scratch, one would certainly try to avoid something like this, but removing that now would mean making changes to virtually every module of Salt (including custom modules which are not even available as part of the Salt distribution itself), so introducing the proxy object looks like a good trade-off. In fact, as far as (pure) Python code is involved, the proxy objects behave (almost) like the original objects, so it does not really matter to the code using them.

Unfortunately, this is not true for Python modules that use C code, because (depending on what the C code does) the concrete type actually does matter. So far I have (at least I believe so) ensured that proxies are "unpacked" before passing them to the serialization code of the JSON and msgpack libraries. However, the same problem seems to apply to the serialization code of ZeroMQ. Unfortunately, I have not figured out yet where the ZeroMQ code gets the proxied objects. Part of the problem is that I am not very familiar with either Tornado or ZeroMQ, so I do not fully understand the data flow yet. Maybe someone with a better understanding of these libraries might have a better chance of figuring out how the proxied objects end up in the ZeroMQ serialization layer.
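The point about C code above is easy to reproduce with the standard library alone. The toy proxy below is purely illustrative and has nothing to do with the PR's actual class; it only shows why a C-based serializer rejects objects that pure-Python callers are happy with.

```python
import json


class DictProxy(object):
    """Toy stand-in for a proxy: good enough for duck-typed Python code,
    but its concrete type is not a dict."""

    def __init__(self, ref):
        self._ref = ref

    def __getitem__(self, key):
        return self._ref[key]


data = DictProxy({'a': 1})

print(data['a'])  # pure-Python callers that duck-type are perfectly happy

try:
    json.dumps(data)  # the (C-accelerated) encoder checks concrete types
except TypeError as exc:
    print(exc)  # e.g. "Object of type DictProxy is not JSON serializable"
```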
@smarsching is there any hope that you are going to look into this? You've done lots of work here, but it seems like the PR is now stale. The problem is still here, and either others would need to discard it all and start again themselves, or you just finish it. The World is waiting for you, hero! 😉 What do you say?
Go Go Jenkins!
re-run docs
This PR has been replaced by #50655.
What does this PR do?
This PR is a new version of #39948, so the description has been copied from there. The only difference from the original PR is that this PR is targeted at the oxygen branch instead of the develop branch.

There was a race condition in the salt loader when injecting global values (e.g. "__pillar__" or "__salt__") into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads or the pillar data seen by a thread might even change spuriously.
There have been earlier attempts to fix this problem (#27937, #29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules.
These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance only. When there is more than one loader, there is more than one thread-local variable and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on #39670.
This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: The value injected into the modules now is a proxy that internally uses a thread-local variable to decide to which object it points. This means that when loading a module
again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change.
This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessarily know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach
used by the two earlier patches.
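The per-thread guarantee described above ultimately rests on the standard threading.local behaviour. The self-contained demo below is not code from this PR; it only shows two threads reading different data through one shared access point, which is the effect the proxy gives injected names like __pillar__.

```python
import threading

_local = threading.local()  # one shared slot, per-thread contents


def set_pillar(pillar):
    _local.pillar = pillar


def get_pillar():
    return _local.pillar


def worker(name, pillar):
    set_pillar(pillar)
    # No matter what other threads install, this thread keeps seeing its own dict.
    assert get_pillar() is pillar
    print(name, get_pillar())


threads = [
    threading.Thread(target=worker, args=('web-thread', {'role': 'web'})),
    threading.Thread(target=worker, args=('db-thread', {'role': 'db'})),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```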
What issues does this PR fix or reference?
This PR fixes problems that surfaced when developing the parallel runners feature (#39670), but is also related to problems reported earlier (#23373). The problems described in #29028 might also be related, though this needs further confirmation.
Previous Behavior
Changes to pillar data in one thread might have influenced pillar data in a different thread. Whether or when this problem actually appeared was unpredictable (due to the nature of a race condition).
New Behavior
Changes to pillar data in one thread will never affect the pillar data in a different thread.
Tests written?
No
Regression Potential
The change to the loader code itself is rather small, but it affects a central component. So if there is a problem, the impact might affect nearly all components.