Make LazyLoader thread safe #46641
Conversation
Hi @skizunov I'm fairly confident this bug is as you describe and has been present for some time. I'm a little concerned, though, that the tests seem to indicate that there are cases where the master can't be contacted. I'm going to restart the tests, so let's see what happens on the next run.
Go Go Jenkins!
Even when `multiprocessing` is set to `True`, there is a case where multiple threads in the same process attempt to use the same LazyLoader object. When using the reactor and reacting to an event that will call a runner, `salt.utils.reactor.ReactWrap.runner` will invoke `self.pool.fire_async(self.client_cache['runner'].low, args=(fun, kwargs))` potentially multiple times, each time using a thread from `salt.utils.process.ThreadPool`. Each thread will invoke `salt.client.mixins.SyncClientMixin.low`, which in turn will invoke its `_low` and call `salt.utils.job.store_job`. `salt.utils.job.store_job` will invoke the LazyLoader object for the returner. Since the LazyLoader object is not thread safe, occasional failures will occur, which reduces the reliability of the overall system.

Let's examine why a function such as `LazyLoader._load` is not thread safe. Any time the GIL is released, another thread is allowed to run. Various types of operations can release the GIL, but in this particular case they are the file operations that happen in both `refresh_file_mapping` and `_load_module`. Note that adding `print` statements also releases the GIL (and makes the problem more frequent). In the failure case, `refresh_file_mapping` releases the GIL, another thread loads the module, and when the original thread runs again it fails when `_inner_load` runs the second time (after `refresh_file_mapping`). The failure occurs because the module is already in `self.loaded_files`, so it is skipped over and `_inner_load` returns `False` even though the required `key` is already in `self._dict`. Since adding `print` statements or other logic also adds points in the code that allow thread switches, the most robust solution to such a problem is to use a mutex (as opposed to rechecking whether `key` now appears in `self._dict` at certain checkpoints). This solution adds such a mutex and uses it in key places to ensure integrity.

Signed-off-by: Sergey Kizunov <[email protected]>
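To make the failure mode concrete, here is a minimal, Salt-independent sketch of the same check-then-act race; the `NaiveLoader` class and its method bodies are illustrative stand-ins, not Salt's actual loader code:

```python
import threading
import time


class NaiveLoader:
    """Illustrative stand-in for a lazy module loader (not Salt's real LazyLoader)."""

    def __init__(self):
        self.loaded_files = set()   # modules we have already tried to load
        self._dict = {}             # "module.function" -> callable

    def _load_module(self, name):
        # Stands in for the file I/O in refresh_file_mapping/_load_module,
        # which can release the GIL and let another thread run mid-load.
        time.sleep(0.001)
        self._dict[name + '.returner'] = lambda load: None

    def _load(self, key):
        name = key.split('.', 1)[0]
        if name in self.loaded_files:
            # BUG: the module was claimed by another thread that may still be
            # inside _load_module, so we report failure even though `key`
            # is (or is about to be) present in self._dict.
            return False
        self.loaded_files.add(name)
        self._load_module(name)
        return key in self._dict


loader = NaiveLoader()
results = []

def worker():
    results.append(loader._load('local_cache.returner'))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # with unlucky scheduling, one entry is False
```

The same interleaving inside the real loader is what makes the returner lookup in `store_job` fail intermittently.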
@cachedout The latest run looks better. The current failures show up on PR runs from time to time. Do you want to review this again?
@DmitryKuzmenko It looks like this supersedes and fixes what's referenced in #45782. Can you confirm? I'll close #48598 if so. Thanks.
What does this PR do?
Even when `multiprocessing` is set to `True`, there is a case where multiple threads in the same process attempt to use the same LazyLoader object. When using the reactor and reacting to an event that will call a runner, `salt.utils.reactor.ReactWrap.runner` will invoke `self.pool.fire_async(self.client_cache['runner'].low, args=(fun, kwargs))` potentially multiple times, each time using a thread from `salt.utils.process.ThreadPool`. Each thread will invoke `salt.client.mixins.SyncClientMixin.low`, which in turn will invoke its `_low` and call `salt.utils.job.store_job`. `salt.utils.job.store_job` will invoke the LazyLoader object for the returner. Since the LazyLoader object is not thread safe, occasional failures will occur, which will reduce the reliability of the overall system.
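The shape of this concurrency can be pictured with a small, hypothetical sketch; the `store_job` function and returner name below are simplified stand-ins, and `ThreadPoolExecutor` merely stands in for `salt.utils.process.ThreadPool`:

```python
from concurrent.futures import ThreadPoolExecutor

# A single loader-like mapping shared by every worker thread, just as the
# real returner LazyLoader is shared when fire_async dispatches to the pool.
returners = {'local_cache.returner': lambda load: None}

def store_job(load):
    # Rough equivalent of salt.utils.job.store_job: look up the configured
    # returner in the shared loader and hand it the job payload.
    returners['local_cache.returner'](load)

# ReactWrap.runner effectively calls pool.fire_async(...) once per reactor
# event; a burst of events means several threads hit the loader at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    for jid in range(4):
        pool.submit(store_job, {'jid': jid})
```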
Let's examine why a function such as `LazyLoader._load` is not thread safe. Any time the GIL is released, another thread is allowed to run. Various types of operations can release the GIL, but in this particular case they are the file operations that happen in both `refresh_file_mapping` and `_load_module`. Note that adding `print` statements also releases the GIL (and makes the problem more frequent). In the failure case, `refresh_file_mapping` releases the GIL, another thread loads the module, and when the original thread runs again it fails when `_inner_load` runs the second time (after `refresh_file_mapping`). The failure occurs because the module is already in `self.loaded_files`, so it is skipped over and `_inner_load` returns `False` even though the required `key` is already in `self._dict`. Since adding `print` statements or other logic also adds points in the code that allow thread switches, the most robust solution to such a problem is to use a mutex (as opposed to rechecking whether `key` now appears in `self._dict` at certain checkpoints).

This solution adds such a mutex and uses it in key places to ensure integrity.
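A minimal sketch of the mutex approach, assuming a lock attribute (here `self._lock`) and heavily simplified method bodies; this is only the shape of the fix, not the actual patch:

```python
import threading


class SafeLoader:
    """Sketch of a lazy loader whose load path is guarded by a mutex."""

    def __init__(self):
        self.loaded_files = set()
        self._dict = {}
        # A reentrant lock lets a method that holds the lock call other
        # locked methods on the same instance without deadlocking.
        self._lock = threading.RLock()

    def _load_module(self, name):
        # ... file I/O and import work would happen here ...
        self._dict[name + '.returner'] = lambda load: None

    def _load(self, key):
        name = key.split('.', 1)[0]
        with self._lock:
            # Re-check under the lock: another thread may have finished
            # loading this module while we were waiting to acquire it.
            if key in self._dict:
                return True
            if name in self.loaded_files:
                return key in self._dict
            self.loaded_files.add(name)
            self._load_module(name)
            return key in self._dict

    def __getitem__(self, key):
        if not self._load(key):
            raise KeyError(key)
        return self._dict[key]
```

Holding one lock across the check of `self.loaded_files`, the module load, and the re-check of `self._dict` removes the window in which a second thread can observe the half-finished state, which is the integrity referred to above.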
Tests written?
No
Commits signed with GPG?
Yes