
Make LazyLoader thread safe #46641

Merged
rallytime merged 1 commit into saltstack:2017.7 on Apr 3, 2018

Conversation

skizunov
Contributor

What does this PR do?

Even when `multiprocessing` is set to `True`, there is a case where
multiple threads in the same process attempt to use the same LazyLoader
object. When using the reactor and reacting to an event that calls a
runner, `salt.utils.reactor.ReactWrap.runner` will invoke
`self.pool.fire_async(self.client_cache['runner'].low, args=(fun, kwargs))`
potentially multiple times, each time using a thread from
`salt.utils.process.ThreadPool`. Each thread will invoke
`salt.client.mixins.SyncClientMixin.low`, which in turn will invoke its
`_low` and call `salt.utils.job.store_job`. `salt.utils.job.store_job`
will invoke the LazyLoader object for the returner.
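
To make the shape of that call pattern concrete, here is a minimal
stand-alone sketch, using the stdlib `ThreadPoolExecutor` and a
hypothetical `FakeLoader` in place of Salt's `ThreadPool` and LazyLoader,
of many worker threads resolving returners through one shared loader
instance:

```python
from concurrent.futures import ThreadPoolExecutor

class FakeLoader(object):
    """Hypothetical stand-in for the shared LazyLoader instance."""
    def __getitem__(self, key):
        # In Salt this lookup would trigger the loader's module loading.
        return lambda *args, **kwargs: None

loader = FakeLoader()  # one instance, shared by every worker thread

def store_job(returner):
    # Analogous to salt.utils.job.store_job resolving '<returner>.returner'.
    loader['{0}.returner'.format(returner)]()

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each reactor event fires another async task; all hit the same loader.
    for _ in range(8):
        pool.submit(store_job, 'local_cache')
```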

Since the LazyLoader object is not thread safe, occasional failures
occur, reducing the reliability of the overall system.

Let's examine why a function such as `LazyLoader._load` is not thread safe.
Any time the GIL is released, another thread is allowed to run. Various
kinds of operations can release the GIL; in this particular case they are
the file operations that happen in both `refresh_file_mapping` and
`_load_module`. (Note that adding `print` statements also releases the
GIL and makes the problem more frequent.) In the failure case,
`refresh_file_mapping` releases the GIL, another thread loads the module,
and when the original thread runs again it fails on the second run of
`_inner_load` (after `refresh_file_mapping`): the module is already in
`self.loaded_files`, so it is skipped over and `_inner_load` returns
`False` even though the required `key` is already in `self._dict`. Since
adding `print` statements or other logic also adds points in the code
where thread switches can occur, the most robust solution to such a
problem is a mutex (as opposed to rechecking whether `key` now appears
in `self._dict` at certain checkpoints).
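
A minimal, self-contained sketch of that failure shape (simplified,
hypothetical names rather than the actual LazyLoader code; the
`time.sleep` calls stand in for GIL-releasing file I/O):

```python
import threading
import time

class RacyLoader(object):
    """Toy model of the check-then-act race; not the real LazyLoader."""
    def __init__(self):
        self._dict = {}              # successfully loaded functions
        self.loaded_files = set()    # module names already processed

    def refresh_file_mapping(self):
        time.sleep(0.001)            # file I/O here releases the GIL

    def _load_module(self, name):
        time.sleep(0.001)            # more GIL-releasing file I/O
        self._dict[name + '.run'] = lambda: 'ok'

    def _inner_load(self, name):
        if name in self.loaded_files:
            return False             # skipped: already (being) handled
        self.loaded_files.add(name)
        self._load_module(name)
        return True

    def _load(self, key):
        name = key.split('.', 1)[0]
        if self._inner_load(name):
            return True
        self.refresh_file_mapping()  # GIL released: another thread loads
        # The second attempt still sees name in loaded_files and returns
        # False, even though key is in self._dict by now.
        return self._inner_load(name)

loader = RacyLoader()
results = []
threads = [threading.Thread(target=lambda: results.append(loader._load('mymod.run')))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # often contains False although 'mymod.run' is in loader._dict
```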

This solution adds such a mutex and uses it in key places to ensure
integrity.
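
Under the same toy model, the fix looks roughly like the following: a
single reentrant lock held across the whole lookup/load sequence (an
`RLock` rather than a plain `Lock`, in case the load path re-enters the
loader). This is a sketch of the approach, not the patch itself:

```python
import threading
import time

class SafeLoader(object):
    """Same toy model with the mutex applied; a sketch, not the real patch."""
    def __init__(self):
        self._dict = {}
        self.loaded_files = set()
        self._lock = threading.RLock()  # reentrant, in case _load nests

    def _load_module(self, name):
        time.sleep(0.001)               # GIL-releasing file I/O, now harmless
        self._dict[name + '.run'] = lambda: 'ok'

    def _load(self, key):
        name = key.split('.', 1)[0]
        # No other thread can run between the loaded_files check and the
        # final _dict lookup, so the "skipped but key already present"
        # failure can no longer happen.
        with self._lock:
            if name not in self.loaded_files:
                self.loaded_files.add(name)
                self._load_module(name)
            return key in self._dict
```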

Tests written?

No

Commits signed with GPG?

Yes

skizunov requested a review from a team as a code owner on March 21, 2018 at 14:57
@cachedout
Contributor

Hi @skizunov

I'm fairly confident this bug is as you describe and has been present for some time.

I'm a little concerned, though, that the tests seem to indicate that there are cases where the master can't be contacted. I'm going to restart the tests, so let's see what happens on the next run.

@cachedout
Contributor

Go Go Jenkins!

@rallytime
Contributor

@cachedout The latest run looks better. Those current failures happen sometimes on the PR runs. Do you want to review this again?

rallytime merged commit 37f6d2d into saltstack:2017.7 on Apr 3, 2018
@mattp-
Contributor

mattp- commented Nov 12, 2018

@DmitryKuzmenko It looks like this supersedes and fixes what's referenced in #45782. Can you confirm? I'll close #48598 if so. Thanks

@DmitryKuzmenko
Contributor

@mattp- I don't think so. The core idea of #45782 is using a thread-local store to keep the injected values, while this one just takes care of the race conditions.
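
For contrast, a minimal sketch of the two ideas (hypothetical names, not
code from either PR): this PR serializes access to shared state with a
mutex, while the #45782 approach gives each thread its own copy of the
injected values via thread-local storage:

```python
import threading

# This PR's approach: one shared mapping, guarded by a mutex.
shared_injected = {}
injected_lock = threading.RLock()

def set_shared(key, value):
    with injected_lock:
        shared_injected[key] = value

# The #45782 idea: per-thread storage, so each thread sees only the
# values it injected itself and no lock is needed for them.
thread_injected = threading.local()

def set_local(key, value):
    if not hasattr(thread_injected, 'store'):
        thread_injected.store = {}
    thread_injected.store[key] = value
```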
