Make LazyLoader thread safe #46641
Conversation
Hi @skizunov I'm fairly confident this bug is as you describe and has been present for some time. I'm a little concerned, though, that the tests seem to indicate that there are cases where the master can't be contacted. I'm going to restart the tests, so let's see what happens on the next run.
Go Go Jenkins!
Even when `multiprocessing` is set to `True`, there is a case where multiple threads in the same process attempt to use the same LazyLoader object. When using the reactor and reacting to an event that will call a runner, `salt.utils.reactor.ReactWrap.runner` will invoke `self.pool.fire_async(self.client_cache['runner'].low, args=(fun, kwargs))` potentially multiple times, each time using a thread from `salt.utils.process.ThreadPool`. Each thread will invoke `salt.client.mixins.SyncClientMixin.low`, which in turn will invoke its `_low` and call `salt.utils.job.store_job`. `salt.utils.job.store_job` will invoke the LazyLoader object for the returner. Since the LazyLoader object is not thread safe, occasional failures will occur, which reduces the reliability of the overall system.

Let's examine why a function such as `LazyLoader._load` is not thread safe. Any time the GIL is released, another thread is allowed to run. Various types of operations can release the GIL, but in this particular case they are the file operations that happen in both `refresh_file_mapping` and `_load_module`. Note that adding `print` statements also releases the GIL (and makes the problem more frequent). In the failure case, `refresh_file_mapping` releases the GIL, another thread loads the module, and when the original thread runs again it fails when `_inner_load` runs the second time (after `refresh_file_mapping`). The failure occurs because the module is already in `self.loaded_files`, so it is skipped over and `_inner_load` returns `False` even though the required `key` is already in `self._dict`. Since adding `print` statements or other logic also adds points in the code that allow thread switches, the most robust solution to such a problem is to use a mutex (as opposed to rechecking whether `key` now appears in `self._dict` at certain checkpoints). This solution adds such a mutex and uses it in key places to ensure integrity.

Signed-off-by: Sergey Kizunov <[email protected]>
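To make the failure mode concrete, here is a minimal, Salt-independent sketch of the same check-then-act race; the `NaiveLoader` class and its method bodies are illustrative stand-ins, not Salt's actual loader code:

```python
import threading
import time


class NaiveLoader:
    """Illustrative stand-in for a lazy module loader (not Salt's real LazyLoader)."""

    def __init__(self):
        self.loaded_files = set()   # modules we have already tried to load
        self._dict = {}             # "module.function" -> callable

    def _load_module(self, name):
        # Stands in for the file I/O in refresh_file_mapping/_load_module,
        # which can release the GIL and let another thread run mid-load.
        time.sleep(0.001)
        self._dict[name + '.returner'] = lambda load: None

    def _load(self, key):
        name = key.split('.', 1)[0]
        if name in self.loaded_files:
            # BUG: the module was claimed by another thread that may still be
            # inside _load_module, so we report failure even though `key`
            # is (or is about to be) present in self._dict.
            return False
        self.loaded_files.add(name)
        self._load_module(name)
        return key in self._dict


loader = NaiveLoader()
results = []

def worker():
    results.append(loader._load('local_cache.returner'))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # with unlucky scheduling, one entry is False
```

The same interleaving inside the real loader is what makes the returner lookup in `store_job` fail intermittently.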
@cachedout The latest run looks better. The current failures show up on PR runs from time to time. Do you want to review this again?
@DmitryKuzmenko It looks like this supersedes and fixes what's referenced in #45782. Can you confirm? I'll close #48598 if so. Thanks.
What does this PR do?
Even when `multiprocessing` is set to `True`, there is a case where multiple threads in the same process attempt to use the same LazyLoader object. When using the reactor and reacting to an event that will call a runner, `salt.utils.reactor.ReactWrap.runner` will invoke `self.pool.fire_async(self.client_cache['runner'].low, args=(fun, kwargs))` potentially multiple times, each time using a thread from `salt.utils.process.ThreadPool`. Each thread will invoke `salt.client.mixins.SyncClientMixin.low`, which in turn will invoke its `_low` and call `salt.utils.job.store_job`. `salt.utils.job.store_job` will invoke the LazyLoader object for the returner. Since the LazyLoader object is not thread safe, occasional failures will occur, which will reduce the reliability of the overall system.
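The shape of this concurrency can be pictured with a small, hypothetical sketch; the `store_job` function and returner name below are simplified stand-ins, and `ThreadPoolExecutor` merely stands in for `salt.utils.process.ThreadPool`:

```python
from concurrent.futures import ThreadPoolExecutor

# A single loader-like mapping shared by every worker thread, just as the
# real returner LazyLoader is shared when fire_async dispatches to the pool.
returners = {'local_cache.returner': lambda load: None}

def store_job(load):
    # Rough equivalent of salt.utils.job.store_job: look up the configured
    # returner in the shared loader and hand it the job payload.
    returners['local_cache.returner'](load)

# ReactWrap.runner effectively calls pool.fire_async(...) once per reactor
# event; a burst of events means several threads hit the loader at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    for jid in range(4):
        pool.submit(store_job, {'jid': jid})
```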
Let's examine why a function such as `LazyLoader._load` is not thread safe. Any time the GIL is released, another thread is allowed to run. Various types of operations can release the GIL, but in this particular case they are the file operations that happen in both `refresh_file_mapping` and `_load_module`. Note that adding `print` statements also releases the GIL (and makes the problem more frequent). In the failure case, `refresh_file_mapping` releases the GIL, another thread loads the module, and when the original thread runs again it fails when `_inner_load` runs the second time (after `refresh_file_mapping`). The failure occurs because the module is already in `self.loaded_files`, so it is skipped over and `_inner_load` returns `False` even though the required `key` is already in `self._dict`. Since adding `print` statements or other logic also adds points in the code that allow thread switches, the most robust solution to such a problem is to use a mutex (as opposed to rechecking whether `key` now appears in `self._dict` at certain checkpoints).

This solution adds such a mutex and uses it in key places to ensure integrity.
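A minimal sketch of the mutex approach, assuming a lock attribute (here `self._lock`) and heavily simplified method bodies; this is only the shape of the fix, not the actual patch:

```python
import threading


class SafeLoader:
    """Sketch of a lazy loader whose load path is guarded by a mutex."""

    def __init__(self):
        self.loaded_files = set()
        self._dict = {}
        # A reentrant lock lets a method that holds the lock call other
        # locked methods on the same instance without deadlocking.
        self._lock = threading.RLock()

    def _load_module(self, name):
        # ... file I/O and import work would happen here ...
        self._dict[name + '.returner'] = lambda load: None

    def _load(self, key):
        name = key.split('.', 1)[0]
        with self._lock:
            # Re-check under the lock: another thread may have finished
            # loading this module while we were waiting to acquire it.
            if key in self._dict:
                return True
            if name in self.loaded_files:
                return key in self._dict
            self.loaded_files.add(name)
            self._load_module(name)
            return key in self._dict

    def __getitem__(self, key):
        if not self._load(key):
            raise KeyError(key)
        return self._dict[key]
```

Holding one lock across the check of `self.loaded_files`, the module load, and the re-check of `self._dict` removes the window in which a second thread can observe the half-finished state, which is the integrity referred to above.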
Tests written?
No
Commits signed with GPG?
Yes