Fix rules engine unsubscribe race condition #731

daxhaw · 2021-06-12T20:06:36Z

Timing changes in how the tests are run uncovered a race condition. Every failing case was caused by a test explicitly unsubscribing rules while rules were still executing. This change makes a call to unsubscribe rules wait for the rules "graph" to finish executing before the subscriptions are removed. It also re-enables the test_queries_code.new_registration test, which had been failing.

Both ProductionGaiaRelease_gdev and Production_gdev_18_04 have passed so far.

senderista

Looks good, just a few dumb questions about the synchronization logic.

senderista · 2021-06-12T22:16:07Z

production/rules/event_manager/src/event_manager.cpp

+
+        // Detach the commit trigger so that any new events that come in do not try
+        // to look for rule subscriptions while we are removing them.
+        gaia::db::set_commit_trigger(nullptr);


Super-dumb question (because I don't know the rules engine code at all): since it seems to be safe to clear the commit trigger before calling wait_for_rules_to_finish() (since you do it immediately after clearing the commit trigger), why do we need to call wait_for_rules_to_finish() before calling set_commit_trigger()? Is this just a best-effort attempt to execute all existing rule invocations? Is that necessary at this point?

The first call to wait_for_rules_to_finish is to drain any invocations and any further rules that run because of those invocations. If I were to clear the commit trigger before doing this, I wouldn't execute the entire intended graph of rules that were supposed to be run. After the engine is done with that, then it's a best effort to preclude firing any more rules by setting the commit trigger to nullptr. The second wait_for_rules_to_finish is to account for the case when some other procedural code made a change that caused a rule to fire before I had the chance to null out the commit trigger.
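For readers who don't know the rules engine, the ordering described above can be sketched roughly as follows. This is a single-threaded toy, not the real event manager: `toy_engine_t`, `subscribe_rules`, `commit`, `run_rule`, and the `log` member are all hypothetical names made up for illustration. Only the three-step order (drain, detach the commit trigger, drain again) comes from the discussion.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-in for the rules engine; none of this is the gaia API.
struct toy_engine_t
{
    std::function<void(int)> commit_trigger; // called on every commit while attached
    std::queue<int> pending;                 // scheduled rule invocations
    std::vector<std::string> log;            // what actually ran, in order

    void subscribe_rules()
    {
        commit_trigger = [this](int event) { pending.push(event); };
    }

    // A commit only schedules a rule while the trigger is attached.
    void commit(int event)
    {
        if (commit_trigger)
        {
            commit_trigger(event);
        }
    }

    // Each rule commits a follow-up change until event 2, forming a small "graph".
    void run_rule(int event)
    {
        log.push_back("rule " + std::to_string(event));
        if (event < 2)
        {
            commit(event + 1);
        }
    }

    void wait_for_rules_to_finish()
    {
        while (!pending.empty())
        {
            int event = pending.front();
            pending.pop();
            run_rule(event);
        }
    }

    void unsubscribe_rules()
    {
        wait_for_rules_to_finish(); // 1. drain the full graph while the trigger is still attached
        commit_trigger = nullptr;   // 2. stop new commits from scheduling rules
        wait_for_rules_to_finish(); // 3. drain anything scheduled between steps 1 and 2
        log.push_back("unsubscribed");
    }
};
```

In this toy the second wait never finds work because nothing else runs between steps 1 and 2. In the real engine another thread could commit in that window, which is the case the second call covers. Detaching the trigger first would have cut the chain after "rule 0", since the follow-up commits would no longer schedule anything.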

senderista · 2021-06-12T22:22:23Z

production/rules/event_manager/src/rule_thread_pool.cpp

+
+    unique_lock lock(m_lock, defer_lock);
+
+    // Wait for any scheduled rules to finish executing. Once the rule


Doesn't this comment belong with wait_for_rules_to_finish()?

I'll clean up the comment to be more about the shutdown scenario here.

senderista · 2021-06-12T22:36:46Z

production/rules/event_manager/src/rule_thread_pool.cpp

+
+    auto start_shutdown_time = gaia::common::timer_t::get_time_point();
+
+    unique_lock lock(m_lock, defer_lock);


Given that this method doesn't seem to use the lock itself, why can't wait_for_rules_to_finish() just use m_lock directly? Oh, I guess shutdown() needs to hold the lock while it sets m_exit since that's not atomic and is protected by m_lock everywhere else?

Correct. I want it to set m_exit while the lock is held.

senderista · 2021-06-12T22:46:30Z

production/rules/event_manager/src/rule_thread_pool.cpp

+// TODO[GAIAPLAT-1020]: Add a configuration setting to limit the time
+// we will wait for all rules to execute.
+void rule_thread_pool_t::wait_for_rules_to_finish(std::unique_lock<std::mutex>& lock)
+{


So, assuming you no longer passed in the lock via defer_lock, I think you could just use an RAII wrapper like unique_lock if you just added an extra scope to the while loop and set a flag? It doesn't seem crucial that this method exits with the lock held, except for convenience in shutdown() not needing to acquire the lock again to set m_exit? (I'm probably missing something, of course.)

Right, this code used to live in the shutdown method and I wanted to preserve its locking behavior exactly.
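For comparison, the RAII alternative suggested in this thread might look something like the sketch below. It is not the code in this PR: `raii_pool_t`, `rule_started`, `rule_finished`, `m_pending`, and the 1 ms polling interval are hypothetical. The wait takes the lock in an inner scope on each pass and records the result in a flag, and shutdown() re-acquires the lock afterwards to set m_exit.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical miniature of the alternative; not the real rule_thread_pool_t.
class raii_pool_t
{
public:
    void rule_started()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        ++m_pending;
    }

    void rule_finished()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        --m_pending;
    }

    // The lock is held only inside the inner scope; the method exits unlocked.
    void wait_for_rules_to_finish()
    {
        bool idle = false;
        while (!idle)
        {
            {
                std::lock_guard<std::mutex> guard(m_lock);
                idle = (m_pending == 0);
            }
            if (!idle)
            {
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
    }

    void shutdown()
    {
        wait_for_rules_to_finish();
        // Window: the lock is not held here, so a rule could be scheduled
        // before m_exit is set below.
        std::lock_guard<std::mutex> guard(m_lock);
        m_exit = true;
    }

    bool exited()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_exit;
    }

    int pending()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_pending;
    }

private:
    std::mutex m_lock;
    int m_pending = 0;
    bool m_exit = false;
};
```

The trade-off is visible in shutdown(): the code is simpler and no lock is passed around, but the idle check and the m_exit write are no longer one critical section.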

senderista · 2021-06-12T22:48:03Z

production/rules/event_manager/src/rule_thread_pool.cpp

    unique_lock lock(m_lock, defer_lock);
+    wait_for_rules_to_finish(lock);


It looks to me like you don't really need defer_lock at all here since the caller isn't using the lock to protect anything. In that case I assume the lock parameter is purely for the benefit of rule_thread_pool_t::shutdown(), since it uses the lock to protect access to m_exit. I wonder if it would be simpler to just have shutdown() re-acquire the lock after wait_for_rules_to_finish() returns? Then you could just use a unique_lock in wait_for_rules_to_finish() rather than calling lock()/unlock() directly, and you wouldn't have to pass locks around with defer_lock. Does the potential synchronization overhead here justify introducing this complexity (e.g. the implicit contract of wait_for_rules_to_finish() always exiting with the lock held)?

I really don't want anyone messing with m_exit when trying to shut down. The flag and the wait state of the threads need to be checked and set together. Note that the wait_for_rules_to_finish method that takes the lock is a private method only used in this class. There is a public wrapper which takes care of the lock for you for other classes that want to use it (most notably the event_manager_t class). I felt that the complexity was justified for use within the thread pool but did not want this to escape through the class public interface.
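A minimal sketch of the shape described here, under stated assumptions: `locked_pool_t`, `enqueue`, `worker`, `m_busy`, and the two condition variables are hypothetical, and the condition-variable wait is swapped in for brevity where the actual method reportedly calls lock()/unlock() directly. What it tries to show is the contract only: the private overload returns with the lock held so shutdown() can set m_exit in the same critical section, while the public wrapper hides the lock from callers.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical miniature; not the real rule_thread_pool_t.
class locked_pool_t
{
public:
    explicit locked_pool_t(std::size_t count)
    {
        for (std::size_t i = 0; i < count; ++i)
        {
            m_threads.emplace_back([this] { worker(); });
        }
    }

    ~locked_pool_t() { shutdown(); }

    void enqueue(std::function<void()> rule)
    {
        {
            std::lock_guard<std::mutex> guard(m_lock);
            m_queue.push(std::move(rule));
        }
        m_has_work.notify_one();
    }

    // Public wrapper: callers outside the class never see the lock.
    void wait_for_rules_to_finish()
    {
        std::unique_lock<std::mutex> lock(m_lock, std::defer_lock);
        wait_for_rules_to_finish(lock);
    } // lock released here

    void shutdown()
    {
        std::unique_lock<std::mutex> lock(m_lock, std::defer_lock);
        wait_for_rules_to_finish(lock);
        m_exit = true; // still inside the critical section that observed "idle"
        lock.unlock();
        m_has_work.notify_all();
        for (auto& thread : m_threads)
        {
            if (thread.joinable())
            {
                thread.join();
            }
        }
    }

private:
    // Implicit contract: acquires the lock and returns with it held.
    void wait_for_rules_to_finish(std::unique_lock<std::mutex>& lock)
    {
        lock.lock();
        m_idle.wait(lock, [this] { return m_queue.empty() && m_busy == 0; });
    }

    void worker()
    {
        std::unique_lock<std::mutex> lock(m_lock);
        while (true)
        {
            m_has_work.wait(lock, [this] { return m_exit || !m_queue.empty(); });
            if (m_exit)
            {
                return;
            }
            auto rule = std::move(m_queue.front());
            m_queue.pop();
            ++m_busy;
            lock.unlock();
            rule(); // a rule may enqueue further rules
            lock.lock();
            --m_busy;
            if (m_queue.empty() && m_busy == 0)
            {
                m_idle.notify_all();
            }
        }
    }

    std::mutex m_lock;
    std::condition_variable m_has_work;
    std::condition_variable m_idle;
    std::queue<std::function<void()>> m_queue;
    int m_busy = 0;
    bool m_exit = false;
    std::vector<std::thread> m_threads;
};
```

A caller such as an event manager would only ever use the zero-argument overload. Because a running rule counts as busy until it returns, a rule that enqueues a follow-up rule keeps the wait from completing until the follow-up has run too.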

senderista · 2021-06-13T21:02:09Z

Thanks for the clarifications.

daxhaw · 2021-06-14T16:28:12Z

Thanks for looking this over, Tobin!

daxhaw added 2 commits June 9, 2021 22:57

disable test_queries_code.new_registration

8ca7fe6

Fix race condition when rules were unsubscribed while the rules engine was active

7f68944

daxhaw requested review from senderista and cevans87 June 12, 2021 20:06

daxhaw changed the title ~~Fix unsubscribe race condition~~ Fix rules engine unsubscribe race condition Jun 12, 2021

senderista approved these changes Jun 12, 2021

Clarify comment in shutdown

c4b771e

daxhaw merged commit 9b0f362 into master Jun 14, 2021

daxhaw deleted the fix_unsubscribe_race branch June 14, 2021 16:28

daxhaw mentioned this pull request Jun 14, 2021

Actually re-enable the new_registration test #734

Merged
