Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rules engine unsubscribe race condition #731

Merged
merged 3 commits into from
Jun 14, 2021
Merged

Conversation

daxhaw
Copy link
Contributor

@daxhaw daxhaw commented Jun 12, 2021

A race condition was uncovered by timing changes in how the tests are run. The cases are all caused by the test explicitly unsubscribing rules while rules were still executing. This change ensures that a call to unsubscribe rules waits for the rules "graph" to finish before unsubscribing. Also re-enabled test_queries_code.new_registration test that was failing.

Both ProductionGaiaRelaase_gdev and Production_gdev_18_04 have passed so far.

@daxhaw daxhaw requested review from senderista and cevans87 June 12, 2021 20:06
@daxhaw daxhaw changed the title Fix unsubscribe race condition Fix rules engine unsubscribe race condition Jun 12, 2021
Copy link
Contributor

@senderista senderista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few dumb questions about the synchronization logic.


// Detach the commit trigger so that any new events that come in do not try
// to look for rule subscriptions while we are removing them.
gaia::db::set_commit_trigger(nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super-dumb question (because I don't know the rules engine code at all): since it seems to be safe to clear the commit trigger before calling wait_for_rules_to_finish() (since you do it immediately after clearing the commit trigger), why do we need to call wait_for_rules_to_finish() before calling set_commit_trigger()? Is this just a best-effort attempt to execute all existing rule invocations? Is that necessary at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first call to wait_for_rules_to_finish is to drain any invocations and any further rules that run because of those invocations. If I were to clear the commit trigger before doing this, I woudn't execute the entire intended graph of rules that were supposed to be run. After the engine is done with that, then it's a best effort to preclude firing any more rules by setting the commit trigger to nullptr. The second wait_for_rules_to_finish is to account for the case when some other procedural code made a change that caused a rule to fire before I had the chance to null out the commit trigger.


unique_lock lock(m_lock, defer_lock);

// Wait for any scheduled rules to finish executing. Once the rule
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this comment belong with wait_for_rules_to_finish()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll cleanup the comment to be more about the shutdown scenario here.


auto start_shutdown_time = gaia::common::timer_t::get_time_point();

unique_lock lock(m_lock, defer_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this method doesn't seem to use the lock itself, why can't wait_for_rules_to_finish() just use m_lock directly? Oh, I guess shutdown() needs to hold the lock while it sets m_exit since that's not atomic and is protected by m_lock everywhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I want it to set m_exit while the lock is held.

// TODO[GAIAPLAT-1020]: Add a configuration setting to limit the time
// we will wait for all rules to execute.
void rule_thread_pool_t::wait_for_rules_to_finish(std::unique_lock<std::mutex>& lock)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, assuming you no longer passed in the lock via defer_lock, I think you could just use an RAII wrapper like unique_lock if you just added an extra scope to the while loop and set a flag? It doesn't seem crucial that this method exits with the lock held, except for convenience in shutdown() not needing to acquire the lock again to set m_exit? (I'm probably missing something, of course.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this code used to live in the shutdown method and I wanted to preserve its locking behavior exactly.

unique_lock lock(m_lock, defer_lock);
wait_for_rules_to_finish(lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like you don't really need defer_lock at all here since the caller isn't using the lock to protect anything. In that case I assume the lock parameter is purely for the benefit of rule_thread_pool_t::shutdown(), since it uses the lock to protect access to m_exit. I wonder if it would be simpler to just have shutdown() re-acquire the lock after wait_for_rules_to_finish() returns? Then you could just use a unique_lock in wait_for_rules_to_finish() rather than calling lock()/unlock() directly, and you wouldn't have to pass locks around with defer_lock. Does the potential synchronization overhead here justify introducing this complexity (e.g. the implicit contract of wait_for_rules_to_finish() always exiting with the lock held)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't want anyone messing with m_exit when trying to shutdown. The flag and the wait state of the threads need to be checked and set together. Note that the wait_for_rules_to_finish method that takes the lock is a private method only used in this class. There is a public wrapper which takes care of the lock for you for other classes that want to use it (most notably the event_manager_t class). I felt that the complexity was justified for use within the thread pool but did not want this to escape through the class public interface.

@senderista
Copy link
Contributor

Thanks for the clarifications.

@daxhaw
Copy link
Contributor Author

daxhaw commented Jun 14, 2021

Thanks for looking this over, Tobin!

@daxhaw daxhaw merged commit 9b0f362 into master Jun 14, 2021
@daxhaw daxhaw deleted the fix_unsubscribe_race branch June 14, 2021 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants