Replication deadlock when replica times out #3649
Comments
Thanks @andydunstall, I confirm I can easily reproduce it locally.
Seems that the callback below does not proceed. Checking whether it's the join or the cleanup callback.
So it seems the following deadlock happens:
and then:
which is stuck on RecordChannel not being emptied. But again, I repeat: our inter-component dependencies are very complicated, and using mutexes leads to unexpected outcomes.
The full stacktrace is here:
#3171 is also related; it's why we added a mutex.
Possibly reversing socket shutdown and cancel might solve this issue:
We're seeing a Dragonfly replication deadlock on our test suite, which seems to happen when a replica times out.
I can reproduce the replication deadlock locally (or at least what I'm assuming is the same issue).
Running two Dragonfly processes, but intentionally setting the master process replication_timeout to a very small value (100ms) to force the replica to time out. Then start populating the master with 5GB.
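A minimal sketch of these two steps, assuming the flag is named replication_timeout and takes milliseconds (as described above), and using rough key counts and value sizes to reach about 5GB; these are not the original commands:

```sh
# Start the master with an intentionally tiny replication timeout.
# Assumption: the flag is --replication_timeout and is expressed in milliseconds.
./dragonfly --port 6379 --replication_timeout=100

# Populate roughly 5GB of data: ~5M keys with ~1KB values (counts/sizes are illustrative).
redis-cli -p 6379 debug populate 5000000 key 1024
```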
Then, while it's populating, configure the replica:
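For example, something like the following (ports and the master address are placeholders):

```sh
# Start the replica instance on a second port.
./dragonfly --port 6380

# Point it at the master while the populate is still running.
redis-cli -p 6380 replicaof 127.0.0.1 6379
```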
The replica will time out as expected, but then the master is partially deadlocked: INFO hangs even after you shut down the replica, and attempting to add another replica hangs because the master is unresponsive.
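For instance, a basic status check never returns against the deadlocked master (port is a placeholder):

```sh
redis-cli -p 6379 info replication   # hangs indefinitely
```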
Running v1.22.0 on an AWS t4g.medium. I added some logs and I think it's a deadlock attempting to lock DflyCmd::mu_, but that's as far as I got.

Edit: poking around a bit out of curiosity, DflyCmd::BreakStalledFlowsInShard never releases mu_ as it blocks on replica_ptr->Cancel().