Replica hangs in full sync #3679
How do you know it hangs? Maybe it's the master that hangs? |
I don't know - just it never completes the sync
Yep sure, will add. (Edit: actually it's only the 'nightly' suite that fails, which only runs once a day, so I've updated it to not delete the datastore on failure so we can inspect it before cleaning up manually) |
how do you check if sync was completed? based on the "info" command? |
yeah, once the replica is connected to the expected master and 'sync in progress' is false |
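Since the check is "connected to the expected master and 'sync in progress' is false", here is a minimal sketch of that health check. It assumes Redis-style `INFO replication` field names (`master_host`, `master_link_status`, `master_sync_in_progress`); Dragonfly's exact keys may differ:

```python
# Sketch: decide whether a replica finished full sync from its INFO output.
# Field names follow the Redis-style "INFO replication" format; this is an
# assumption about Dragonfly's output, not confirmed from its source.
def parse_info(text: str) -> dict:
    """Parse 'key:value' lines from an INFO section into a dict."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            out[key] = value
    return out

def sync_completed(info_text: str, expected_master: str) -> bool:
    """True once the replica points at the expected master and is out of full sync."""
    info = parse_info(info_text)
    return (
        info.get("role") == "slave"
        and info.get("master_host") == expected_master
        and info.get("master_link_status") == "up"
        and info.get("master_sync_in_progress") == "0"
    )

sample = """\
role:slave
master_host:10.0.37.98
master_link_status:up
master_sync_in_progress:1
"""
print(sync_completed(sample, "10.0.37.98"))  # False: still in full sync
```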
This happened again (twice actually) and I captured it. The replica shows sync in progress for over a day:
The datastore replica was stuck in this state for hours; again the latest replica logs show:
|
Master shows 2 replicas.
role:master
connected_slaves:2
slave0:ip=10.0.34.241,port=6385,state=full_sync,lag=0
slave1:ip=10.0.41.27,port=6385,state=stable_sync,lag=0
master_replid:8040121f00740ce4f57f695be5a82ce557cd56e4
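The master-side output above makes the stuck replica visible in the `slaveN` lines (`state=full_sync` vs `state=stable_sync`). A small sketch that picks out the stuck ones from that output; the `slaveN` line format matches the INFO shown in this issue:

```python
# Sketch: list replicas still in full_sync from the master's INFO replication
# output, using the slaveN line format shown above.
def stuck_replicas(info_text: str) -> list[str]:
    """Return 'ip:port' for each slaveN entry whose state is full_sync."""
    stuck = []
    for line in info_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key.startswith("slave") and key[5:].isdigit():
            fields = dict(item.split("=", 1) for item in value.split(","))
            if fields.get("state") == "full_sync":
                stuck.append(f"{fields['ip']}:{fields['port']}")
    return stuck

master_info = """\
role:master
connected_slaves:2
slave0:ip=10.0.34.241,port=6385,state=full_sync,lag=0
slave1:ip=10.0.41.27,port=6385,state=stable_sync,lag=0
"""
print(stuck_replicas(master_info))  # ['10.0.34.241:6385']
```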
Do you happen to know which one got stuck?
On Thu, Sep 19, 2024 at 6:47 PM Andy Dunstall wrote:
> This happened again and captured INFO this time
> info.zip <https://github.com/user-attachments/files/17062289/info.zip>
> The datastore replica was stuck in this state for hours, again the latest replica logs shows:
> I20240918 12:30:41.798735 1800 replica.cc:566] Started full sync with 10.0.37.98:9999
> I20240918 12:34:24.742923 1799 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
> I20240918 12:34:27.340220 1800 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
> I20240918 12:34:32.594964 1801 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
> I20240918 12:34:38.525820 1795 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
> I20240918 12:35:11.956713 1796 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
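The replica log lines above show each replication flow reading `RDB_OPCODE_FULLSYNC_END`, yet the replica never leaves full sync. A quick sketch for scanning a log for that pattern; the stable-sync marker string is an assumption (I don't know Dragonfly's exact log message for the transition), and the log format matches the glog lines quoted above:

```python
# Sketch: flag a replica log where full sync started but no stable-sync
# marker ever appeared. "stable sync" as a marker string is a hypothetical
# choice, not a confirmed Dragonfly log message.
def full_sync_stalled(log_text: str, stable_marker: str = "stable sync") -> bool:
    """True if a full sync started but the stable-sync marker never appeared."""
    started = False
    for line in log_text.splitlines():
        if "Started full sync" in line:
            started = True
        if stable_marker in line:
            return False  # sync completed
    return started

log = """\
I20240918 12:30:41.798735 1800 replica.cc:566] Started full sync with 10.0.37.98:9999
I20240918 12:34:24.742923 1799 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:34:27.340220 1800 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
"""
# How many flows reported end-of-full-sync (useful to compare against shard count):
flows_done = sum("RDB_OPCODE_FULLSYNC_END" in line for line in log.splitlines())
print(full_sync_stalled(log), flows_done)  # True 2
```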
|
10.0.34.241. It was in staging, so we have full instance logs, metrics, state, etc. if it's useful |
yeah, they are useful. Please attach here. |
Looking through control plane logs to get the exact sequence of events (as this only happens in one particular test). Focusing on datastore dst_esfncx612, it starts with nodes:
Then it is updated to:
With steps:
Then node A syncs with B, but node C hangs. Therefore
Are both of those cases ok? Rather than upload full logs and metrics here (which contain internal info), it's probably easiest to download them with dfcloud? Quickly comparing logs of successful vs failed cases, the master (node B above) always seems to log |
We do not support replicating a replica (Node C is configured as a replica of B, before B is configured as a master). |
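Given the constraint above (C must not attach to B while B is still a replica), the control-plane fix is an ordering one: promote the new master first, then re-point the remaining replica. A sketch of that ordering; node names, the port, and the step representation are illustrative, not the actual control-plane API:

```python
# Sketch of the safe failover ordering implied by the comment above: promote
# the new master before pointing any replica at it, so a replica never
# attaches to a node that is itself still a replica. Purely illustrative.
def failover_steps(new_master: str, other_replicas: list[str], port: int = 6385) -> list[tuple[str, str]]:
    """Return (node, command) pairs in a safe order."""
    steps = [(new_master, "REPLICAOF NO ONE")]  # promote B first
    for node in other_replicas:
        steps.append((node, f"REPLICAOF {new_master} {port}"))  # then re-point C
    return steps

print(failover_steps("B", ["C"]))
# [('B', 'REPLICAOF NO ONE'), ('C', 'REPLICAOF B 6385')]
```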
Ah ok, thanks - will update the control plane (FWIW the test often still succeeds following the above steps, i.e. replicating a replica) |
Hi @andydunstall |
@chakaz I think we should reject transitive replication from replica - if we do not support it. |
Yep sure I've patched control plane |
@chakaz please see my comment to Andy above |
Yes, I'm on it! |
We do not support connecting a replica to a replica, but before this PR we allowed doing so. This PR disables that behavior. Fixes #3679
* chore: Forbid replicating a replica
* `replicaof_mu_`
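The PR's behavior can be modeled as a guard: a node that is itself a replica refuses to serve as a master for another replica. This is an illustrative Python model, not Dragonfly's actual implementation (which is C++ and, per the commit list, involves `replicaof_mu_`):

```python
# Illustrative model of the guard the PR describes: reject attaching a
# replica to a node that is itself still a replica.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.master = None  # None => this node acts as a master

    def accept_replica(self, other: "Node") -> bool:
        if self.master is not None:
            return False  # replica-of-a-replica: now rejected
        other.master = self
        return True

b, c = Node("B"), Node("C")
b.master = Node("A")        # B is still a replica of A
print(b.accept_replica(c))  # False: rejected while B is a replica
b.master = None             # promote B
print(b.accept_replica(c))  # True
```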
Describe the bug
We sometimes see replicas getting stuck in full sync as part of our test suite
For example, the database items graph shows:
The replica gets most of the keys but then just hangs without completing the sync
The replica logs show:
So it seems to get all the keys from the full sync but never transitions to stable sync?
(I can send full datastore logs over if needed)
To Reproduce
I don't have a reliable way to reproduce
This sometimes happens in our test case, where we populate a datastore with two replicas with 75m keys (~75GB), then kill the master (SIGKILL) so one of the replicas is promoted to master and the other becomes a replica of the new master. The new replica connects to the new master, but then hangs as described above.
It probably isn't a lot to go on, but we can enable any logs you suggest if it helps debug the issue
Environment (please complete the following information):