Replica hangs in full sync #3679

Closed
andydunstall opened this issue Sep 9, 2024 · 16 comments · Fixed by #3779
Labels: bug (Something isn't working)

Comments

@andydunstall
Contributor

Describe the bug
We sometimes see replicas getting stuck in full sync as part of our test suite

As the database items graph shows:
image

The replica gets most of the keys but then just hangs without completing the sync.

The replica logs show:

I20240909 12:02:25.220778  1843 replica.cc:566] Started full sync with 10.0.43.58:9999
I20240909 12:04:46.989950  1843 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:06:08.022770  1842 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:06:23.528414  1845 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:06:40.811838  1841 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:07:00.670682  1847 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END

(5 minutes pass, then we delete the datastore and the replica disconnects)

I20240909 12:11:54.027276  1844 rdb_load.cc:1141] Error while calling src_->Read(mb)
I20240909 12:11:54.027280  1843 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027284  1845 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027343  1844 rdb_load.cc:2549] ReadObj error system:103 for key test:61158666
I20240909 12:11:54.027280  1840 rdb_load.cc:1160] Error while calling src_->ReadAtLeast(mb, size)
I20240909 12:11:54.027284  1846 rdb_load.cc:1141] Error while calling src_->Read(mb)
I20240909 12:11:54.027350  1843 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027383  1840 rdb_load.cc:2549] ReadObj error system:103 for key test:23694841
I20240909 12:11:54.027402  1842 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027402  1846 rdb_load.cc:2549] ReadObj error system:103 for key test:72764920
I20240909 12:11:54.027292  1841 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027352  1845 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027402  1847 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027453  1842 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027494  1847 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027479  1841 rdb_load.cc:1999] Error while calling FetchType()
W20240909 12:11:54.027691  1843 replica.cc:243] Error syncing with 10.0.43.58:9999 system:103 Software caused connection abort

So it seems to get all the keys from the full sync but never transitions to stable sync?

(I can send full datastore logs over if needed)

To Reproduce
I don't have a reliable way to reproduce this.

This sometimes happens in our test case, where we populate a datastore that has two replicas with 75M keys (~75GB), then kill the master (SIGKILL) so that one of the replicas is promoted to master and the other becomes a replica of the new master.

The new replica connects to the new master, but then hangs as described above.

It probably isn't a lot to go on, but we can enable any logs you suggest if it helps debug the issue.
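For reference, the scenario is roughly the following (a minimal sketch assuming a redis-py client; the hostnames, port, process id, and DEBUG POPULATE arguments are illustrative placeholders rather than the actual test code):

import os
import signal
import redis

master = redis.Redis(host="master-host", port=9999)
replica_a = redis.Redis(host="replica-a-host", port=9999)
replica_b = redis.Redis(host="replica-b-host", port=9999)

# Populate the master with ~75M keys of ~1KB values (~75GB); the replicas
# receive the data through replication.
master.execute_command("DEBUG", "POPULATE", 75_000_000, "test:", 1024)

# Kill the master process with SIGKILL, as the test does.
master_pid = 1234  # placeholder; the real test gets this from the deployment
os.kill(master_pid, signal.SIGKILL)

# Promote one replica and re-point the other at the new master.
replica_a.execute_command("REPLICAOF", "NO", "ONE")
replica_b.execute_command("REPLICAOF", "replica-a-host", "9999")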

Environment (please complete the following information):

  • Ubuntu AWS x2gd.2xlarge
  • Dragonfly v1.22.0
andydunstall added the bug (Something isn't working) label on Sep 9, 2024
@romange
Collaborator

romange commented Sep 9, 2024

how do you know it hangs? maybe it's the master that hangs?
in any case, is it possible to unconditionally call "info all" on both master and replica before deleting the datastore?
and then of course print both responses into the test logs.

@andydunstall
Contributor Author

andydunstall commented Sep 9, 2024

how do you know it hangs? maybe it's the master that hangs?

I don't know; it just never completes the sync.

in any case, is it possible to unconditionally call "info all" on both master and replica before deleting the datastore?

Yep, sure, will add. (Edit: Actually it's only the 'nightly' suite that fails, which only runs once a day, so I've updated it to not delete the datastore on failure so we can inspect it before cleaning up manually.)
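(For reference, what the suggestion amounts to is roughly this; a minimal sketch assuming a redis-py client, where the node addresses and log path are placeholders:)

import redis

def dump_info(nodes, log_path="replication_info.log"):
    # Capture INFO ALL from every node before the datastore is torn down,
    # so both responses end up in the test logs.
    with open(log_path, "a") as f:
        for name, (host, port) in nodes.items():
            info = redis.Redis(host=host, port=port, decode_responses=True).info("all")
            f.write(f"--- {name} ({host}:{port}) ---\n")
            for key, value in info.items():
                f.write(f"{key}: {value}\n")

# Called unconditionally in the test teardown, before deleting the datastore.
dump_info({"master": ("master-host", 9999), "replica": ("replica-host", 9999)})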

@romange
Collaborator

romange commented Sep 9, 2024

how do you check if sync was completed? based on the "info" command?

@andydunstall
Contributor Author

Yeah, once the replica is connected to the expected master and 'sync in progress' is false.
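For reference, a minimal sketch of that check (assuming a redis-py client; the field names match the INFO replication output quoted later in this thread):

import redis

def sync_complete(replica: redis.Redis, expected_master_host: str) -> bool:
    info = replica.info("replication")
    return (
        info.get("role") == "replica"
        and info.get("master_host") == expected_master_host
        and int(info.get("master_sync_in_progress", 1)) == 0
    )

# Example: sync_complete(redis.Redis(host="replica-host", port=9999), "master-host")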

@andydunstall
Contributor Author

andydunstall commented Sep 19, 2024

This happened again (twice, actually) and we captured INFO this time:
info.zip

Replica shows sync in progress for over a day:

# Replication
role:replica
master_host:10.0.37.98
master_port:9999
master_link_status:up
master_last_io_seconds_ago:98035
master_sync_in_progress:1
master_replid:8040121f00740ce4f57f695be5a82ce557cd56e4
slave_priority:100
slave_read_only:1

The replica was stuck in this state for hours; again, the latest replica logs show:

I20240918 12:30:41.798735  1800 replica.cc:566] Started full sync with 10.0.37.98:9999
I20240918 12:34:24.742923  1799 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:34:27.340220  1800 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:34:32.594964  1801 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:34:38.525820  1795 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:35:11.956713  1796 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END

@romange
Collaborator

romange commented Sep 19, 2024 via email

@andydunstall
Contributor Author

Do you happen to know which one got stuck?

10.0.34.241

It was in staging, so we have full instance logs, metrics, state, etc. if that's useful.

@romange
Collaborator

romange commented Sep 20, 2024

yeah, they are useful. Please attach here.

@andydunstall
Contributor Author

Looking through control plane logs to get the exact sequence of events (as this only happens in one particular test).

Focusing on datastore dst_esfncx612, it starts with nodes:

  • node_a3wu4qd9x (A) (10.0.32.212)
  • node_va2ek40gc (B) (10.0.45.89)

Then it is updated to:

  • node_19wtbsfj5 (C) (10.0.43.24)
  • node_b8w20be2k (D) (10.0.43.59)

With steps:

  1. Datastore created
  2. Nodes A and B are ready, where A is master and B is a replica of A
  3. Datastore populated with 75GB
  4. Datastore updated which provisions C and D, where only C is configured as a replica of A (D waits for C to sync before also replicating)
  5. Node A (the master) crashes (manually killed with SIGKILL to test datastore recovery during updates)
  6. Node C is reconfigured as a replica of B
  7. Node B is reconfigured as a master (note that in this case node C is configured to replicate B before B becomes a master; not sure if that matters)
  8. Node A recovers as a replica of node B

Then node A syncs with B, but node C hangs.

Therefore

  • Node C is configured as a replica of B, before B is configured as a master
  • Node B has two parallel full syncs from both A and C (A syncs but C hangs)

Are both of those cases ok?
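To make the ordering concrete, steps 6 and 7 amount to roughly the following (a sketch using redis-py; the addresses are the ones listed above and the port is taken from the logs):

import redis

node_b = redis.Redis(host="10.0.45.89", port=9999)  # B, still a replica of the dead master A at this point
node_c = redis.Redis(host="10.0.43.24", port=9999)  # C, the node that ends up hanging

# Step 6: C is pointed at B while B is itself still a replica.
node_c.execute_command("REPLICAOF", "10.0.45.89", "9999")

# Step 7: only afterwards is B promoted to master.
node_b.execute_command("REPLICAOF", "NO", "ONE")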

Rather than upload full logs and metrics here (which contain internal info), it's probably easiest to download them with dfcloud? Quickly comparing logs of successful vs failed cases, in the failed case the master (node B above) always seems to log rdb_save.cc:1271] Error writing to sink Input/output error (checked the 3 most recent error cases).

@adiholden
Collaborator

We do not support replicating a replica (node C is configured as a replica of B before B is configured as a master).
I believe the change in Dragonfly should be to reply with an error when REPLICAOF is run against a host that is not a master.
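Under that proposal, step 6 above would fail fast instead of hanging. A sketch of what the control plane could then expect (this assumes the error surfaces on the REPLICAOF call itself; whether it does, and the exact error text, depend on the eventual implementation):

import redis

node_c = redis.Redis(host="10.0.43.24", port=9999)
try:
    # 10.0.45.89 (B) is still a replica here, so the command should be rejected.
    node_c.execute_command("REPLICAOF", "10.0.45.89", "9999")
except redis.ResponseError as err:
    print(f"replication rejected: {err}")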

@andydunstall
Contributor Author

We do not support replicating a replica

Ah, OK, thanks; will update the control plane (FWIW the test often still succeeds following the above steps, i.e. replicating a replica).

@chakaz
Collaborator

chakaz commented Sep 24, 2024

Hi @andydunstall
Do you think it's safe to close this issue for now, and reopen it should it reoccur?
Or is there anything still pending that I've missed?

@romange
Collaborator

romange commented Sep 24, 2024

@chakaz I think we should reject transitive replication from a replica, if we do not support it.

@andydunstall
Contributor Author

Do you think it's safe to close this issue for now, and reopen it should it reoccur?

Yep, sure; I've patched the control plane.

@adiholden
Collaborator

Hi @andydunstall Do you think it's safe to close this issue for now, and reopen it should it reoccur? Or is there anything still pending that I've missed?

@chakaz please see my comment to Andy above

@chakaz
Collaborator

chakaz commented Sep 24, 2024

Yes, I'm on it!

chakaz added a commit that referenced this issue Sep 24, 2024
We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.

Fixes #3679
chakaz added a commit that referenced this issue Sep 24, 2024
* chore: Forbid replicating a replica

We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.

Fixes #3679

* `replicaof_mu_`