Replica hangs in full sync #3679

Closed
andydunstall opened this issue Sep 9, 2024 · 16 comments · Fixed by #3779
Labels: bug (Something isn't working)

Comments

@andydunstall
Contributor

Describe the bug
We sometimes see replicas getting stuck in full sync as part of our test suite

As the database items graph shows:
image

The replica gets most of the keys but then just hangs without completing the sync.

The replica logs show:

I20240909 12:02:25.220778  1843 replica.cc:566] Started full sync with 10.0.43.58:9999
I20240909 12:04:46.989950  1843 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:06:08.022770  1842 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:06:23.528414  1845 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:06:40.811838  1841 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240909 12:07:00.670682  1847 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END

(5 minutes pass, then we delete the datastore and the replica disconnects)

I20240909 12:11:54.027276  1844 rdb_load.cc:1141] Error while calling src_->Read(mb)
I20240909 12:11:54.027280  1843 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027284  1845 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027343  1844 rdb_load.cc:2549] ReadObj error system:103 for key test:61158666
I20240909 12:11:54.027280  1840 rdb_load.cc:1160] Error while calling src_->ReadAtLeast(mb, size)
I20240909 12:11:54.027284  1846 rdb_load.cc:1141] Error while calling src_->Read(mb)
I20240909 12:11:54.027350  1843 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027383  1840 rdb_load.cc:2549] ReadObj error system:103 for key test:23694841
I20240909 12:11:54.027402  1842 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027402  1846 rdb_load.cc:2549] ReadObj error system:103 for key test:72764920
I20240909 12:11:54.027292  1841 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027352  1845 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027402  1847 rdb_load.cc:2205] Error reading from source: system:103 1 bytes
I20240909 12:11:54.027453  1842 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027494  1847 rdb_load.cc:1999] Error while calling FetchType()
I20240909 12:11:54.027479  1841 rdb_load.cc:1999] Error while calling FetchType()
W20240909 12:11:54.027691  1843 replica.cc:243] Error syncing with 10.0.43.58:9999 system:103 Software caused connection abort

So it seems to get all the keys from the full sync but never transitions to stable sync?

(I can send full datastore logs over if needed)

To Reproduce
I don't have a reliable way to reproduce this.

This sometimes happens in our test case, where we populate a datastore that has two replicas with 75M keys (~75GB), then kill the master (SIGKILL) so that one of the replicas is promoted to master and the other becomes a replica of the new master.

The new replica connects to the new master, but then hangs as described above.

It probably isn't a lot to go on, but we can enable any logs you suggest if it helps debug the issue.
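For reference, the scenario is roughly the following (a minimal sketch assuming a redis-py client; the hostnames, port, process id, and DEBUG POPULATE arguments are illustrative placeholders rather than the actual test code):

import os
import signal
import redis

master = redis.Redis(host="master-host", port=9999)
replica_a = redis.Redis(host="replica-a-host", port=9999)
replica_b = redis.Redis(host="replica-b-host", port=9999)

# Populate the master with ~75M keys of ~1KB values (~75GB); the replicas
# receive the data through replication.
master.execute_command("DEBUG", "POPULATE", 75_000_000, "test:", 1024)

# Kill the master process with SIGKILL, as the test does.
master_pid = 1234  # placeholder; the real test gets this from the deployment
os.kill(master_pid, signal.SIGKILL)

# Promote one replica and re-point the other at the new master.
replica_a.execute_command("REPLICAOF", "NO", "ONE")
replica_b.execute_command("REPLICAOF", "replica-a-host", "9999")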

Environment (please complete the following information):

  • Ubuntu AWS x2gd.2xlarge
  • Dragonfly v1.22.0
andydunstall added the bug (Something isn't working) label on Sep 9, 2024
@romange
Collaborator

romange commented Sep 9, 2024

how do you know it hangs? maybe it's the master that hangs?
in any case, is it possible to unconditionally call "info all" on both master and replica before deleting the datastore?
and then of course print both responses into the test logs.

@andydunstall
Contributor Author

andydunstall commented Sep 9, 2024

how do you know it hangs? maybe it's the master that hangs?

I don't know; it just never completes the sync.

in any case, is it possible to unconditionally call "info all" on both master and replica before deleting the datastore?

Yep, sure, will add. (Edit: Actually it's only the 'nightly' suite that fails, which only runs once a day, so I've updated it to not delete the datastore on failure so we can inspect it before cleaning up manually.)
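(For reference, what the suggestion amounts to is roughly this; a minimal sketch assuming a redis-py client, where the node addresses and log path are placeholders:)

import redis

def dump_info(nodes, log_path="replication_info.log"):
    # Capture INFO ALL from every node before the datastore is torn down,
    # so both responses end up in the test logs.
    with open(log_path, "a") as f:
        for name, (host, port) in nodes.items():
            info = redis.Redis(host=host, port=port, decode_responses=True).info("all")
            f.write(f"--- {name} ({host}:{port}) ---\n")
            for key, value in info.items():
                f.write(f"{key}: {value}\n")

# Called unconditionally in the test teardown, before deleting the datastore.
dump_info({"master": ("master-host", 9999), "replica": ("replica-host", 9999)})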

@romange
Collaborator

romange commented Sep 9, 2024

how do you check if sync was completed? based on the "info" command?

@andydunstall
Contributor Author

Yeah, once the replica is connected to the expected master and 'sync in progress' is false.
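For reference, a minimal sketch of that check (assuming a redis-py client; the field names match the INFO replication output quoted later in this thread):

import redis

def sync_complete(replica: redis.Redis, expected_master_host: str) -> bool:
    info = replica.info("replication")
    return (
        info.get("role") == "replica"
        and info.get("master_host") == expected_master_host
        and int(info.get("master_sync_in_progress", 1)) == 0
    )

# Example: sync_complete(redis.Redis(host="replica-host", port=9999), "master-host")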

@andydunstall
Contributor Author

andydunstall commented Sep 19, 2024

This happened again (twice, actually) and we captured INFO this time:
info.zip

Replica shows sync in progress for over a day:

# Replication
role:replica
master_host:10.0.37.98
master_port:9999
master_link_status:up
master_last_io_seconds_ago:98035
master_sync_in_progress:1
master_replid:8040121f00740ce4f57f695be5a82ce557cd56e4
slave_priority:100
slave_read_only:1

The replica was stuck in this state for hours; again, the latest replica logs show:

I20240918 12:30:41.798735  1800 replica.cc:566] Started full sync with 10.0.37.98:9999
I20240918 12:34:24.742923  1799 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:34:27.340220  1800 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:34:32.594964  1801 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:34:38.525820  1795 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END
I20240918 12:35:11.956713  1796 rdb_load.cc:2050] Read RDB_OPCODE_FULLSYNC_END

@romange
Collaborator

romange commented Sep 19, 2024 via email

@andydunstall
Contributor Author

Do you happen to know which one got stuck?

10.0.34.241

It was in staging, so we have full instance logs, metrics, state, etc. if that's useful.

@romange
Collaborator

romange commented Sep 20, 2024

yeah, they are useful. Please attach here.

@andydunstall
Contributor Author

Looking through control plane logs to get the exact sequence of events (as this only happens in one particular test).

Focusing on datastore dst_esfncx612, it starts with nodes:

  • node_a3wu4qd9x (A) (10.0.32.212)
  • node_va2ek40gc (B) (10.0.45.89)

Then it is updated to:

  • node_19wtbsfj5 (C) (10.0.43.24)
  • node_b8w20be2k (D) (10.0.43.59)

With steps:

  1. Datastore created
  2. Nodes A and B are ready, where A is master and B is a replica of A
  3. Datastore populated with 75GB
  4. Datastore updated which provisions C and D, where only C is configured as a replica of A (D waits for C to sync before also replicating)
  5. Node A (the master) crashes (manually killed with SIGKILL to test datastore recovery during updates)
  6. Node C is reconfigured as a replica of B
  7. Node B is reconfigured as a master (note that in this case node C is configured to replicate B before B becomes a master; not sure if that matters)
  8. Node A recovers as a replica of node B

Then node A syncs with B, but node C hangs.

Therefore

  • Node C is configured as a replica of B, before B is configured as a master
  • Node B has two parallel full syncs from both A and C (A syncs but C hangs)

Are both of those cases ok?
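To make the ordering concrete, steps 6 and 7 amount to roughly the following (a sketch using redis-py; the addresses are the ones listed above and the port is taken from the logs):

import redis

node_b = redis.Redis(host="10.0.45.89", port=9999)  # B, still a replica of the dead master A at this point
node_c = redis.Redis(host="10.0.43.24", port=9999)  # C, the node that ends up hanging

# Step 6: C is pointed at B while B is itself still a replica.
node_c.execute_command("REPLICAOF", "10.0.45.89", "9999")

# Step 7: only afterwards is B promoted to master.
node_b.execute_command("REPLICAOF", "NO", "ONE")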

Rather than upload full logs and metrics here (which contain internal info), it's probably easiest to download them with dfcloud? Quickly comparing logs of successful vs failed cases, in the failed case the master (node B above) always seems to log rdb_save.cc:1271] Error writing to sink Input/output error (checked the 3 most recent error cases).

@adiholden
Collaborator

We do not support replicating a replica (node C is configured as a replica of B before B is configured as a master).
I believe the change in Dragonfly should be to reply with an error when REPLICAOF is run against a host that is not a master.
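Under that proposal, step 6 above would fail fast instead of hanging. A sketch of what the control plane could then expect (this assumes the error surfaces on the REPLICAOF call itself; whether it does, and the exact error text, depend on the eventual implementation):

import redis

node_c = redis.Redis(host="10.0.43.24", port=9999)
try:
    # 10.0.45.89 (B) is still a replica here, so the command should be rejected.
    node_c.execute_command("REPLICAOF", "10.0.45.89", "9999")
except redis.ResponseError as err:
    print(f"replication rejected: {err}")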

@andydunstall
Contributor Author

We do not support replicating a replica

Ah, OK, thanks; will update the control plane (FWIW the test often still succeeds following the above steps, i.e. replicating a replica).

@chakaz
Collaborator

chakaz commented Sep 24, 2024

Hi @andydunstall
Do you think it's safe to close this issue for now, and reopen it should it reoccur?
Or is there anything still pending that I've missed?

@romange
Collaborator

romange commented Sep 24, 2024

@chakaz I think we should reject transitive replication from a replica, if we do not support it.

@andydunstall
Contributor Author

Do you think it's safe to close this issue for now, and reopen it should it reoccur?

Yep, sure; I've patched the control plane.

@adiholden
Collaborator

Hi @andydunstall Do you think it's safe to close this issue for now, and reopen it should it reoccur? Or is there anything still pending that I've missed?

@chakaz please see my comment to Andy above

@chakaz
Collaborator

chakaz commented Sep 24, 2024

Yes, I'm on it!

chakaz added a commit that referenced this issue Sep 24, 2024
We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.

Fixes #3679
chakaz added a commit that referenced this issue Sep 24, 2024
* chore: Forbid replicating a replica

We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.

Fixes #3679

* `replicaof_mu_`