Add checksum in replication RDB bulk transfer (slave crashed because of corrupted RDB) #366

Closed
jokea opened this issue Mar 1, 2012 · 3 comments

@jokea
Contributor

jokea commented Mar 1, 2012

Hi,

Several slave instances crashed, and the log shows:

[3243] 29 Feb 17:56:11 # MASTER time out: no data nor PING received...
[3243] 29 Feb 17:56:11 * Connecting to MASTER...
[3243] 29 Feb 17:56:11 * MASTER <-> SLAVE sync started
[3243] 29 Feb 17:56:11 * Non blocking connect for SYNC fired the event.
[3243] 29 Feb 17:56:25 * MASTER <-> SLAVE sync: receiving 448824372 bytes from master
[3243] 29 Feb 18:04:33 * MASTER <-> SLAVE sync: Loading DB in memory
[3243] 29 Feb 18:04:40 # === REDIS BUG REPORT START: Cut & paste starting from here ===
[3243] 29 Feb 18:04:40 # !!! Software Failure. Press left mouse button to continue
[3243] 29 Feb 18:04:40 # Guru Meditation: "Unknown RDB encoding type" #rdb.c:648
[3243] 29 Feb 18:04:40 # (forcing SIGSEGV in order to print the stack trace)
[3243] 29 Feb 18:04:40 #     Redis 2.4.6 crashed by signal: 11
[3243] 29 Feb 18:04:40 #     Failed assertion:  (:0)
[3243] 29 Feb 18:04:40 # --- STACK TRACE
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(_redisPanic+0x62) [0x7f58ea55c952]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(_redisPanic+0x62) [0x7f58ea55c952]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(rdbGenericLoadStringObject+0x7d) [0x7f58ea547b4d]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(rdbLoadObject+0x3b9) [0x7f58ea5491c9]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(rdbLoad+0x15e) [0x7f58ea5496ee]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(readSyncBulkPayload+0x1cc) [0x7f58ea54662c]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(aeProcessEvents+0x168) [0x7f58ea534af8]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(aeMain+0x2e) [0x7f58ea534d0e]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(main+0x1e8) [0x7f58ea538e95]
....

This happened while someone was adjusting the network, which made the link unstable. As
you can see, transferring the 440MB dump took more than 8 minutes.

After the network stabilized we restarted the slave; this time the transfer took less than 8 seconds and nothing went wrong:

[18680] 29 Feb 18:33:39 * SLAVE OF 10.110.24.15:7701 enabled (user request)
[18680] 29 Feb 18:33:41 * Connecting to MASTER...
[18680] 29 Feb 18:33:41 * MASTER <-> SLAVE sync started
[18680] 29 Feb 18:33:41 * Non blocking connect for SYNC fired the event.
[18680] 29 Feb 18:33:52 * MASTER <-> SLAVE sync: receiving 448826470 bytes from master
[18680] 29 Feb 18:33:59 * MASTER <-> SLAVE sync: Loading DB in memory
[18680] 29 Feb 18:34:09 * MASTER <-> SLAVE sync: Finished with success
...

I was thinking of appending an MD5 sum to dump.rdb after it is created, which wouldn't break compatibility.
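To make the proposal above concrete, here is a minimal Python sketch of appending an MD5 trailer to a dump file and checking it before loading. It is only an illustration: the helper names `append_md5` and `verify_md5` are hypothetical, the 16-byte raw-digest trailer layout is an assumption, and nothing here is taken from the Redis source.

```python
import hashlib
import os

MD5_LEN = 16  # size of a raw MD5 digest in bytes (assumed trailer layout)
CHUNK = 1 << 20  # read the file in 1 MiB pieces to keep memory use flat


def append_md5(path):
    """Append the MD5 digest of the file's current contents to the file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    with open(path, "ab") as f:
        f.write(digest.digest())


def verify_md5(path):
    """Return True if the last 16 bytes equal the MD5 of everything before them."""
    size = os.path.getsize(path)
    if size < MD5_LEN:
        return False
    digest = hashlib.md5()
    remaining = size - MD5_LEN
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(CHUNK, remaining))
            if not chunk:
                return False  # file shrank while we were reading it
            digest.update(chunk)
            remaining -= len(chunk)
        trailer = f.read(MD5_LEN)
    return trailer == digest.digest()
```

In this sketch the master side would call `append_md5` once the dump is complete, and the receiving side would call `verify_md5` on the transferred file and request a new transfer on `False` instead of trying to load it.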

@antirez
Contributor

antirez commented Mar 1, 2012

Hello Jokea, the checksum looks like a good idea indeed, but I wonder if we can be sure that the file was corrupted during the transfer, and was not generated incorrectly for some reason.

Btw, I guess you could say that having the checksum would already answer this question...

@ghost ghost assigned antirez Mar 1, 2012
@jokea
Contributor Author

jokea commented Mar 2, 2012

eh... I should have made a backup of the dump file generated by the master, so that we could see where the problem is. Now this is really hard to reproduce.

There are also 3 slaves that stopped replicating while the master <-> slave link was still up. We noticed the desync because their number of keys stopped tracking the master's:

master_host:10.77.15.20
master_port:8801
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0

I checked the client status from one of the slaves and got this:

addr=10.77.15.20:8801 fd=8 idle=0 flags=M db=0 sub=0 psub=0 qbuf=415425449 obl=0 oll=0 events=r cmd=NULL

Since we already fixed the protocol desync bug in issue #141, I think this was caused by the network transfer too. We have 7 slaves attached, and the 3 desynced slaves are in a different location from the master; the network adjustment only affected the link between the two locations.
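The symptom in the client line above (a master connection, `flags=M`, holding roughly 415 MB in its query buffer with `cmd=NULL`) can be checked for automatically. The following Python sketch parses such a `key=value` line and applies a simple heuristic. The function names and the 64 MiB threshold are assumptions chosen for illustration, not values from this thread or from Redis.

```python
def parse_client_line(line):
    """Split a 'key=value key=value ...' client line into a dict of strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


def master_link_looks_stuck(fields, qbuf_limit=64 * 1024 * 1024):
    """Heuristic (hypothetical threshold): a master client with a very large
    query buffer and no command being processed."""
    return (
        "M" in fields.get("flags", "")
        and int(fields.get("qbuf", "0")) > qbuf_limit
        and fields.get("cmd") == "NULL"
    )


line = (
    "addr=10.77.15.20:8801 fd=8 idle=0 flags=M db=0 sub=0 psub=0 "
    "qbuf=415425449 obl=0 oll=0 events=r cmd=NULL"
)
info = parse_client_line(line)
print(info["qbuf"], master_link_looks_stuck(info))  # prints: 415425449 True
```

A check like this only reports that the buffer is growing without being consumed; it does not tell whether the cause is a corrupted stream or something else.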

@antirez
Contributor

antirez commented Apr 9, 2012

Hi @jokea, I went ahead and added a checksum directly to the RDB format itself, so this will protect not only replication but also any other use of the RDB format, especially loading an RDB file after a restart. For now the changes are in the 'rdbcksum' branch, but they will be merged into unstable and 2.6 soon. Thanks! Closing.
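As a rough illustration of what a checksum built into a file format buys, the Python sketch below keeps a running checksum while a payload is written, stores it as a trailer, and refuses to load data whose trailer does not match. It is not the code from the 'rdbcksum' branch: CRC32 from `zlib` is used here purely as a stand-in algorithm, and the `ChecksumWriter` and `load_verified` names and the 4-byte little-endian trailer are assumptions.

```python
import io
import struct
import zlib


class ChecksumWriter:
    """Wrap a binary stream and keep a running CRC32 of everything written."""

    def __init__(self, stream):
        self.stream = stream
        self.crc = 0

    def write(self, data):
        # zlib.crc32 accepts the previous value, so the CRC is updated per write
        self.crc = zlib.crc32(data, self.crc)
        self.stream.write(data)

    def finish(self):
        # Trailer: 4-byte little-endian CRC32 of all bytes written so far
        self.stream.write(struct.pack("<I", self.crc))


def load_verified(blob):
    """Return the payload if the trailing CRC32 matches, else raise ValueError."""
    if len(blob) < 4:
        raise ValueError("too short to contain a checksum")
    payload, trailer = blob[:-4], blob[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: refusing to load")
    return payload


buf = io.BytesIO()
writer = ChecksumWriter(buf)
writer.write(b"header")
writer.write(b"key-value data")
writer.finish()
blob = buf.getvalue()
```

Because the check lives in the loader rather than in the transfer code, the same `load_verified` call covers both a file received from a master and a file read from disk after a restart, and a corrupted file produces a clean error instead of a crash partway through parsing.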

@antirez antirez closed this as completed Apr 9, 2012