Add checksum in replication RDB bulk transfer (slave crashed because of corrupted RDB ) #366

jokea · 2012-03-01T03:28:21Z

Hi,

Several slave instances crashed, and the log shows:

[3243] 29 Feb 17:56:11 # MASTER time out: no data nor PING received...
[3243] 29 Feb 17:56:11 * Connecting to MASTER...
[3243] 29 Feb 17:56:11 * MASTER <-> SLAVE sync started
[3243] 29 Feb 17:56:11 * Non blocking connect for SYNC fired the event.
[3243] 29 Feb 17:56:25 * MASTER <-> SLAVE sync: receiving 448824372 bytes from master
[3243] 29 Feb 18:04:33 * MASTER <-> SLAVE sync: Loading DB in memory
[3243] 29 Feb 18:04:40 # === REDIS BUG REPORT START: Cut & paste starting from here ===
[3243] 29 Feb 18:04:40 # !!! Software Failure. Press left mouse button to continue
[3243] 29 Feb 18:04:40 # Guru Meditation: "Unknown RDB encoding type" #rdb.c:648
[3243] 29 Feb 18:04:40 # (forcing SIGSEGV in order to print the stack trace)
[3243] 29 Feb 18:04:40 #     Redis 2.4.6 crashed by signal: 11
[3243] 29 Feb 18:04:40 #     Failed assertion:  (:0)
[3243] 29 Feb 18:04:40 # --- STACK TRACE
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(_redisPanic+0x62) [0x7f58ea55c952]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(_redisPanic+0x62) [0x7f58ea55c952]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(rdbGenericLoadStringObject+0x7d) [0x7f58ea547b4d]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(rdbLoadObject+0x3b9) [0x7f58ea5491c9]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(rdbLoad+0x15e) [0x7f58ea5496ee]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(readSyncBulkPayload+0x1cc) [0x7f58ea54662c]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(aeProcessEvents+0x168) [0x7f58ea534af8]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(aeMain+0x2e) [0x7f58ea534d0e]
[3243] 29 Feb 18:04:40 # /usr/local/bin/redis-server(main+0x1e8) [0x7f58ea538e95]
....

This happened while someone was adjusting the network, which made the link unstable. As you can see,
transferring the 440MB dump took more than 8 minutes.

After the network stabilized and we restarted the slave, the transfer took less than 8 seconds and nothing went wrong:

[18680] 29 Feb 18:33:39 * SLAVE OF 10.110.24.15:7701 enabled (user request)
[18680] 29 Feb 18:33:41 * Connecting to MASTER...
[18680] 29 Feb 18:33:41 * MASTER <-> SLAVE sync started
[18680] 29 Feb 18:33:41 * Non blocking connect for SYNC fired the event.
[18680] 29 Feb 18:33:52 * MASTER <-> SLAVE sync: receiving 448826470 bytes from master
[18680] 29 Feb 18:33:59 * MASTER <-> SLAVE sync: Loading DB in memory
[18680] 29 Feb 18:34:09 * MASTER <-> SLAVE sync: Finished with success
...

I was thinking of appending an MD5 sum to dump.rdb after it is created, which won't break compatibility.
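To illustrate the proposal above, here is a minimal sketch of an appended-checksum scheme: the sender adds an MD5 digest after the payload, and the receiver checks it before loading anything. The function names `append_md5` and `verify_md5` are hypothetical, the payload is a stand-in byte string, and this is not Redis's on-disk format or replication code.

```python
import hashlib

# MD5 digests are always 16 bytes long.
DIGEST_LEN = hashlib.md5().digest_size


def append_md5(payload: bytes) -> bytes:
    """Return the payload followed by its 16-byte MD5 digest (hypothetical helper)."""
    return payload + hashlib.md5(payload).digest()


def verify_md5(blob: bytes) -> bytes:
    """Split off the trailing digest, check it, and return the payload.

    Raises ValueError if the blob is too short or the digest does not match,
    so a corrupted transfer is rejected before any attempt to load it.
    """
    if len(blob) < DIGEST_LEN:
        raise ValueError("blob too short to contain a checksum")
    payload, trailer = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    if hashlib.md5(payload).digest() != trailer:
        raise ValueError("checksum mismatch: payload is corrupted")
    return payload
```

With a check like this, the receiving side could discard a damaged transfer and retry the sync instead of hitting an assertion while parsing the data.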

antirez · 2012-03-01T15:19:56Z

Hello Jokea, the checksum looks like a good idea indeed, but I wonder if we can be sure that the file was corrupted during the transfer, and was not generated incorrectly for some reason.

Btw, I guess you could say that having the checksum would already answer this question...

jokea · 2012-03-02T01:45:17Z

eh... I should've made a backup of the dump file generated by the master, then we could have seen where the problem is. Now this is really hard to reproduce.

There are also 3 slaves that stopped replicating while the master <-> slave link was up. We noticed the desync because their number of keys stopped changing along with the master's:

master_host:10.77.15.20
master_port:8801
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0

I checked the client status from one of the slaves and got this:

addr=10.77.15.20:8801 fd=8 idle=0 flags=M db=0 sub=0 psub=0 qbuf=415425449 obl=0 oll=0 events=r cmd=NULL

Since we already fixed the protocol desync bug in issue #141, I think this was caused by the network transfer too. We have 7 slaves attached, and these 3 desynced slaves are in a different location from the master. The network adjustment only affected the link between the two locations.
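The symptom above (link reported as up, but a `qbuf` of roughly 415MB on the client flagged `M`) can be spotted programmatically. The following is a rough sketch that parses one client status line like the one quoted and flags a master connection whose query buffer is unusually large. The helper names and the 1 MiB threshold are assumptions for illustration, not Redis defaults.

```python
def parse_client_line(line: str) -> dict:
    """Turn a space-separated 'key=value' client status line into a dict."""
    fields = {}
    for token in line.split():
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


def master_looks_stuck(fields: dict, qbuf_limit: int = 1024 * 1024) -> bool:
    """Heuristic: the client is a master (flag 'M') and its query buffer
    holds more pending bytes than qbuf_limit (an arbitrary threshold)."""
    is_master = "M" in fields.get("flags", "")
    pending = int(fields.get("qbuf", "0"))
    return is_master and pending > qbuf_limit


line = ("addr=10.77.15.20:8801 fd=8 idle=0 flags=M db=0 sub=0 psub=0 "
        "qbuf=415425449 obl=0 oll=0 events=r cmd=NULL")
print(master_looks_stuck(parse_client_line(line)))
```

A check like this only hints at a desync; it cannot tell whether the buffered bytes were corrupted in transit.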

antirez · 2012-04-09T21:19:07Z

Hi @jokea, I went ahead and added a checksum directly in the RDB format itself, so this will protect replication but also any other use of the RDB format, especially just loading an RDB file after a restart. For now the changes are in the 'rdbcksum' branch but will be merged into unstable and 2.6 soon. Thanks! Closing.

ghost assigned antirez Mar 1, 2012

antirez closed this as completed Apr 9, 2012
