
Set replicas to panic on disk errors, and optionally panic on replication errors #10504

Merged (25 commits) on Apr 26, 2022

Conversation

@madolson (Contributor) commented Apr 1, 2022

  • Until now, replicas that were unable to persist would still execute the commands they got from the master; now they panic by default, and we add a new replica-ignore-disk-write-errors config to change that.
  • Until now, when a command failed on a replica or during AOF loading, it only logged a warning and bumped a stat; we add a new propagation-error-behavior config to allow panicking in that state (it may become the default one day).

Note that commands that fail on the replica can either indicate a bug that could cause data inconsistency between the replica and the master, or they can in some cases (specifically in previous versions) be the result of a command (e.g. EVAL) that failed on the master but still had to be propagated so it would fail on the replica as well.
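For reference, the two new options described above would be set in redis.conf along these lines. This is a sketch based on the descriptions in this PR; the exact option names, accepted values, and default wording should be checked against the shipped redis.conf:

```
# Sketch of the new options described in this PR.

# By default a replica now panics if it cannot persist to disk the
# commands it received from its master; set to yes to restore the old
# behavior of executing them anyway.
replica-ignore-disk-write-errors no

# What to do when a command fails while being applied from the
# replication stream or while loading the AOF:
#   ignore - log a warning and bump a stat (the old behavior, default)
#   panic  - crash the server so the problem cannot pass silently
propagation-error-behavior ignore
```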

Background

Based on our conversations about data corruption (#10419), it seemed like maybe we should add some defense here. If a replica tries to apply an invalid command, this flag will crash the replica. This has been enabled in tests.

To do:

  • Consider removing the check added in d3b4662 and panicking instead (either depending on the new config, or always). (Originally we thought this was an issue, but it doesn't seem to make sense, since primaries/replicas can disagree on slot ownership.)
  • Consider panicking on disk errors in the replica (possibly using another new config: replica-ignore-disk-write-errors).

@oranagra (Member) commented Apr 1, 2022

I'm aware of at least one valid case for a command to fail on the replica (it may no longer be valid): when a script executed on the master failed halfway (after making some modifications), that script had to be propagated to the replica so it would fail there as well.
@guybe7 @soloestoy you may have other examples.

I didn't look at the failed tests; maybe they show something similar to that.

P.S. we already have a metric for these: stat_unexpected_error_replies

@soloestoy (Collaborator) commented Apr 1, 2022

> I'm aware of at least one valid case for a command to fail on the replica (it may no longer be valid): when a script executed on the master failed halfway (after making some modifications), that script had to be propagated to the replica so it would fail there as well.

It doesn't matter after 5.0, since we rewrite Lua scripts as MULTI/EXEC by default, but it is indeed a problem when upgrading from an older Redis version.

BTW, I don't think #10419 is a bug; in particular, we would have to do a lot of trivial but insignificant work to handle replication, and the last-arg-wins behavior is just Redis protocol style, I think.

Removed extra integration test

Co-authored-by: Binbin <[email protected]>
@madolson (Contributor, Author) commented Apr 1, 2022

As per @soloestoy's comment, I didn't think there were any "valid" cases of sending incompatible commands along the replication stream, since we now only do effect replication. If there are others, I would still like to get something like this in (maybe defaulted to off). I'm just slightly worried about silent corruption.

stat_unexpected_error_replies should cover it; I'm pretty sure I saw that and then promptly forgot while publishing the PR.

@madolson (Contributor, Author) commented Apr 1, 2022

I think the tests are me forgetting to tag debug stuff.

@oranagra (Member) commented Apr 3, 2022

Well, I still fear that feature; I have a feeling there are several other cases that we're missing.
We need to conduct some deeper research to find them.

But in any case, at the very least, we should add 2 conditions to prevent it:

  1. Detect the version of the master, and avoid this if the master is a lower version (or at least lower than 7).
  2. Maybe a config (possibly disabled by default).

@madolson (Contributor, Author) commented Apr 4, 2022

OK, I'm okay with a config. I'm not really sure we need the check for Redis 7; even if a user is on a lower version, they would have to be sending traffic that could be unintentionally divergent. I would argue it's better to resync on a potential issue than to silently ignore a real one.

@oranagra (Member) commented Apr 4, 2022

A re-sync is not necessarily a solution.
In a certain use case (for instance one with an EVAL that fails halfway), we could be facing repeated re-connections, one every second.
You can argue that once we have a config for that, the user has a way out, and also that script propagation was already disabled by default anyway, but maybe there are other cases...

Did you try to run the tests with --dont-clean and grep for these warnings, to see if we have other issues like that?

@madolson (Contributor, Author) commented Apr 4, 2022

I agree it's possible it's not better, but it's also possibly much worse. I think a config that defaults to disabled for now at least provides a good middle ground we can change later. I would argue that disconnects are surfaced better than log messages, at least. I will do the grep test when I update the PR a little later today.

@madolson madolson changed the title Force resync from master when data corruption is detected on replica Add config to allow replicas to panic on replication errors Apr 8, 2022
@madolson (Contributor, Author) commented

Today it's possible for a replica to not know about the correct topology of its master (by design, only masters are authoritative about slots), so we can't really do any validation besides checking for "cross-slot" operations. Although I originally advocated for it, I now think we should just skip that extra validation and blindly accept whatever the master sends us and apply it.

A weird dangly space.
@oranagra oranagra changed the title Add config to allow replicas to panic on replication errors Set replicas to panic on disk errors, and optionally panic on replication errors Apr 26, 2022
@oranagra (Member) left a comment
We don't have test coverage for repl_ignore_disk_write_error; the code seems safe enough to merge without a test, though.

Co-authored-by: Binbin <[email protected]>
@oranagra (Member) left a comment

Approved by core-team meeting.

@oranagra oranagra added the breaking-change This change can potentially break existing application label Apr 26, 2022
oranagra and others added 2 commits April 26, 2022 12:33
@oranagra oranagra merged commit 6fa8e4f into redis:unstable Apr 26, 2022
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Apr 26, 2022
Because of a missing typeof, we get errors like this:
- multiple definition of `replicationErrorBehavior'
- ld: error: duplicate symbol: replicationErrorBehavior

Introduced in redis#10504
oranagra pushed a commit that referenced this pull request Apr 26, 2022
@oranagra oranagra mentioned this pull request Apr 27, 2022
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…tion errors (redis#10504)

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Labels

  • approval-needed: Waiting for core team approval to be merged
  • breaking-change: This change can potentially break existing application
  • release-notes: Indication that this issue needs to be mentioned in the release notes
  • state:major-decision: Requires core team consensus