
Set replicas to panic on disk errors, and optionally panic on replication errors #10504

Merged (25 commits) on Apr 26, 2022

Conversation

@madolson (Contributor) commented Apr 1, 2022

  • Until now, replicas that were unable to persist would still execute the commands they got from the master; now they panic by default, and we add a new replica-ignore-disk-write-errors config to change that.
  • Until now, when a command failed on a replica or during AOF loading, it only logged a warning and bumped a stat; we add a new propagation-error-behavior config to allow panicking in that state (it may become the default one day).

Note that commands that fail on the replica can either indicate a bug that could cause data inconsistency between the replica and the master, or they can in some cases (specifically in previous versions) be the result of a command (e.g. EVAL) that failed on the master but still had to be propagated so it would fail on the replica as well.
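For reference, the two new options described above would be set in redis.conf along these lines. This is a sketch based on the descriptions in this PR; the exact option names, accepted values, and default wording should be checked against the shipped redis.conf:

```
# Sketch of the new options described in this PR.

# By default a replica now panics if it cannot persist to disk the
# commands it received from its master; set to yes to restore the old
# behavior of executing them anyway.
replica-ignore-disk-write-errors no

# What to do when a command fails while being applied from the
# replication stream or while loading the AOF:
#   ignore - log a warning and bump a stat (the old behavior, default)
#   panic  - crash the server so the problem cannot pass silently
propagation-error-behavior ignore
```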

Background

Based on our conversations about data corruption (#10419), it seemed like maybe we should add some defense here. If a replica tries to apply an invalid command, this flag will crash the replica. This has been enabled in tests.

To do:

  • Consider removing the check added in d3b4662 and panicking instead (either depending on the new config, or always). (Originally we thought this was an issue, but it doesn't seem to make sense, since primaries/replicas can disagree on slot ownership.)
  • Consider panicking on disk errors in the replica (possibly using another new config: replica-ignore-disk-write-errors).

@oranagra (Member) commented Apr 1, 2022

I'm aware of at least one valid case for a command to fail on the replica (it may no longer be valid): when a script executed on the master failed halfway (after making some modifications), that script had to be propagated to the replica so it would fail there as well.
@guybe7 @soloestoy you may have other examples.

I didn't look at the failed tests; maybe they show something similar to that.

P.S. we already have a metric for these: stat_unexpected_error_replies

@soloestoy (Collaborator) commented Apr 1, 2022

> I'm aware of at least one valid case for a command to fail on the replica (it may no longer be valid): when a script executed on the master failed halfway (after making some modifications), that script had to be propagated to the replica so it would fail there as well.

It doesn't matter after 5.0, since we rewrite Lua scripts as MULTI/EXEC by default, but it is indeed a problem when upgrading from an older Redis version.

BTW, I don't think #10419 is a bug; in particular, we would have to do a lot of trivial but insignificant work to handle replication, and the last-arg-wins behavior is just Redis protocol style, I think.

Removed extra integration test

Co-authored-by: Binbin <[email protected]>
@madolson (Contributor, Author) commented Apr 1, 2022

As per @soloestoy's comment, I didn't think there were any "valid" cases of sending incompatible commands along the replication stream, since we now only do effect replication. If there are others, I would still like to get something like this in (maybe defaulted to off). I'm just slightly worried about silent corruption.

stat_unexpected_error_replies should cover it; I'm pretty sure I saw that and then promptly forgot while publishing the PR.

@madolson (Contributor, Author) commented Apr 1, 2022

I think the tests are me forgetting to tag debug stuff.

@oranagra (Member) commented Apr 3, 2022

Well, I still fear that feature; I have a feeling there are several other cases that we're missing.
We need to conduct some deeper research to find them.

But in any case, at the very least, we should add 2 conditions to prevent it:

  1. Detect the version of the master, and avoid this if the master is a lower version (or at least lower than 7).
  2. Maybe a config (possibly disabled by default).

@madolson (Contributor, Author) commented Apr 4, 2022

OK, I'm okay with a config. I'm not really sure we need the check for Redis 7; even if a user is on a lower version, they would have to be sending traffic that could be unintentionally divergent. I would argue it's better to resync on a potential issue than to silently ignore a real one.

@oranagra (Member) commented Apr 4, 2022

A re-sync is not necessarily a solution.
In a certain use case (for instance one with an EVAL that fails halfway), we could be facing repeated re-connections, one every second.
You can argue that once we have a config for that, the user has a way out, and also that script propagation was already disabled by default anyway, but maybe there are other cases...

Did you try to run the tests with --dont-clean and grep for these warnings, to see if we have other issues like that?

@madolson (Contributor, Author) commented Apr 4, 2022

I agree it's possible it's not better, but it's also possibly much worse. I think a config that defaults to disabled for now at least provides a good middle ground we can change later. I would argue that disconnects are surfaced better than log messages, at least. I will do the grep test when I update the PR a little later today.

@madolson madolson changed the title Force resync from master when data corruption is detected on replica Add config to allow replicas to panic on replication errors Apr 8, 2022
@madolson (Contributor, Author) commented

Today it's possible for a replica to not know about the correct topology of its master (by design, only masters are authoritative about slots), so we can't really do any validation besides checking for "cross-slot" operations. Although I originally advocated for it, I now think we should just skip that extra validation and blindly accept whatever the master sends us and apply it.

A weird dangly space.
@oranagra oranagra changed the title Add config to allow replicas to panic on replication errors Set replicas to panic on disk errors, and optionally panic on replication errors Apr 26, 2022
@oranagra (Member) left a comment
We don't have test coverage for repl_ignore_disk_write_error; the code seems safe enough to merge without a test, though.

Co-authored-by: Binbin <[email protected]>
@oranagra (Member) left a comment

Approved by core-team meeting.

@oranagra oranagra added the breaking-change This change can potentially break existing application label Apr 26, 2022
oranagra and others added 2 commits April 26, 2022 12:33
@oranagra oranagra merged commit 6fa8e4f into redis:unstable Apr 26, 2022
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Apr 26, 2022
Because of a missing typeof, we get errors like this:
- multiple definition of `replicationErrorBehavior'
- ld: error: duplicate symbol: replicationErrorBehavior

Introduced in redis#10504
oranagra pushed a commit that referenced this pull request Apr 26, 2022
@oranagra oranagra mentioned this pull request Apr 27, 2022
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…tion errors (redis#10504)

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Labels

  • approval-needed: Waiting for core team approval to be merged
  • breaking-change: This change can potentially break existing application
  • release-notes: Indication that this issue needs to be mentioned in the release notes
  • state:major-decision: Requires core team consensus