-
Notifications
You must be signed in to change notification settings - Fork 23.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set replicas to panic on disk errors, and optionally panic on replication errors #10504
Conversation
i'm aware of at least one valid case for a command to fail on the replica (may no longer be valid), didn't look at the failed tests, maybe they show something similar to that. p.s. we already have a metric for these: |
it doesn't matter after 5.0, since we rewrite lua script as BTW, I don't think #10419 is a bug, in particular we have to do a lot of trivial but insignificant work to handle replication, the last arg win is just redis protocol style I think. |
Removed extra integration test Co-authored-by: Binbin <[email protected]>
As per @soloestoy's comment, I didn't think there were any "valid" cases of sending incompatible commands along the replication stream since we now only do effect replication. If there are others, I would still like to get something like this in (maybe with a default off). I'm just slightly worried about silent corruption.
|
I think the tests are me forgetting to tag debug stuff. |
well, i still fear that feature, i have a feeling there are several other cases that we're missing. but in anyway, at the very least, we should add 2 conditions to prevent it.
|
Ok, I'm okay with a config. I'm not really sure we need the check for Redis 7, even if a user is on a lower version, they have to be sending traffic that could be unintentionally divergent. I would argue it's better to resync on a potential issue that silently ignore a real one. |
a re-sync is not necessarily a solution. did you try to run the tests with |
I agree, it's possible it's not better, but it's also possibly much worse. I think a config that defaults to disabled for now at least provides a good middle ground we can change later. I would argue that disconnects are surfaced better than log messages at least. I will do the grep test when I update the PR a little bit later today. |
Today it's possible for a replica to not know about the correct topology of its master (by design, only masters are authoritative about slots), so we can't really do any validation besides checking for "cross-slot" operations. Although I originally advocated for it, I now think we should just skip that extra validation and blindly accept whatever the master sends us and apply it. |
A weird dangly space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't have test coverage for repl_ignore_disk_write_error, the code seems pretty safe to merge without a test though..
Co-authored-by: Binbin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved by core-team meeting.
…efore a real write arrives from the master.
Co-authored-by: Binbin <[email protected]>
Missing a typeof, we will get errors like this: - multiple definition of `replicationErrorBehavior' - ld: error: duplicate symbol: replicationErrorBehavior Introduced in redis#10504
Missing a typeof, we will get errors like this: - multiple definition of `replicationErrorBehavior' - ld: error: duplicate symbol: replicationErrorBehavior Introduced in #10504
…tion errors (redis#10504) * Till now, replicas that were unable to persist, would still execute the commands they got from the master, now they'll panic by default, and we add a new `replica-ignore-disk-errors` config to change that. * Till now, when a command failed on a replica or AOF-loading, it only logged a warning and a stat, we add a new `propagation-error-behavior` config to allow panicking in that state (may become the default one day) Note that commands that fail on the replica can either indicate a bug that could cause data inconsistency between the replica and the master, or they could be in some cases (specifically in previous versions), a result of a command (e.g. EVAL) that failed on the master, but still had to be propagated to fail on the replica as well.
Missing a typeof, we will get errors like this: - multiple definition of `replicationErrorBehavior' - ld: error: duplicate symbol: replicationErrorBehavior Introduced in redis#10504
replica-ignore-disk-write-errors
config to change that.propagation-error-behavior
config to allow panicking in that state (may become the default one day)Note that commands that fail on the replica can either indicate a bug that could cause data inconsistency between the replica and the master, or they could be in some cases (specifically in previous versions), a result of a command (e.g. EVAL) that failed on the master, but still had to be propagated to fail on the replica as well.
Background
Based off our conversations about data corruption (#10419), it seemed like maybe we should add some defense here. If a replica tries to apply an invalid command, this flag will crash the replica. This has been enabled in tests.
to do:
replica-ignore-disk-write-errors
)