Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restoring a snapshot destroys history between the restored primary and existing replicas #26544

Closed
jasontedor opened this issue Sep 8, 2017 · 1 comment
Assignees
Labels
blocker :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.0.0-rc1

Comments

@jasontedor
Copy link
Member

Restoring a snapshot means that history on any replicas is no longer valid. Without a way to detect this situation, we can end up with a primary divergent from its replicas. We will add a new history UUID to address this situation.

jasontedor added a commit that referenced this issue Sep 8, 2017
This commit removes a norelease from the codebase now that there is a CI
job that fails on the norelease pattern being present. Instead, a new
issue has been opened to track this one.

Relates #26544
jasontedor added a commit that referenced this issue Sep 8, 2017
This commit removes a norelease from the codebase now that there is a CI
job that fails on the norelease pattern being present. Instead, a new
issue has been opened to track this one.

Relates #26544
jasontedor added a commit that referenced this issue Sep 8, 2017
This commit removes a norelease from the codebase now that there is a CI
job that fails on the norelease pattern being present. Instead, a new
issue has been opened to track this one.

Relates #26544
@ywelsch
Copy link
Contributor

ywelsch commented Sep 13, 2017

Note that the same issue applies when force-allocating an empty / stale primary using the reroute commands.

bleskes added a commit that referenced this issue Sep 19, 2017
…#26694)

Restoring a shard from snapshot throws the primary back in time violating assumptions and bringing the validity of global checkpoints in question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops based recovery to a shard that existed before the restore. To this end we have introduced the notion of `histroy_uuid` in #26577 and required that both source and target will have the same history to allow ops based recoveries. This PR make sure that a shard gets a new uuid after restore.

As suggested by @ywelsch , I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different than the approach in #26557, I went this way as it means that shards always have a history uuid after being recovered on a 6.x node and will also mean that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid.

Relates #10708
Closes #26544
bleskes added a commit that referenced this issue Sep 19, 2017
…#26694)

Restoring a shard from snapshot throws the primary back in time violating assumptions and bringing the validity of global checkpoints in question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops based recovery to a shard that existed before the restore. To this end we have introduced the notion of `histroy_uuid` in #26577 and required that both source and target will have the same history to allow ops based recoveries. This PR make sure that a shard gets a new uuid after restore.

As suggested by @ywelsch , I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different than the approach in #26557, I went this way as it means that shards always have a history uuid after being recovered on a 6.x node and will also mean that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid.

Relates #10708
Closes #26544
bleskes added a commit that referenced this issue Sep 19, 2017
…#26694)

Restoring a shard from snapshot throws the primary back in time violating assumptions and bringing the validity of global checkpoints in question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops based recovery to a shard that existed before the restore. To this end we have introduced the notion of `histroy_uuid` in #26577 and required that both source and target will have the same history to allow ops based recoveries. This PR make sure that a shard gets a new uuid after restore.

As suggested by @ywelsch , I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different than the approach in #26557, I went this way as it means that shards always have a history uuid after being recovered on a 6.x node and will also mean that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid.

Relates #10708
Closes #26544
@clintongormley clintongormley added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.0.0-rc1
Projects
None yet
Development

No branches or pull requests

5 participants