Restoring a snapshot destroys history between the restored primary and existing replicas #26544

jasontedor · 2017-09-08T01:56:42Z

Restoring a snapshot means that history on any replicas is no longer valid. Without a way to detect this situation, we can end up with a primary divergent from its replicas. We will add a new history UUID to address this situation.

This commit removes a norelease marker from the codebase, now that a CI job fails whenever the norelease pattern is present. A new issue has been opened to track the problem instead. Relates #26544

ywelsch · 2017-09-13T08:11:40Z

Note that the same issue applies when force-allocating an empty / stale primary using the reroute commands.

@ywelsch

…#26694) Restoring a shard from a snapshot throws the primary back in time, violating assumptions and calling the validity of global checkpoints into question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops-based recovery to a shard that existed before the restore. To this end we introduced the notion of `history_uuid` in #26577 and required that both source and target have the same history to allow ops-based recoveries. This PR makes sure that a shard gets a new uuid after restore. As suggested by @ywelsch, I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different from the approach in #26557; I went this way as it means that shards always have a history uuid after being recovered on a 6.x node, and it also means that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid. Relates #10708 Closes #26544
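The rule described in this commit message can be sketched roughly as follows. This is a toy Python model, not the Elasticsearch (Java) implementation: `RecoverySource` is named in the message, but the enum members, `ShardCopy`, `history_uuid_after_recovery` and `can_use_ops_based_recovery` are hypothetical names chosen for illustration. It also assumes that the "new uuid after restore" is the history uuid, which is how the title of #26694 reads.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RecoverySource(Enum):
    # Hypothetical member names; they mirror the recovery kinds named in the message.
    EXISTING_STORE = "existing_store"
    PEER = "peer"
    LOCAL_SHARDS = "local_shards"
    SNAPSHOT = "snapshot"


@dataclass
class ShardCopy:
    # None models a copy of an old (5.x) index that has no history uuid yet.
    history_uuid: Optional[str] = None


def history_uuid_after_recovery(source: RecoverySource, existing: Optional[str]) -> str:
    """Return the history uuid a shard copy carries after recovering via `source`."""
    if source in (RecoverySource.SNAPSHOT, RecoverySource.LOCAL_SHARDS):
        # Assumption: these recoveries always start a fresh history.
        return uuid.uuid4().hex
    # Store and peer recovery: generate a uuid only if none exists yet.
    return existing if existing is not None else uuid.uuid4().hex


def can_use_ops_based_recovery(source: ShardCopy, target: ShardCopy) -> bool:
    """Ops-based recovery is allowed only when both copies share the same history."""
    return source.history_uuid is not None and source.history_uuid == target.history_uuid


# A replica that existed before the restore keeps the old history uuid.
replica = ShardCopy(history_uuid="old-history")
# The primary is restored from a snapshot and therefore gets a new history uuid.
restored_primary = ShardCopy(
    history_uuid_after_recovery(RecoverySource.SNAPSHOT, "old-history")
)
# The histories differ, so the replica cannot be caught up by replaying operations.
ops_based_allowed = can_use_ops_based_recovery(restored_primary, replica)
```

In this model the mismatch is what rules out the ops-based path for a replica that predates the restore, which is the requirement stated in the message.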

jasontedor added :Sequence IDs blocker v6.0.0 labels Sep 8, 2017

jasontedor assigned bleskes Sep 8, 2017

bleskes mentioned this issue Sep 18, 2017

Restoring from snapshot should force generation of a new history uuid #26694

Merged

bleskes closed this as completed in #26694 Sep 19, 2017

imotov mentioned this issue Sep 19, 2017

[CI] SharedClusterSnapshotRestoreIT.testBasicWorkFlow failure #26436

Closed

colings86 added v6.0.0-rc1 and removed v6.0.0 labels Sep 22, 2017

clintongormley added :Distributed Indexing/Engine and removed :Sequence IDs labels Feb 14, 2018
