Introduce a History UUID as a requirement for ops based recovery #26577
Conversation
This is great. The one question I have is regarding isTranslogReadyForSequenceNumberBasedRecovery, where we used to have a norelease for the case when the local checkpoint on the target shard is greater than the maximum sequence number on the source shard. I think that should not be possible any more. Would you double check and make this a hard assertion?
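A minimal sketch of what such a hard assertion could look like (the variable names are illustrative, not taken from the actual method):

    // sketch: once the history uuid requirement is in place, the target's local checkpoint
    // should never be above the source's maximum sequence number
    assert targetLocalCheckpoint <= sourceMaxSeqNo :
        "target local checkpoint [" + targetLocalCheckpoint + "] is above the source max seq no [" + sourceMaxSeqNo + "]";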
translogUUID = loadTranslogUUIDFromCommit(writer);
// We expect that this shard already exists, so it must already have an existing translog else something is badly wrong!
if (translogUUID == null) {
    throw new IndexFormatTooOldException("translog", "translog has no generation nor a UUID - this might be an index from a previous version consider upgrading to N-1 first");
I recognize that this is only moving code but can you use Version.CURRENT to specify a more precise version in the error message rather than a generic message?
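For illustration only, the message could reference the concrete version along these lines (a sketch; the exact wording, and whether Version.CURRENT.minimumIndexCompatibilityVersion() is the right call to use, is up to the author):

    throw new IndexFormatTooOldException("translog",
        "translog has no generation nor a UUID - this might be an index from a version previous to "
            + Version.CURRENT.minimumIndexCompatibilityVersion() + ", consider upgrading to that version first");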
if (checkpoint.translogUUID.equals(TRANSLOG_UUID_NA) == false &&
    checkpoint.translogUUID.equals(translogUUID) == false
) {
    throw new TranslogCorruptedException("expected translog UUID " + translogUUID+ " but got: " + checkpoint.translogUUID +
Nit: translogUUID+
-> translogUUID +
if (checkpoint.historyUUID.equals(HISTORY_UUID_NA) == false &&
    checkpoint.historyUUID.equals(historyUUID) == false
) {
    throw new TranslogCorruptedException("expected history UUID " + historyUUID+ " but got: " + checkpoint.historyUUID +
Nit: historyUUID+
-> historyUUID +
long globalCheckpoint,
long minTranslogGeneration,
Path translogFile,
long generation, String translogUUID, String historyUUID) throws IOException {
Can you wrap these to the next two lines like the rest of the parameters?
I am trying to follow the logic here and I am not fully behind why we need this on the transaction log level. From my perspective (and I might be missing something) an ID on the lucene commit level would be enough (it could even just be a simple number starting with 0, which allows us to see how often it happened). We would create / update the ID every time except when we use open mode
I started with that - just a uuid in the commit. That is sadly not enough - we currently store the global checkpoint in the translog (ckp file) and recovery logic needs to rely on it and make sure it was written for the same history as on the primary and the same history as the replica's lucene index. Since there are stages in the recovery where we pave over the lucene index but don't yet touch the translog, we need to make sure that the checkpoint we read actually belongs to the lucene index. We currently don't really know which translog the checkpoint belongs to, and this is why I added the translog uuid to it. I think it's a good integrity check. We can use that translog uuid to tie the checkpoint back to lucene, which contains the history uuid. However, adding the history uuid to the checkpoint prepares the ground for 7.0 (where we won't need to deal with indices that don't have sequence numbers). At that point we want to move to using the history uuid as the primary identifier and use the translog uuid as an internal integrity marker for the translog. This will allow us to fully move into a "history based model" and drop the current hybrid of file generations and seq# (i.e., generation files become an internal implementation detail of the translog, with better isolation). If this doesn't feel natural to you, I can remove it and we can talk about it later.
That's appealing but we want to use the id for out of cluster semantics as well.
ok let's iterate on this for a moment. The translog ID is fixed for the entire transaction log lifecycle, so I don't understand why it's baked into the checkpoint. I really really dislike the fact that we write variable length strings into such a fragile file. I don't think this is necessary by any means. We can just write it into the file header or at some place like this, since it never changes?
As far as I can tell these are fixed length (22 bytes + 1 byte for length). I agree that's important.
It's written into the main translog files' headers, so we can open those up and validate. I felt that was more complicated (now we open multiple files - first the ckp, then the translog file generation it refers to). Also, we currently have no protection against people copying ckp files from one place to another. I feel such a protection is a nice bonus to add?
This checkpoint file was designed to be as small as possible and not to be the sink for all kinds of information. We tried to fight for every bit in there and now we add a bunch of random bytes to it; I feel like it's not the right thing. I tried to convince you about this the last time since there are so many unknowns and this is such a fragile area. I personally feel this translog gets a bit out of control and I am very very worried about it. I don't think we should design our transaction log around some crazy stuff we do with S/R and tools to recover transaction logs. We should rather fix S/R and the tools instead of dumping all kinds of information into a super critical file. This entire PR feels wrong to me to be honest and I have a hard time with it. If you want to add anything to it I think it should be at most a
We settled on deciding that the checkpoint file needs to stay below 512 bytes. I was aiming for the simplest possible code change while staying well below this limit. I can see your argument and I can move the history uuid to the translog files' header (i.e., smaller checkpoint, more code). If this is what we end up doing, I'll move this to a separate change and only use the translog uuid for the validation here (i.e., opening the translog.ckp file and the associated translog generation file for the header).
There is more to this than S/R. I will reach out to discuss this on a different channel.
As said, this was just a nice to have - by no means a watertight scheme.
It wasn't my intention to give the impression we should make this the goal. The goal to me is to try very hard to keep it as small as possible.
@jasontedor re ^^ - I might be missing something, but we can only do this when we actually do the snapshot/restore change?
@@ -174,15 +177,23 @@ public InternalEngine(EngineConfig engineConfig) throws EngineException {
switch (openMode) {
    case OPEN_INDEX_AND_TRANSLOG:
        writer = createWriter(false);
        String existingHistoryUUID = loadHistoryUUIDFromCommit(writer);
Determining whether to generate a fresh or reuse the existing historyUUID should be done based on the ShardRouting.recoverySource() and can be set in the engine config:
EMPTY_STORE, SNAPSHOT, LOCAL_SHARDS => create fresh history ID
EXISTING_STORE, PEER => use existing history id (if available, otherwise create fresh)
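A rough sketch of that idea (the local variable names and the wiring into the engine config are illustrative, not the actual code):

    // sketch: derive the decision from the shard's recovery source
    final boolean createFreshHistoryUUID;
    switch (shardRouting.recoverySource().getType()) {
        case EMPTY_STORE:
        case SNAPSHOT:
        case LOCAL_SHARDS:
            createFreshHistoryUUID = true;
            break;
        case EXISTING_STORE:
        case PEER:
            // reuse the existing history uuid if the commit has one, otherwise create a fresh one
            createFreshHistoryUUID = existingHistoryUUID == null;
            break;
        default:
            throw new AssertionError("unknown recovery source type");
    }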
That's a great suggestion. I prefer to do it in a follow up PR when I actually fix the snapshot issue.
I am ok with a followup. good catch @ywelsch
I left some questions. Looks great so far.
private String loadHistoryUUIDFromCommit(final IndexWriter writer) throws IOException {
    String uuid = commitDataAsMap(writer).get(HISTORY_UUID_KEY);
    if (uuid == null) {
        assert config().getIndexSettings().getIndexVersionCreated().before(Version.V_6_0_0_rc1) :
should we make this a hard fail? I wonder if we should return a null uuid here - it will just fail later?!
yeah, can do - it shouldn't happen anyway. I tend to communicate this "if this happens it's a bug" with assertions but I'm happy to convert to an IllegalState or something.
yeah ++
On second thought - assertions cause nodes to die, which means our integration tests will be stopped on the spot - with an illegal state, the nodes will recover by doing another recovery and we'll never see the failure.
fair enough
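For readers following this thread, the two options being weighed look roughly like this (an illustrative sketch, not the final code; the assertion message is made up):

    // option 1: assertion - a tripped assertion kills the node, so integration tests fail on the spot
    assert config().getIndexSettings().getIndexVersionCreated().before(Version.V_6_0_0_rc1) :
        "an index created on or after 6.0.0-rc1 is expected to have a history uuid in its commit";

    // option 2: hard fail - but the node may simply recover again and the bug could go unnoticed in tests
    if (uuid == null) {
        throw new IllegalStateException("commit is missing the history uuid");
    }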
@@ -366,7 +367,7 @@ public static long getStartingSeqNo(final RecoveryTarget recoveryTarget) {
    } else {
        return SequenceNumbers.UNASSIGNED_SEQ_NO;
    }
} catch (final IOException e) {
} catch (final IOException|TranslogCorruptedException e) {
hmm is this a bug?
it's a leftover from my previous change. Will remove.
thx
@@ -340,6 +351,12 @@ private void recoverFromTranslogInternal() throws IOException {
    flush(true, true);
} else if (translog.isCurrent(translogGeneration) == false) {
    commitIndexWriter(indexWriter, translog, lastCommittedSegmentInfos.getUserData().get(Engine.SYNC_COMMIT_ID));
    refreshLastCommittedSegmentInfos();
} else if (lastCommittedSegmentInfos.getUserData().containsKey(HISTORY_UUID_KEY) == false) {
do we also have to put it in the commit if the existing UUID is different from the one we have in the local var?
with the current logic it would be a bug - the history should stay the same for the index lifetime and only go from null -> a value. With Yannick's suggestion it might be needed, but I prefer to do it as a follow up.
ok
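For readers following the diff above, the new branch essentially re-commits so that a missing history uuid ends up in the commit user data. A hedged sketch of that idea (not the PR's actual code; the helper names follow the ones already quoted in this thread):

    // sketch: if the last commit has no history uuid, add one to the commit user data and commit again
    final Map<String, String> userData = new HashMap<>(commitDataAsMap(indexWriter));
    userData.put(HISTORY_UUID_KEY, historyUUID); // historyUUID was generated or loaded when the engine started
    indexWriter.setLiveCommitData(userData.entrySet());
    indexWriter.commit();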
LGTM
@jasontedor can you take a second look too when you have a few moments?
I left one comment, you can push when addressed.
@@ -162,10 +164,11 @@ void addIndices(
 * document-level semantics.
 */
writer.setLiveCommitData(() -> {
    final HashMap<String, String> liveCommitData = new HashMap<>(2);
    final HashMap<String, String> liveCommitData = new HashMap<>(3);
It looks like this should be four (it was already "wrong" at two). (It doesn't really matter, but if we are going to size the map, let's size it correctly and avoid a resize).
good one. will fix.
Thx @jasontedor @s1monw
* master:
  fix testSniffNodes to use the new error message
  Add check for invalid index in WildcardExpressionResolver (elastic#26409)
  Docs: Use single-node discovery.type for dev example
  Filter unsupported relation for range query builder (elastic#26620)
  Fix kuromoji default stoptags (elastic#26600)
  [Docs] Add description for missing fields in Reindex/Update/Delete By Query (elastic#26618)
  [Docs] Update ingest.asciidoc (elastic#26599)
  Better message text for ResponseException
  [DOCS] Remove edit link from ML node
  enable bwc testing
  fix StartRecoveryRequestTests.testSerialization
  Add bad_request to the rest-api-spec catch params (elastic#26539)
  Introduce a History UUID as a requirement for ops based recovery (elastic#26577)
  Add missing catch arguments to the rest api spec (elastic#26536)
…#26694) Restoring a shard from snapshot throws the primary back in time, violating assumptions and bringing the validity of global checkpoints into question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops based recovery to a shard that existed before the restore. To this end we have introduced the notion of `history_uuid` in #26577 and required that both source and target have the same history to allow ops based recoveries. This PR makes sure that a shard gets a new uuid after restore. As suggested by @ywelsch, I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different than the approach in #26557; I went this way as it means that shards always have a history uuid after being recovered on a 6.x node, and it also means that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid. Relates #10708 Closes #26544
The new ops based recovery, introduced as part of #10708, is based on the assumption that all operations below the global checkpoint known to the replica do not need to be synced with the primary. This is based on the guarantee that all ops below it are available on the primary and they are equal. Under normal operations this guarantee holds. Sadly, it can be violated when a primary is restored from an old snapshot. At that point the restored primary can miss operations below the replica's global checkpoint, or even worse may have totally different operations at the same spot. This PR introduces the notion of a history uuid to be able to capture the difference with the restored primary (in a follow up PR).
The History UUID is generated by a primary when it is first created and is synced to the replicas which are recovered via a file based recovery. The PR adds a requirement to ops based recovery to make sure that the history uuid of the source and the target are equal. Under normal operations, all shard copies will stay with that history uuid for the rest of the index lifetime and thus this is a noop. However, it gives us a place to guarantee we fall back to file based syncing in special events like a restore from snapshot (to be done as a follow up) and when someone calls the truncate translog command, which can go wrong when combined with primary recovery (this is done in this PR).
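Conceptually, the added requirement boils down to a check along these lines on the recovery source side (an illustrative sketch, not the PR's exact code; `translogCoversStartingSeqNo` is a stand-in for the existing readiness check):

    // sketch: ops based recovery is only attempted when both copies share the same history uuid
    final boolean sameHistory = sourceHistoryUUID.equals(targetHistoryUUID);
    if (sameHistory && translogCoversStartingSeqNo) {
        // ops based recovery: replay operations from the target's starting seq no
    } else {
        // file based recovery: copy segment files, which also carries the primary's history uuid over
    }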
We considered in the past using the translog uuid for this function (i.e., syncing it across copies) and thus avoiding an extra identifier. This idea was rejected as it removes the ability to verify that a specific translog really belongs to a specific lucene index. We also feel that having a history uuid will serve us well in the future.
Lastly, the PR also tightens up the connection between the checkpoint file, its translog and its lucene index by adding both the translog uuid and the history uuid to it and verifying them on read.
PS I still want to go through the tests and make sure the coverage is good enough. I also want to validate the BWC logic, which will only run properly on CI once this is backported. That said, I think we can start reviewing.