Introduce a History UUID as a requirement for ops based recovery #26577
Conversation
This is great. The one question I have is regarding isTranslogReadyForSequenceNumberBasedRecovery, where we used to have a norelease for the case when the local checkpoint on the target shard is greater than the maximum sequence number on the source shard. I think that should not be possible any more. Would you double check and make this a hard assertion?
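A minimal sketch of what such a hard assertion could look like (the variable names are illustrative, not taken from the actual method):

    // sketch: once the history uuid requirement is in place, the target's local checkpoint
    // should never be above the source's maximum sequence number
    assert targetLocalCheckpoint <= sourceMaxSeqNo :
        "target local checkpoint [" + targetLocalCheckpoint + "] is above the source max seq no [" + sourceMaxSeqNo + "]";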
translogUUID = loadTranslogUUIDFromCommit(writer);
// We expect that this shard already exists, so it must already have an existing translog else something is badly wrong!
if (translogUUID == null) {
    throw new IndexFormatTooOldException("translog", "translog has no generation nor a UUID - this might be an index from a previous version consider upgrading to N-1 first");
I recognize that this is only moving code but can you use Version.CURRENT to specify a more precise version in the error message rather than a generic message?
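For illustration only, the message could reference the concrete version along these lines (a sketch; the exact wording, and whether Version.CURRENT.minimumIndexCompatibilityVersion() is the right call to use, is up to the author):

    throw new IndexFormatTooOldException("translog",
        "translog has no generation nor a UUID - this might be an index from a version previous to "
            + Version.CURRENT.minimumIndexCompatibilityVersion() + ", consider upgrading to that version first");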
if (checkpoint.translogUUID.equals(TRANSLOG_UUID_NA) == false &&
    checkpoint.translogUUID.equals(translogUUID) == false
) {
    throw new TranslogCorruptedException("expected translog UUID " + translogUUID+ " but got: " + checkpoint.translogUUID +
Nit: translogUUID+
-> translogUUID +
if (checkpoint.historyUUID.equals(HISTORY_UUID_NA) == false &&
    checkpoint.historyUUID.equals(historyUUID) == false
) {
    throw new TranslogCorruptedException("expected history UUID " + historyUUID+ " but got: " + checkpoint.historyUUID +
Nit: historyUUID+
-> historyUUID +
long globalCheckpoint,
long minTranslogGeneration,
Path translogFile,
long generation, String translogUUID, String historyUUID) throws IOException {
Can you wrap these to the next two lines like the rest of the parameters?
I am trying to follow the logic here and I am not fully behind why we need this on the transaction log level. From my perspective (and I might be missing something) an ID on the lucene commit level would be enough (it could even just be a simple number starting with 0, which allows us to see how often it happened). We would create / update the ID every time except when we use open mode
I started with that - just a uuid in the commit. That is sadly not enough - we currently store the global checkpoint in the translog (ckp file) and recovery logic needs to rely on it and make sure it was written for the same history as on the primary and the same history as the replica's lucene index. Since there are stages in the recovery where we pave over the lucene index but don't yet touch the translog, we need to make sure that the checkpoint we read actually belongs to the lucene index. We currently don't really know which translog the checkpoint belongs to, and this is why I added the translog uuid to it. I think it's a good integrity check. We can use that translog uuid to tie the checkpoint back to lucene, which contains the history uuid. However, adding the history uuid to the checkpoint prepares the ground for 7.0 (where we won't need to deal with indices that don't have sequence numbers). At that point we want to move to using the history uuid as the primary identifier and use the translog uuid as an internal integrity marker for the translog. This will allow us to fully move into a "history based model" and drop the current hybrid of file generations and seq# (i.e., generation files become an internal implementation detail of the translog, with better isolation). If this doesn't feel natural to you, I can remove it and we can talk about it later.
That's appealing but we want to use the id for out of cluster semantics as well.
ok let's iterate on this for a moment. The translog ID is fixed for the entire transaction log lifecycle, so I don't understand why it's baked into the checkpoint. I really really dislike the fact that we write variable length strings into such a fragile file. I don't think this is necessary by any means. We can just write it into the file header or at some place like this, since it never changes?
As far as I can tell these are fixed length (22 bytes + 1 byte for length). I agree that's important.
It's written into the main translog files' headers, so we can open those up and validate. I felt that was more complicated (now we open multiple files - first the ckp, then the translog file generation it refers to). Also, we currently have no protection against people copying ckp files from one place to another. I feel such a protection is a nice bonus to add?
This checkpoint file was designed to be as small as possible and not to be the sink for all kinds of information. We tried to fight for every bit in there and now we add a bunch of random bytes to it; I feel like it's not the right thing. I tried to convince you about this the last time since there are so many unknowns and this is such a fragile area. I personally feel this translog gets a bit out of control and I am very very worried about it. I don't think we should design our transaction log around some crazy stuff we do with S/R and tools to recover transaction logs. We should rather fix S/R and the tools instead of dumping all kinds of information into a super critical file. This entire PR feels wrong to me to be honest and I have a hard time with it. If you want to add anything to it I think it should be at most a
We settled on deciding that the checkpoint file needs to stay below 512 bytes. I was aiming for the simplest possible code change while staying well below this limit. I can see your argument and I can move the history uuid to the translog files' header (i.e., smaller checkpoint, more code). If this is what we end up doing, I'll move this to a separate change and only use the translog uuid for the validation here (i.e., opening the translog.ckp file and the associated translog generation file for the header).
There is more to this than S/R. I will reach out to discuss this on a different channel.
As said, this was just a nice to have - by no means a watertight scheme.
It wasn't my intention to give the impression we should make this the goal. The goal to me is to try very hard to keep it as small as possible.
@jasontedor re ^^ - I might be missing something, but we can only do this when we actually do the snapshot/restore change?
@@ -174,15 +177,23 @@ public InternalEngine(EngineConfig engineConfig) throws EngineException {
switch (openMode) {
    case OPEN_INDEX_AND_TRANSLOG:
        writer = createWriter(false);
        String existingHistoryUUID = loadHistoryUUIDFromCommit(writer);
Determining whether to generate a fresh or reuse the existing historyUUID should be done based on the ShardRouting.recoverySource() and can be set in the engine config:
EMPTY_STORE, SNAPSHOT, LOCAL_SHARDS => create fresh history ID
EXISTING_STORE, PEER => use existing history id (if available, otherwise create fresh)
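A rough sketch of that idea (the local variable names and the wiring into the engine config are illustrative, not the actual code):

    // sketch: derive the decision from the shard's recovery source
    final boolean createFreshHistoryUUID;
    switch (shardRouting.recoverySource().getType()) {
        case EMPTY_STORE:
        case SNAPSHOT:
        case LOCAL_SHARDS:
            createFreshHistoryUUID = true;
            break;
        case EXISTING_STORE:
        case PEER:
            // reuse the existing history uuid if the commit has one, otherwise create a fresh one
            createFreshHistoryUUID = existingHistoryUUID == null;
            break;
        default:
            throw new AssertionError("unknown recovery source type");
    }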
That's a great suggestion. I prefer to do it in a follow up PR when I actually fix the snapshot issue.
I am ok with a followup. good catch @ywelsch
I left some questions. Looks great so far.
private String loadHistoryUUIDFromCommit(final IndexWriter writer) throws IOException {
    String uuid = commitDataAsMap(writer).get(HISTORY_UUID_KEY);
    if (uuid == null) {
        assert config().getIndexSettings().getIndexVersionCreated().before(Version.V_6_0_0_rc1) :
should we make this a hard fail? I wonder if we should return a null uuid here - it will just fail later?!
yeah, can do - it shouldn't happen anyway. I tend to communicate this "if this happens it's a bug" with assertions but I'm happy to convert to an IllegalState or something.
yeah ++
On second thought - assertions cause nodes to die, which means our integration tests will be stopped on the spot - with an illegal state, the nodes will recover by doing another recovery and we'll never see the failure.
fair enough
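For readers following this thread, the two options being weighed look roughly like this (an illustrative sketch, not the final code; the assertion message is made up):

    // option 1: assertion - a tripped assertion kills the node, so integration tests fail on the spot
    assert config().getIndexSettings().getIndexVersionCreated().before(Version.V_6_0_0_rc1) :
        "an index created on or after 6.0.0-rc1 is expected to have a history uuid in its commit";

    // option 2: hard fail - but the node may simply recover again and the bug could go unnoticed in tests
    if (uuid == null) {
        throw new IllegalStateException("commit is missing the history uuid");
    }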
@@ -366,7 +367,7 @@ public static long getStartingSeqNo(final RecoveryTarget recoveryTarget) {
    } else {
        return SequenceNumbers.UNASSIGNED_SEQ_NO;
    }
} catch (final IOException e) {
} catch (final IOException|TranslogCorruptedException e) {
hmm is this a bug?
it's a leftover from my previous change. Will remove.
thx
@@ -340,6 +351,12 @@ private void recoverFromTranslogInternal() throws IOException {
    flush(true, true);
} else if (translog.isCurrent(translogGeneration) == false) {
    commitIndexWriter(indexWriter, translog, lastCommittedSegmentInfos.getUserData().get(Engine.SYNC_COMMIT_ID));
    refreshLastCommittedSegmentInfos();
} else if (lastCommittedSegmentInfos.getUserData().containsKey(HISTORY_UUID_KEY) == false) {
do we also have to put it in the commit if the existing UUID is different from the one we have in the local var?
with the current logic it would be a bug - the history should stay the same for the index lifetime and only go from null -> a value. With Yannick's suggestion it might be needed, but I prefer to do it as a follow up.
ok
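For readers following the diff above, the new branch essentially re-commits so that a missing history uuid ends up in the commit user data. A hedged sketch of that idea (not the PR's actual code; the helper names follow the ones already quoted in this thread):

    // sketch: if the last commit has no history uuid, add one to the commit user data and commit again
    final Map<String, String> userData = new HashMap<>(commitDataAsMap(indexWriter));
    userData.put(HISTORY_UUID_KEY, historyUUID); // historyUUID was generated or loaded when the engine started
    indexWriter.setLiveCommitData(userData.entrySet());
    indexWriter.commit();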
LGTM
@jasontedor can you take a second look too when you have a few moments?
I left one comment, you can push when addressed.
@@ -162,10 +164,11 @@ void addIndices(
 * document-level semantics.
 */
writer.setLiveCommitData(() -> {
    final HashMap<String, String> liveCommitData = new HashMap<>(2);
    final HashMap<String, String> liveCommitData = new HashMap<>(3);
It looks like this should be four (it was already "wrong" at two). (It doesn't really matter, but if we are going to size the map, let's size it correctly and avoid a resize).
good one. will fix.
Thx @jasontedor @s1monw
* master:
  fix testSniffNodes to use the new error message
  Add check for invalid index in WildcardExpressionResolver (elastic#26409)
  Docs: Use single-node discovery.type for dev example
  Filter unsupported relation for range query builder (elastic#26620)
  Fix kuromoji default stoptags (elastic#26600)
  [Docs] Add description for missing fields in Reindex/Update/Delete By Query (elastic#26618)
  [Docs] Update ingest.asciidoc (elastic#26599)
  Better message text for ResponseException
  [DOCS] Remove edit link from ML node
  enable bwc testing
  fix StartRecoveryRequestTests.testSerialization
  Add bad_request to the rest-api-spec catch params (elastic#26539)
  Introduce a History UUID as a requirement for ops based recovery (elastic#26577)
  Add missing catch arguments to the rest api spec (elastic#26536)
…#26694) Restoring a shard from snapshot throws the primary back in time, violating assumptions and bringing the validity of global checkpoints into question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops based recovery to a shard that existed before the restore. To this end we have introduced the notion of `history_uuid` in #26577 and required that both source and target have the same history to allow ops based recoveries. This PR makes sure that a shard gets a new uuid after restore. As suggested by @ywelsch, I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different than the approach in #26557; I went this way as it means that shards always have a history uuid after being recovered on a 6.x node, and it also means that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid. Relates #10708 Closes #26544
The new ops based recovery, introduced as part of #10708, is based on the assumption that all operations below the global checkpoint known to the replica do not need to be synced with the primary. This is based on the guarantee that all ops below it are available on the primary and they are equal. Under normal operations this guarantee holds. Sadly, it can be violated when a primary is restored from an old snapshot. At that point the restored primary can miss operations below the replica's global checkpoint, or even worse may have totally different operations at the same spot. This PR introduces the notion of a history uuid to be able to capture the difference with the restored primary (in a follow up PR).
The History UUID is generated by a primary when it is first created and is synced to the replicas which are recovered via a file based recovery. The PR adds a requirement to ops based recovery to make sure that the history uuid of the source and the target are equal. Under normal operations, all shard copies will stay with that history uuid for the rest of the index lifetime and thus this is a noop. However, it gives us a place to guarantee we fall back to file based syncing in special events like a restore from snapshot (to be done as a follow up) and when someone calls the truncate translog command, which can go wrong when combined with primary recovery (this is done in this PR).
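Conceptually, the added requirement boils down to a check along these lines on the recovery source side (an illustrative sketch, not the PR's exact code; `translogCoversStartingSeqNo` is a stand-in for the existing readiness check):

    // sketch: ops based recovery is only attempted when both copies share the same history uuid
    final boolean sameHistory = sourceHistoryUUID.equals(targetHistoryUUID);
    if (sameHistory && translogCoversStartingSeqNo) {
        // ops based recovery: replay operations from the target's starting seq no
    } else {
        // file based recovery: copy segment files, which also carries the primary's history uuid over
    }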
We considered in the past using the translog uuid for this function (i.e., syncing it across copies) and thus avoiding an extra identifier. This idea was rejected as it removes the ability to verify that a specific translog really belongs to a specific lucene index. We also feel that having a history uuid will serve us well in the future.
Lastly, the PR also tightens up the connection between the checkpoint file, its translog and its lucene index by adding both the translog uuid and the history uuid to it and verifying them on read.
PS I still want to go through the tests and make sure the coverage is good enough. I also want to validate the BWC logic, which will only run properly on CI once this is backported. That said, I think we can start reviewing.