Snapshot Repositories Containing a Mix of pre and post v7.6 Snapshots Can Become Corrupted #57798
Closed
2 tasks done
Labels
>bug
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Comments
original-brownbear added the >bug and :Distributed Coordination/Snapshot/Restore labels on Jun 8, 2020
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
elasticmachine added the Team:Distributed (Obsolete) label on Jun 8, 2020
original-brownbear added a commit that referenced this issue on Jun 8, 2020:
* Fix Bug With RepositoryData Caching
  This fixes a really subtle bug with caching `RepositoryData` that can corrupt a repository. We were caching `RepositoryData` serialized in the newest metadata format. This led to a confusing situation where numeric shard generations would be cached in `ShardGenerations` even though they were not written to the repository, because the repository or cluster did not yet support `ShardGenerations`. When shard generations are not actually supported yet, these cached numeric generations are not safe, and there are multiple scenarios in which they would be incorrect, leading to the repository trying to read shard-level metadata from `index-N` blobs that don't exist. This commit makes it so that cached metadata is always in the same format as the metadata in the repository. Relates #57798
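To make the rule in this commit concrete, here is a minimal Java sketch under heavily simplified assumptions: `RepositoryData` is reduced to a map from shard id to generation, and the class `RepositoryDataCacheSketch` and its methods are hypothetical, not Elasticsearch code. It shows only the principle stated above: cache exactly what was written to the repository, so generations that were never persisted (because `ShardGenerations` is not yet supported) cannot leak into the cache.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, heavily simplified model of the RepositoryData caching fix.
// Not Elasticsearch code: RepositoryData is reduced to a map of shard id -> generation.
final class RepositoryDataCacheSketch {

    // What the in-memory cache currently holds.
    private Map<String, String> cachedShardGenerations = Collections.emptyMap();

    // Simulates writing RepositoryData and then caching it.
    // When the repository or cluster does not yet support ShardGenerations, the legacy
    // format is written, which (in this model) carries no shard generations at all.
    Map<String, String> writeAndCache(Map<String, String> shardGenerations,
                                      boolean supportsShardGenerations) {
        Map<String, String> written;
        if (supportsShardGenerations) {
            written = new HashMap<>(shardGenerations);
        } else {
            written = Collections.emptyMap(); // generations are not persisted
        }
        // The bug corresponds to caching `shardGenerations` (newest format) here regardless
        // of what was written. The fix: cache exactly what was written.
        cachedShardGenerations = written;
        return written;
    }

    // Read-only view of the cache.
    Map<String, String> cachedShardGenerations() {
        return Collections.unmodifiableMap(cachedShardGenerations);
    }
}
```

With `supportsShardGenerations` set to `false`, a numeric generation such as `"3"` passed to `writeAndCache` never reaches the cache, which is the property the commit restores.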
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue on Jun 8, 2020:
Fix broken numeric shard generations when reading them from the wire or from the physical repository. This should be the cheapest way to clean up broken shard generations in a backwards-compatible (BwC), safe-to-backport manner for now. We could potentially optimize this further by skipping the checks on the generations based on the versions we see in the `RepositoryData`, but I don't think it matters much, since we will read `RepositoryData` from the cache in almost all cases. Closes elastic#57798
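The read-side cleanup described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the actual change: it assumes that post-7.6 shard generations are non-numeric unique strings, so any value that parses as a number is treated as an untrusted legacy generation and replaced with `null` (meaning "unknown"). The class `ShardGenerationSanitizer` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not Elasticsearch's API.
final class ShardGenerationSanitizer {

    private ShardGenerationSanitizer() {}

    // Returns null for numeric (legacy) generations, which may be wrong, and keeps
    // non-numeric (unique string) generations unchanged. Null means "unknown".
    static String sanitize(String generation) {
        if (generation == null) {
            return null;
        }
        try {
            Long.parseLong(generation);
            return null; // numeric: assumed untrustworthy, drop it
        } catch (NumberFormatException e) {
            return generation; // unique string generation: keep as-is
        }
    }

    // Applies sanitize() to every shard generation read from the wire or the repository.
    // HashMap is used because it permits null values.
    static Map<String, String> sanitizeAll(Map<String, String> generations) {
        Map<String, String> fixed = new HashMap<>();
        for (Map.Entry<String, String> entry : generations.entrySet()) {
            fixed.put(entry.getKey(), sanitize(entry.getValue()));
        }
        return fixed;
    }
}
```

With a `null` generation, a caller would have to locate the shard's latest `index-N` blob some other way, for example by listing the shard directory; that fallback is likewise an assumption of this sketch.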
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue on Jun 10, 2020:
Use the hack from `CorruptedBlobStoreRepositoryIT` in more snapshot failure tests to verify that BwC repository metadata is handled properly in these previously untested scenarios. Also, some minor related dry-up of snapshot tests. Relates elastic#57798
Repositories that contain both snapshots from before version 7.6 and snapshots from after 7.6 can become dysfunctional, and in some cases corrupted, when used by ES v7.7 clusters, as a result of a mistake in how `RepositoryData` is cached.

The cached `RepositoryData` includes `ShardGenerations` with numeric generation values that might not be reliable: any failed snapshot finalization that had at least one individual shard snapshot would cause an incorrect shard generation to be tracked.

This leads to two stages of broken behavior:

1. As long as the repository or cluster does not yet support `ShardGenerations`, they will not be physically written to the repository. The issue shows up as errors when creating new snapshots of the affected shards, leading to `PARTIAL` snapshots because the affected shards will never snapshot successfully. Snapshot deletes will log the same error but otherwise work. This leads to the second stage of the issue, described below.
   At this stage of the problem, the repository can be fixed and further corruption prevented by setting the repository setting `cache_repository_data` to `false`.
2. The `RepositoryGenerations` that were incorrectly cached will be physically written to the repository. Once this has happened, the repository is physically corrupted, and the only way to fix it is to delete all snapshots referencing the broken shards.
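For the stage-one workaround, the setting can be changed by re-registering the repository through the `PUT _snapshot/<repository>` API. The Java sketch below only builds the request with the standard `java.net.http` client and does not send it. The repository name `my_backup`, its `fs` type, the `location` path and the `http://localhost:9200` host are placeholders, and the request is assumed to replace the whole repository definition, so the type and existing settings are repeated alongside `cache_repository_data`. Per the description above, this only helps before the incorrect generations have been written (stage two).

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: repository name, type, location and host are placeholders.
final class DisableRepositoryDataCache {

    private DisableRepositoryDataCache() {}

    // JSON body for PUT _snapshot/<repository>. The location is assumed to contain
    // no characters that need JSON escaping (quotes, backslashes).
    static String body(String location) {
        return "{\"type\":\"fs\",\"settings\":{"
                + "\"location\":\"" + location + "\","
                + "\"cache_repository_data\":false}}";
    }

    // Builds, but does not send, the request that re-registers the repository.
    static HttpRequest request(String host, String repository, String location) {
        return HttpRequest.newBuilder()
                .uri(URI.create(host + "/_snapshot/" + repository))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body(location)))
                .build();
    }
}
```

Passing the result to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` would apply the change.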
We will fix this in two steps:

1. Fix the `RepositoryData` caching: Fix Bug With RepositoryData Caching #57785
2. Fix broken numeric shard generations when reading `RepositoryData`, so that upgrading restores the affected repositories to full functionality

cc @ywelsch, @paulcoghlan