Update resiliency docs (#19303)
Adds clarifications about Jepsen tests and new section on issues with versioning.
ywelsch authored Jul 8, 2016
1 parent 982e01d commit 7dff8fb
Showing 1 changed file with 24 additions and 9 deletions.
33 changes: 24 additions & 9 deletions docs/resiliency/index.asciidoc
@@ -59,9 +59,9 @@ We are committed to tracking down and fixing all the issues that are posted.
==== Jepsen Tests

The Jepsen platform is specifically designed to test distributed systems. It is not a single test and is regularly adapted
-to create new scenarios. We have ported all published scenarios to our testing infrastructure. Of course
-as the system evolves, new scenarios can come up that are not yet covered. We are committed to investigating all new scenarios and will
-report issues that we find on this page and in our GitHub repository.
+to create new scenarios. We have currently ported all published Jepsen scenarios that deal with loss of acknowledged writes to our testing
+framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
+all new scenarios and will report issues that we find on this page and in our GitHub repository.

[float]
=== Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)
@@ -102,6 +102,19 @@ space. The following issues have been identified:

Other safeguards are tracked in the meta-issue {GIT}11511[#11511].

+[float]
+=== The _version field may not uniquely identify document content during a network partition (STATUS: ONGOING)
+
+When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue
+indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is
+partitioned away. It won't acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed.
+The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to
+step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from
+the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted
+writes for the same document (see {GIT}19269[#19269]).
+
+We are currently implementing Sequence numbers {GIT}10708[#10708] which better track primary changes. Sequence numbers thus provide a basis
+for uniquely identifying writes even in the presence of network partitions and will replace `_version` in operations that require this.

[float]
=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)
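
A note on the `_version` section added in the hunk above: the collision it describes can be sketched with a toy model. This is not Elasticsearch code. `ToyPrimary`, its `term` field, and the tuple layout are hypothetical names chosen for illustration, and `term` is only an assumed stand-in for whatever the sequence-number work uses to track primary changes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrimary:
    """Hypothetical stand-in for a primary shard copy (illustration only)."""
    term: int  # assumed counter that increases whenever a new primary is chosen
    docs: dict = field(default_factory=dict)  # doc id -> (version, term, content)

    def index(self, doc_id, content):
        # Bump the per-document version locally, before any replication happens.
        version = self.docs.get(doc_id, (0, None, None))[0] + 1
        self.docs[doc_id] = (version, self.term, content)
        return self.docs[doc_id]


# Both copies start from the same state: document "1" at version 1.
old_primary = ToyPrimary(term=1)
old_primary.index("1", "initial")
new_primary = ToyPrimary(term=2, docs=dict(old_primary.docs))

# During the partition, each side accepts a different write for the same document.
stale = old_primary.index("1", "written on the isolated primary")
fresh = new_primary.index("1", "written on the new primary")

same_version = stale[0] == fresh[0]              # both writes get version 2
same_content = stale[2] == fresh[2]              # but the contents differ
same_term_and_version = stale[:2] == fresh[:2]   # the term tells them apart
```

In this model a reader that sees only the version cannot distinguish the two writes, while a reader that also sees the term can.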
@@ -119,20 +132,22 @@ in the case of each type of failure. The plan is to have a test case that valida
[float]
=== Run Jepsen (STATUS: ONGOING)

-We have ported all of the known scenarios in the Jepsen blogs to our testing infrastructure. The new tests are run continuously in our
-testing farm and are passing. We are also working on running Jepsen independently to verify that no failures are found.
+We have ported the known scenarios in the Jepsen blogs that check loss of acknowledged writes to our testing infrastructure.
+The new tests are run continuously in our testing farm and are passing. We are also working on running Jepsen independently to verify
+that no failures are found.


== Unreleased

[float]
-=== Port Jepsen tests to our testing framework (STATUS: UNRELEASED, V5.0.0)
+=== Port Jepsen tests dealing with loss of acknowledged writes to our testing framework (STATUS: UNRELEASED, V5.0.0)

-We have increased our test coverage to include scenarios tested by Jepsen, as described in the Elasticsearch related blogs. We make heavy
-use of randomization to expand on the scenarios that can be tested and to introduce new error conditions.
+We have increased our test coverage to include scenarios tested by Jepsen that demonstrate loss of acknowledged writes, as described in
+the Elasticsearch related blogs. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce
+new error conditions.
You can follow the work on the master branch of the
https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class],
-where the `testAckedIndexing` test was specifically added to cover known Jepsen related scenarios.
+where the `testAckedIndexing` test was specifically added to check that we don't lose acknowledged writes in various failure scenarios.


[float]
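
A note on the last hunk: the property that `testAckedIndexing` is described as checking (no acknowledged write is lost) can be sketched as a randomized invariant check. The sketch below is a toy model under stated assumptions, not the Java test itself. `check_acked_writes`, `failure_rate`, and `lost_ids` are hypothetical names, and the "cluster" is just an in-memory set.

```python
import random


def check_acked_writes(seed, n_writes=50, failure_rate=0.3, lost_ids=frozenset()):
    """Toy model: index documents under random disruptions, then report
    every acknowledged id that is missing from the store afterwards."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    stored = set()             # ids the toy cluster holds
    acked = set()              # ids the client saw acknowledged
    for doc_id in range(n_writes):
        if rng.random() < failure_rate:
            # Simulated disruption: the client gets no acknowledgement,
            # and the write may or may not have been applied.
            if rng.random() < 0.5:
                stored.add(doc_id)
            continue
        stored.add(doc_id)
        acked.add(doc_id)
    # Simulated bug (hypothetical): the cluster forgets some writes after acking them.
    stored -= set(lost_ids)
    missing = acked - stored   # the invariant requires this to be empty
    return acked, missing


acked, missing = check_acked_writes(seed=1)
buggy_acked, buggy_missing = check_acked_writes(seed=1, lost_ids=set(range(10)))
```

Unacknowledged writes are allowed to land or not. Only writes the client saw acknowledged must survive, so `missing` stays empty in the first call, and in the second it contains exactly the acknowledged ids that were dropped.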
