Network partitions can cause divergence, dirty reads, and lost updates. #20031
Comments
thanks @aphyr - we'll dive into it
Will this block the 5.0 release (it seems as though it should, since resiliency is one of the features being heavily touted in the new version)?
@aphyr thanks again for the clear description. Responding to the separate issues you mention:
With 5.0 we added the notion of "in sync copies" to keep track of valid copies. The valid copies are used to track which copies are allowed to become primaries. That set of what we call allocation ids is tracked in the cluster state. In the test, a change to this set that was committed by the master was lost during a partition, causing a stale shard copy to be promoted to primary when the partition healed. This caused data loss. This issue is addressed in #20384 and will be part of the imminent 5.0 beta1 release. With it in place we ran Jepsen for almost two days straight with no data loss. However, as mentioned in the ticket, there is still a very small chance of this happening. We have added an entry to our resiliency page about this. We are working to fix that too, although that will take quite a bit longer and will not make it for the 5.0 release. @aphyr I would love it if you could verify our findings. I will be happy to supply a snapshot build of beta1.
Hi @bleskes. I'd be delighted to work on verification. I've got an ongoing contract that's keeping me pretty busy right now, but if you'd like to send me an email ([email protected]) I'll reach out when I've got availability for more testing.
Pinging @elastic/es-distributed
I wonder: when the primary shard sends an operation to its replicas, one replica acknowledges the request to the primary but the other does not confirm, and the primary has not yet acknowledged to the client. If a read request is then sent to the replica that acknowledged, can it read the latest data that has not been acknowledged to the client? @bleskes
@saiergong Yes.
Many thanks @jasontedor, but do we think that's OK? Or do we have a plan to resolve this?
@saiergong Indeed, it is not ideal; however, we consider it to be a lower priority than the other problems that we are currently trying to solve.
I know, thanks @jasontedor
All known issues raised here related to divergence and lost updates of documents have been fixed as part of the sequence numbers effort and the new cluster coordination subsystem introduced in ES 7, with tests in place that run checks similar to what the Jepsen Elasticsearch tests do. Dirty reads are documented in the Elasticsearch reference docs, and I've opened a dedicated issue (#52400) to track any related work on this.
Hello again! I hope everyone's having a nice week. :-)
Since #7572 (A network partition can cause in flight documents to be lost) is now closed, and the resiliency page reports that "Loss of documents during network partition" is now addressed in 5.0.0, I've updated the Jepsen Elasticsearch tests for 5.0.0-alpha5. In these tests, Elasticsearch appears to allow:
This work was funded by Crate.io, which uses Elasticsearch internally and is affected by the same issues in their fork.
Here's the test code, and the supporting library which uses the Java client (also at 5.0.0-alpha5). I've lowered timeouts to help ES recover faster from the network shenanigans in our tests.
Several clients concurrently index documents, each with a unique ID that is never retried. Writes succeed if the index request returns RestStatus.CREATED.
Meanwhile, several clients concurrently attempt to get recently inserted documents by ID. Reads are considered successful if response.isExists() is true.
During this time, we perform a sequence of simple minority/majority network partitions: 20 seconds on, 10 seconds off.
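As a rough illustration of the "20 seconds on, 10 seconds off" cadence, here is a minimal Python sketch. It is not the Jepsen nemesis code; the function names, the node names n1..n5, and the assumption that the cycle starts with a partition are all hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def split_minority_majority(
    nodes: List[str], rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Randomly split nodes into a strict minority and the remaining majority."""
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    cut = (len(shuffled) - 1) // 2  # largest group that is still a strict minority
    return shuffled[:cut], shuffled[cut:]


def partition_schedule(
    duration_s: int, on_s: int = 20, off_s: int = 10
) -> Iterator[Tuple[int, str]]:
    """Yield (time offset in seconds, action) events for the test window.

    Assumption: the cycle starts with a partition; the final heal at the end
    of the test is handled separately by the caller.
    """
    t = 0
    while t < duration_s:
        yield (t, "partition")
        t += on_s
        if t >= duration_s:
            break
        yield (t, "heal")
        t += off_s


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical node names
    for offset, action in partition_schedule(60):
        if action == "partition":
            minority, majority = split_minority_majority(nodes, rng)
            print(offset, action, minority, majority)
        else:
            print(offset, action)
```

Each partition event picks a fresh minority, so over a long run different nodes end up isolated from the majority.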
At the end of the test, we heal the network, allow 10 seconds for quiescence (I've set ES timeouts/polling intervals much lower than normal to allow faster convergence), and have every client perform an index refresh. Refreshes are retried until every shard reports successful. Somewhat confusingly, getSuccessfulShards + getFailedShards is rarely equal to getTotalShards, and .getShardFailures appears to always be empty; perhaps there's an indeterminate, unreported shard state between success and failure? In any case, I check that getTotalShards is equal to getSuccessfulShards.
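The refresh-and-retry step can be sketched roughly as follows. This is a Python stand-in, not the actual test code: RefreshResult and refresh_until_complete are hypothetical names, and the refresh callable is assumed to wrap whatever client call reports the shard counts. The only part taken from the description above is the success condition, total shards equal to successful shards, rather than "failed shards is zero".

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RefreshResult:
    """Hypothetical container mirroring the shard counts a refresh reports."""
    total_shards: int
    successful_shards: int
    failed_shards: int


def refresh_until_complete(
    refresh: Callable[[], RefreshResult],
    max_attempts: int = 50,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> RefreshResult:
    """Retry refresh() until every shard reports success.

    Success is total == successful. Checking failed == 0 would not be enough,
    because successful + failed can be less than total.
    """
    for attempt in range(1, max_attempts + 1):
        result = refresh()
        if result.successful_shards == result.total_shards:
            return result
        if attempt < max_attempts:
            sleep(delay_s)
    raise RuntimeError(f"refresh incomplete after {max_attempts} attempts")
```

The sleep parameter is injectable only so the loop can be exercised without real delays.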
Once every client has performed a refresh, each client performs a read of all documents in the index. We're looking for four-ish cases:
All of these cases appear to occur. Here's a case where some acknowledged writes were lost by every node (757 lost, 3374 present):
And here's a case where nodes disagree on the contents of the index
not-on-all denotes documents which were present on some, but not all, nodes. some-lost is the subset of those documents which were successfully inserted and should have been present.
I've also demonstrated lost updates due to dirty read semantics plus _version divergence, but I haven't ported that test to ES 5.0.0 yet.
Cheers!
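For readers who want the not-on-all and some-lost categories spelled out, they reduce to set operations over the acknowledged writes and each node's final read. The sketch below is a Python illustration under that reading, not the Jepsen checker itself; the classify function, the "lost" key (for acknowledged documents missing from every node), and the sample IDs are assumptions.

```python
from typing import Dict, Set


def classify(acked: Set[int], reads: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Classify document IDs from the acknowledged writes and per-node final reads.

    acked: IDs whose index request was acknowledged as created.
    reads: node name -> set of IDs that node returned in the final read.
    """
    if not reads:
        raise ValueError("need at least one node's final read")
    on_any = set().union(*reads.values())        # seen by at least one node
    on_all = set.intersection(*reads.values())   # seen by every node
    not_on_all = on_any - on_all                 # nodes disagree on these
    return {
        "lost": acked - on_any,             # acknowledged, missing from every node
        "not-on-all": not_on_all,           # present on some, but not all, nodes
        "some-lost": not_on_all & acked,    # acknowledged, yet missing on some node
    }


if __name__ == "__main__":
    acked = {1, 2, 3, 4}
    reads = {"n1": {1, 2, 5, 6}, "n2": {1, 3, 5}}  # hypothetical final reads
    for name, ids in classify(acked, reads).items():
        print(name, sorted(ids))
```

In the sample data, document 4 was acknowledged but is on no node, documents 2 and 3 were acknowledged but each is missing from one node, and document 6 is an unacknowledged write that only one node has.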
Elasticsearch version: 5.0.0-alpha-5
Plugins installed: []
JVM version: Oracle JDK 1.8.0_91
OS version: Debian Jessie
Description of the problem including expected versus actual behavior: Elasticsearch should probably not forget about inserted documents which were acknowledged
Steps to reproduce:
elasticsearch
lein test