Lost update and cyclic information flow on process crash #14890
Comments
Here's another case, this one on 3.5.6: http://jepsen.io.s3.amazonaws.com/analyses/etcd-3.5.6/20221204T185724.970-0500.zip
Oh this is fascinating. Check this cycle (from the 3.5.6 case just above this comment) out. Recall that we only ever write values once. The only write of key 15063 with value 23 was:

```clj
{:member-id 5579440767535971996, :revision 220019, :raft-term 15},
{:prev-kv
 [15063
  {:value 29,
   :version 20,
   :create-revision 219980,
   :mod-revision 220017}],
 :prev-kv? true,
 :header {:member-id 0, :revision 220019, :raft-term 0}}
```

Yet somehow we have two different reads of 15063 = 23:

```clj
(->> t :history h/client-ops h/oks
     (h/filter (fn [op] (some (fn [[f k v]] (and (= k 15063) (= v 23))) (:value op))))
     pprint)
...
{:member-id 5198333996179334402, :revision 220003, :raft-term 14},
{:count 1,
 :kvs
 {15063
  {:value 23,
   :version 14,
   :create-revision 219980,
   :mod-revision 220002}}

{:member-id 5579440767535971996, :revision 220022, :raft-term 15},
{15063
 {:value 23,
  :version 21,
  :create-revision 219980,
  :mod-revision 220019}},
```

I've clipped these a bit for clarity; the full logs are here:
```clj
jepsen.etcd=> (->> t :history h/client-ops h/oks (h/filter (fn [op] (some (fn [[f k v]] (and (= k 15063) (= v 23))) (:value op)))) pprint)
[{:process 484,
:type :ok,
:f :txn,
:value [[:r 15062 29] [:r 15063 23] [:w 15062 32] [:r 15062 32]],
:debug
{:txn-res
{:succeeded? true,
:txns [],
:header
{:member-id 5198333996179334402, :revision 220003, :raft-term 14},
:results
[{:count 1,
:kvs
{15062
{:value 29,
:version 16,
:create-revision 219976,
:mod-revision 219999}},
:more? false,
:header {:member-id 0, :revision 220002, :raft-term 0}}
{:count 1,
:kvs
{15063
{:value 23,
:version 14,
:create-revision 219980,
:mod-revision 220002}},
:more? false,
:header {:member-id 0, :revision 220002, :raft-term 0}}
{:prev-kv
[15062
{:value 29,
:version 16,
:create-revision 219976,
:mod-revision 219999}],
:prev-kv? true,
:header {:member-id 0, :revision 220003, :raft-term 0}}
{:count 1,
:kvs
{15062
{:value 32,
:version 17,
:create-revision 219976,
:mod-revision 220003}},
:more? false,
:header {:member-id 0, :revision 220003, :raft-term 0}}]}},
:index 780977,
:time 117944800025}
{:process 438,
:type :ok,
:f :txn,
:value [[:r 15063 23] [:w 15064 3]],
:debug
{:txn-res
{:succeeded? true,
:txns [],
:header
{:member-id 5579440767535971996, :revision 220022, :raft-term 15},
:results
[{:count 1,
:kvs
{15063
{:value 23,
:version 21,
:create-revision 219980,
:mod-revision 220019}},
:more? false,
:header {:member-id 0, :revision 220021, :raft-term 0}}
{:prev-kv
[15064
{:value 27,
:version 10,
:create-revision 220004,
:mod-revision 220021}],
:prev-kv? true,
:header {:member-id 0, :revision 220022, :raft-term 0}}]}},
:index 781064,
:time 122404476575}
{:process 433,
:type :ok,
:f :txn,
:value [[:r 15062 32] [:w 15063 23] [:r 15063 23]],
:debug
{:txn-res
{:succeeded? true,
:txns [],
:header
{:member-id 5579440767535971996, :revision 220019, :raft-term 15},
:results
[{:count 1,
:kvs
{15062
{:value 32,
:version 17,
:create-revision 219976,
:mod-revision 220003}},
:more? false,
:header {:member-id 0, :revision 220018, :raft-term 0}}
{:prev-kv
[15063
{:value 29,
:version 20,
:create-revision 219980,
:mod-revision 220017}],
:prev-kv? true,
:header {:member-id 0, :revision 220019, :raft-term 0}}
{:count 1,
:kvs
{15063
{:value 23,
:version 21,
:create-revision 219980,
:mod-revision 220019}},
:more? false,
:header {:member-id 0, :revision 220019, :raft-term 0}}]}},
:index 781066,
:time 122404512264}
{:process 435,
:type :ok,
:f :txn,
:value [[:w 15064 1] [:r 15063 23]],
:debug
{:txn-res
{:succeeded? true,
:txns [],
:header
{:member-id -5790610604219245522, :revision 220024, :raft-term 15},
:results
[{:prev-kv
[15064
{:value 31,
:version 12,
:create-revision 220004,
:mod-revision 220023}],
:prev-kv? true,
:header {:member-id 0, :revision 220024, :raft-term 0}}
{:count 1,
:kvs
{15063
{:value 23,
:version 21,
:create-revision 219980,
:mod-revision 220019}},
:more? false,
:header {:member-id 0, :revision 220024, :raft-term 0}}]}},
:index 781067,
:time 122404516208}]
```
This is impossible! The write didn't take place until term 15 and revision 220019, but somehow at term 14 and revision 220003 we've got a client reading the value 23, apparently written at revision 220002. OK, what transaction interacted with revision 220002?

```clj
jepsen.etcd=> (->> t :history h/client-ops h/oks (h/filter (fn [op] (-> op :debug :txn-res :header :revision (= 220002)))) pprint)
[]
```

Nothing we know of--could have been a timed-out txn, though. 220001?

```clj
jepsen.etcd=> (->> t :history h/client-ops h/oks (h/filter (fn [op] (-> op :debug :txn-res :header :revision (= 220001)))) pprint)
[{:process 382,
:type :ok,
:f :txn,
:value [[:r 15062 29] [:r 15062 29] [:w 15063 22] [:r 15063 22]],
:debug
{:txn-res
{:succeeded? true,
:txns [],
:header
{:member-id 1248562974741439108, :revision 220001, :raft-term 14},
:results
[{:count 1,
:kvs
{15062
{:value 29,
:version 16,
:create-revision 219976,
:mod-revision 219999}},
:more? false,
:header {:member-id 0, :revision 220000, :raft-term 0}}
{:count 1,
:kvs
{15062
{:value 29,
:version 16,
:create-revision 219976,
:mod-revision 219999}},
:more? false,
:header {:member-id 0, :revision 220000, :raft-term 0}}
{:prev-kv
[15063
{:value 19,
:version 12,
:create-revision 219980,
:mod-revision 220000}],
:prev-kv? true,
:header {:member-id 0, :revision 220001, :raft-term 0}}
{:count 1,
:kvs
{15063
{:value 22,
:version 13,
:create-revision 219980,
:mod-revision 220001}},
:more? false,
:header {:member-id 0, :revision 220001, :raft-term 0}}]}},
:index 780979,
:time 117945424227}]
```

This is a write of 22 (not 23!) to key 15063. It executed at revision 220001, raft term 14. That write overwrote value 19 at version 12, mod-revision 220000, and returned a header with revision 220001. Something feels deeply wrong here. We shouldn't have these two versions of 15063 = 23 with different mod-revisions! It almost feels like the leader for term 14 somehow got a copy of the write of 23 and, like... applied it at revision 220002, let that write be visible to a reader, but then the leader for term 15 applied it and won out later. Both versions have the same create-revision, which helps rule out something weird like state leaking over from an earlier version of the test cluster... what on earth is going on here?
Looks interesting. I'd be happy to collaborate on reproducing this; however, I'm struggling a bit to follow the Clojure. I'm working on introducing linearizability logs to etcd in #14045, but we still haven't gotten to implementing TXN. Can you give more context on how you validate linearizability of Txn requests?
@aphyr Thanks for reporting this. Before I take a deep dive into this, just a couple of quick questions:
Yup! This is broadly the same strategy we used in the 2020 list append workload. I've also added a write-read register workload based on Elle.
Yup! Takes about 5 minutes to hit a strict-1SR violation with the repro instructions I gave above. Maybe a bit longer for G1c specifically.
As I mentioned above, 3.5.6 exhibits the same behavior.
I'm afraid I'm in the middle of moving, so my availability is going to be a hot mess for a few months. And, well... to be somewhat frank, my experience in #14092 was profoundly disappointing; I put weeks of time into describing and analyzing that behavior and it led nowhere.
I just got time to take a closer look at this issue. It does not look like an issue to me. Note that there is no cyclic dependency between any two transactions. All transactions are executed sequentially in etcd. Let's use the second issue (under G1c) as an example. Obviously the TXN in process 73 happened before the TXN in process 106, because the revision in the header of process 73's TXN is smaller than that of process 106's. All the revisions (including create-revision, mod-revision, and the revision in the header) make perfect sense to me. Note:
@aphyr Please let me know if you have any comments or concerns.
I have many concerns. At the risk of repeating myself, the cycle is right here:
Sure. You’ll find a detailed explanation of how Elle finds cycles via two properties we call recoverability and traceability in the Elle paper, also linked in the Elle README. Pardon my brevity here: my computer is in pieces as I prepare for moving. We perform transactions comprised of unguarded writes and reads across a rotating pool of keys with exponentially distributed frequency. Choosing unique writes allows recoverability. We infer constraints on the version graph for each key using the known initial version
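For intuition, here is a minimal sketch — in Go rather than Elle's Clojure, and not Elle's actual implementation — of why unique written values make write-read dependencies recoverable: every read of a value can be traced back to the single transaction that wrote it. The toy history mirrors the G1c example from this report (key 1754 = 11 and key 1756 = 27); the type and function names are invented for illustration.

```go
package main

import "fmt"

// Op is one micro-operation in a transaction: a read or a write of key K.
type Op struct {
	F string // "r" or "w"
	K int
	V int
}

// Txn is a committed transaction, as the test harness observed it.
type Txn struct {
	ID  int
	Ops []Op
}

func main() {
	// Toy history; values are unique per key because the workload never
	// writes the same value twice.
	history := []Txn{
		{ID: 1, Ops: []Op{{"w", 1754, 11}, {"r", 1756, 27}}},
		{ID: 2, Ops: []Op{{"r", 1754, 11}, {"w", 1756, 27}}},
	}

	// Recoverability: each (key, value) pair maps to exactly one writer.
	writer := map[[2]int]int{}
	for _, t := range history {
		for _, op := range t.Ops {
			if op.F == "w" {
				writer[[2]int{op.K, op.V}] = t.ID
			}
		}
	}

	// Derive write-read (wr) dependency edges: "reader observed writer".
	for _, t := range history {
		for _, op := range t.Ops {
			if op.F == "r" {
				if w, ok := writer[[2]int{op.K, op.V}]; ok && w != t.ID {
					fmt.Printf("T%d -wr-> T%d (key %d = %d)\n", w, t.ID, op.K, op.V)
				}
			}
		}
	}
}
```

Running this prints a wr edge in each direction between the two transactions — exactly the kind of cycle that gets flagged as G1c.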
@ahrtr I think there are multiple assumptions made in Jepsen that are not obvious. Each write writes a unique value, so 2 transactions Write returned revision number

@aphyr thanks for the context on the high-level approach. I think I have some basic level of understanding. What I'm missing is implementation details like translating Jepsen terms

Would love to get a dump of etcd requests/responses in an easy-to-parse format. With them I should be able to validate what exactly happened. Maybe I missed them in the report, so could you maybe point them out?
Thanks a lot for your work, Kyle. I've just briefly browsed the etcd logs from your s3 bucket. One thing I found strange is on n5, where it says:
and it constantly complains about this on every snapshot. The other nodes do not report any of that. Aside from that, the T4 read that was found loopy ended up on n4, which during that time was actually restarted along with one other node. So I wouldn't rule out a split brain. Is there a way to visualize this a bit better? Maybe have the read/write plots also include which member they targeted?
I still do not see any issue. Please see the explanation below.

Key: 1756, value: 27 (under the G1c)
Just as I mentioned previously, the TXN in process 73 happened before the TXN in process 106. Note the same value

Key: 1754, value: 11 (under the G1c)
Process

Key: 277, value: 1/4/11 (under the Lost Update)
There is no data loss. All the historical values are persisted in the db file. The reason why the second TXN still read value

@aphyr Please let me know if you still have any concerns; I am happy to resolve all of them.
When Process 73 was writing the value, the version of the key was 15, and the value was 13. See #14890 (comment). If you look at the etcd dump data, the version of the key should be 18 rather than 15. I am unsure whether the graph you created is wrong or the actual reply was problematic. But there is something wrong... Maybe @aphyr (I am no longer on the maintainer list :P, so do not expect me to be very responsive) can clarify whether Jepsen would write duplicate values to the same key. If the assumption is no, then either Jepsen is buggy (which is unlikely, as it is a very simple producer...), or etcd does double writes somehow.
Thanks for pointing this out. Yes, it's a typo. The version should be 18 instead of 15. @aphyr this doesn't affect my previous comment. Please still let me know if you have any concerns.
What is the status of this? Is it an actual problem? I can't quite tell from the comments :)
Based on my previous comment #14890 (comment), I don't see any issue. I do not see any new feedback from the reporter @aphyr either. I will close this ticket if no further feedback is provided within the following two weeks.
I have done extensive local testing of the same scenario using linearizability tests. I was not able to reproduce the issue even once. Possibly it's still a limitation of the framework; however, I'm starting to suspect that it's a bug in the Jepsen etcd client. I plan to send a PR that adds this scenario to the nightly linearizability tests that run for ~1h every day.
I'm not sure how else I can convince y'all that these are serious issues, and the fact that etcd's maintainers do not see anything wrong here has me dismayed. Transactions in a serializable system are not supposed to observe transactions which did not succeed, to read their own effects, or to apparently diverge into multiple timelines. The entire point of etcd is to ensure linearizability and serializability, but after months of experimentation I cannot figure out how to get etcd to do those things. I think it very likely that either etcd or jetcd is badly broken. Either of these things is cause for alarm. I've given up on this. I've put months of unpaid work into these behaviors and written a full report, but the entity which sponsored some of this work earlier last year (and allowed me to file some of these tickets) opted not to allow Jepsen to publicly acknowledge their support. Per Jepsen's ethics policy I'm forced to shelve the report. :-/
@aphyr I share your concern. I think the problem is a lack of maintainer familiarity with the tools you used, making it impossible for us to investigate. Can you send an email to [email protected] so we can discuss options, including creating another report for etcd? It's a private mailing group for etcd maintainers.
@aphyr e.g. the "cycle" in #14890 (comment) seems to be completely explained by the different versions of the key 1756 -- it had the same value at two different versions (a version is a point in time), and so it's not actually a cycle. It sounds like you think the tool wouldn't write the same value twice? Maybe there is some other problem that confused the tool, but if there were a time-travel cycle the version numbers would line up, and they don't. Personally I found the lost updates to be a drastically more plausible bug, a priori, but it seems like no discussion focused on that half of the report?
As I have already explained, this test never writes the same value twice. |
Thanks @aphyr for the feedback.
There are knowledge gaps: etcd maintainers do not know much about Jepsen, while Jepsen engineers do not know much about etcd. That is exactly what makes this conversation harder.
Based on my comment #14890 (comment) (note there is a typo; see #14890 (comment)), I do not see any issue on the etcdserver side. So I suspect there might be some issue in jetcd or the Jepsen test cases. But unfortunately, just as I mentioned above, the existing etcd maintainers do not know much about Jepsen. I doubt that jetcd is well maintained by the existing etcd maintainers either.
I am thinking we can probably raise a ticket with CNCF to extend the Jepsen support for some time? Anyway, I would like to get this ticket clearly clarified. cc @ptabor @spzala @serathius
cc @lburgazzoli Ideally we can get an etcd maintainer, a jetcd maintainer, and a Jepsen engineer to work together to get this clarified.
I mean, seeing the same value with different versions is extremely strong evidence that it did get written twice; the other cause I can think of is data corruption, but that would be much more likely to just render it unreadable. Is Jepsen open source? I can't seem to find the source with a quick search. I think the most likely explanation is that Jepsen attempted a write, believed the write failed and therefore that it was free to re-use the value, but etcd actually recorded the write (e.g. maybe a network failure during the response).
Hi, I recently investigated these issues as they relate to Jepsen tests of our database. I discovered that the cyclic information flow anomaly was caused by the automatic retry mechanism of the etcd client.

Describe In Detail

From Jepsen's perspective, the cyclic history is as follows:
Let's denote the first txn as Examining the actual history on etcd (represented in Jepsen history style), we could observe:
This history is valid, but we could see that Further investigation into the Jepsen test history reveals the following events:
The header revision As a result, the etcd client proceeded with a retry at revision 26052, causing

Conclusion

I believe this is an etcd issue, because the client retry is transparent to Jepsen. To address this issue, please consider implementing exactly-once semantics for etcd.
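As a self-contained toy illustration (my own sketch in Go, not etcd code) of the failure mode described above: the harness issues each value exactly once, but a transparent retry after a lost response makes the store apply the same write twice, leaving the same "unique" value at two different revisions — the signature visible in the histories earlier in this thread. The key and value are taken from the example above purely for illustration.

```go
package main

import "fmt"

// version is one revision of a key in a toy MVCC store.
type version struct {
	value int
	rev   int
}

// store is a tiny stand-in for an MVCC key-value server.
type store struct {
	rev      int
	versions map[string][]version
}

func (s *store) put(key string, value int) int {
	s.rev++
	s.versions[key] = append(s.versions[key], version{value, s.rev})
	return s.rev
}

func main() {
	s := &store{versions: map[string][]version{}}

	// The test harness writes the value 23 to key "15063" exactly once.
	// The server applies it, but the response is lost in the network,
	// so client-side retry logic re-sends the identical request.
	s.put("15063", 23) // first attempt: applied, response lost
	s.put("15063", 23) // transparent retry: applied again

	// The harness saw a single successful write of 23, yet the store now
	// holds the same "unique" value at two different revisions.
	for _, v := range s.versions["15063"] {
		fmt.Printf("key 15063 = %d at revision %d\n", v.value, v.rev)
	}
}
```

A checker that assumes each value is written at most once can only explain the second copy as another transaction's write, which is how apparent cycles and lost updates arise from an otherwise valid history.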
OMG, huge thanks @bsbds for finding this and informing us that it might be a false positive!
Thanks @bsbds for the feedback. I did not fully follow the details you provided, but at a high level it seems reasonable, and it aligns with what I analysed previously. The etcd client SDK indeed has retry logic; see below (Line 273 in d6c0127).
A couple of personal thoughts:
I'm fairly sure the false positive @bsbds reported reflects a broken database or test harness, not a problem in Elle.
IMO, most likely it's the latter (test harness) or jetcd. I don't see any etcdserver-side issue so far; there is clear/strong evidence that the same

As mentioned in my previous comment, the etcd client SDK indeed has retry logic in the grpc-go interceptor for unary RPCs (see the code references below). But my understanding is that it's a synchronous invocation: the retrying is transparent to the client, and once the client gets a response (no matter whether it's a success or a failure), the interceptor won't do anything anymore. @dfawley could you please double-confirm this? If confirmed, is there any higher-level retry on the Jepsen side (e.g. retry if timed out)?

Line 273 in 9f58999
etcd/client/v3/retry_interceptor.go, lines 40 to 51 in 9f58999

But again, Jepsen uses the etcd Java client SDK, so we need to analyze jetcd's behaviour. cc @lburgazzoli
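To make the "transparent retry" point concrete, here is a simplified, hypothetical sketch of a gRPC unary client retry interceptor in Go — not the actual code in client/v3/retry_interceptor.go, just the general shape. The key property: if a response is lost after the server has already applied a mutation, the interceptor re-sends the identical request, so a Put or Txn can be applied twice while the caller only ever sees the final response.

```go
package retryexample

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryUnary returns a simplified unary client interceptor that transparently
// retries calls which fail with an Unavailable transport error. The real etcd
// interceptor is considerably more careful, but the retried request body is
// identical either way.
func retryUnary(maxAttempts int) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		var err error
		for attempt := 0; attempt < maxAttempts; attempt++ {
			// Re-invoking re-sends the identical request. If the previous
			// attempt was actually applied by the server but its response was
			// lost, a non-idempotent mutation (Put/Txn) gets applied again.
			err = invoker(ctx, method, req, reply, cc, opts...)
			if err == nil {
				return nil
			}
			if status.Code(err) != codes.Unavailable {
				return err // not a retriable transport error; surface it
			}
		}
		return err
	}
}
```

A client dialed with `grpc.WithUnaryInterceptor(retryUnary(3))` would show exactly this behaviour: the application sees one request and one response, while the server may have applied the mutation more than once.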
It turns out the lost update is not a problem in Elle either, but also a client retry issue. The issue arises because the following command was executed twice by the server.
In the Jepsen history, this would appear as two transactions reading the same value but updating it to different values.
What's the "server"? etcdserver, or part of the Jepsen test?
The etcd server. For the same reason, it got killed before returning the result.
To be clearer, there are two issues in this ticket (see the description of this issue). We kept discussing the " If I understood your comment correctly, you mean that there is a race condition between multiple concurrent requests, which was caused by the client retry mechanism? Theoretically, it's possible (see the example below). But I don't think it's an etcdserver issue, nor an etcd client SDK issue. Instead, it's the application's responsibility to take care of it (e.g. avoid overwriting the same key using a TXN similar to optimisticTxn). Let's work with an example/workflow:
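For illustration, here is a hedged sketch of the optimisticTxn-style guard being suggested, written against the Go clientv3 API (the endpoint, key, and value are placeholders, and this is not the test's actual code). The write is wrapped in a transaction that compares the key's mod revision against the one the reader observed, so a concurrent or duplicated write makes the transaction fail instead of silently overwriting.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Read the current value and remember its mod revision.
	get, err := cli.Get(ctx, "277")
	if err != nil {
		log.Fatal(err)
	}
	var modRev int64 // 0 means "key does not exist yet"
	if len(get.Kvs) > 0 {
		modRev = get.Kvs[0].ModRevision
	}

	// Guarded write: only commit if nobody has modified the key since our read.
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("277"), "=", modRev)).
		Then(clientv3.OpPut("277", "4")).
		Commit()
	if err != nil {
		log.Fatal(err)
	}
	if !txn.Succeeded {
		log.Println("lost the race: key 277 was updated concurrently; re-read and retry")
	}
}
```

Because the compare makes the write conditional on the observed state, repeating the transaction after an ambiguous failure is safe: a duplicate attempt simply fails the compare.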
Yes, it's a race condition caused by the client retry. In your example, client 1 sent two requests in total and they both got committed and persisted by the etcdserver. However, only the second response is observed by Jepsen, so from Jepsen's perspective the etcdserver committed a command it never requested. I agree that it's not an etcdserver issue, but the client SDK must disable automatic retries for non-idempotent operations.
As explained in my previous comment #14890 (comment), the retry is transparent to users (e.g. Jepsen). Does Jepsen retry before it gets a response to the previous request? Also, as mentioned in #14890 (comment), we might need to expose an interface to let users decide whether to enable/disable the retry.
I think we have enough evidence to conclude that the problem is related to the Java client and not to the etcdserver side. With that, I think we can close the investigation, leaving follow-ups to separate issues. If there is still any doubt, please feel free to continue the discussion and reopen the issue; however, at the current moment I think we can officially reject the Jepsen report as inaccurate.
Agreed.
The problem might be related to the etcd Java client or to Jepsen. Based on all the discussion so far, I would bet on Jepsen.
It's jetcd. https://jepsen.io/analyses/jetcd-0.8.2
Looks good to me. I will raise a new issue to track the client retry issue.
Read #18424 |
What happened?
In Jepsen tests of etcd 3.5.5 with process crashes, running in five node clusters on Debian Bullseye, the wr workload (which performs transactions which read and write keys without any guards) appears to exhibit both lost update and cyclic information flow.
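For readers unfamiliar with the workload, here is a hedged sketch of roughly what one of these unguarded wr transactions becomes on the wire — written against the Go clientv3 API for illustration, whereas the test itself goes through jetcd: a single Txn with no compares, whose Then branch is the sequence of range and put operations. The endpoint, keys, and value are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// One unguarded wr transaction, e.g. [[:r 15062 _] [:w 15063 23] [:r 15063 _]]:
	// no If() conditions, just a sequence of reads and writes applied atomically.
	resp, err := cli.Txn(context.Background()).
		Then(
			clientv3.OpGet("15062"),
			clientv3.OpPut("15063", "23", clientv3.WithPrevKV()),
			clientv3.OpGet("15063"),
		).
		Commit()
	if err != nil {
		log.Fatal(err)
	}

	// resp.Header.Revision and the per-op responses correspond to the
	// :header and :results fields shown in the :debug output earlier in the thread.
	log.Printf("committed at revision %d with %d op responses",
		resp.Header.Revision, len(resp.Responses))
}
```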
Lost Update
In this test run, we executed the following pair of transactions:
Both of these transactions read key 277 = 1 at revision 3925 and wrote key 277--one set it to 4, the other to 11. This constitutes a lost update: under snapshot isolation (and serializability), at most one of these transactions ought to have committed.
G1c (cyclic information flow)
In this test run, we executed the following pair of transactions.
Process 73 performed a transaction which set key 1754 to 11. That write was read by process 106, which in its transaction set key 1756 to 27. That write of 27 was in turn visible to process 73's transaction. Per the etcd headers for the two transactions, process 73's transaction executed at revision 25932, and observed 1756 at version 15, mod revision 25580. Process 106's txn executed at revision 26052, and observed key 1754's value of 11 at version 19, mod revision 25932. This seems impossible--perhaps a sign of split-brain?
What did you expect to happen?
Neither of these phenomena should be possible under serializability.
How can we reproduce it (as minimally and precisely as possible)?
Check out jepsen.etcd e273986b2248ea00085e398a3bb0a6991d701d20 and run
Violations of strict serializability (e.g. G-single-realtime) are fairly frequent--on my cluster, they take only a few minutes to show up. Violations of serializability (e.g. G1c, lost update) are less common; might take a few hours.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response