Possible aborted reads with process crashes #1072

aphyr · 2022-06-14T13:35:32Z

Versions

etcd: 3.5.3
jetcd: 0.7.1
java: openjdk version "17.0.3" 2022-04-19 (Debian)

Describe the bug

I've got a Jepsen test case for etcd which appears to show something like an aborted read in response to process crashes in five-node clusters running on Debian stable. A jetcd client submits a transaction with a modified-revision compare clause on some number of keys, and a success branch which performs some writes. Occasionally etcd returns a TxnResponse which (per Wireshark) does not contain a succeeded field. When we ask jetcd for TxnResponse.isSucceeded, it returns false; a reasonable client would assume that this transaction did not execute its success branch. However, the effects of the success branch are visible to later reads. If this occurred in an SQL database (and we treated succeeded for a txn with only a success branch as meaning commit/abort), I'd be inclined to call this an aborted read.

I've filed this in detail on the main etcd repo, but they've repeatedly informed me this must be a client issue--either something in jetcd or in the Jepsen test itself. I'd be delighted to find out this is the case, but based on the Wireshark disassembly, I can't see how this could be the client's fault: at the wire level, I can't find any way to distinguish these "not-succeeded-but-actually-succeeded" transactions from "not-succeeded-and-actually-not-succeeded" ones. I was hoping you might be able to share some insight!
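To illustrate why a response without a succeeded field reads as false on the client, here is a minimal sketch of a proto3-style decoder in plain Java. It is not jetcd's or protobuf-java's actual code. The class SucceededFieldProbe and its methods are hypothetical names, and the field number 2 for succeeded is an assumption about etcd's TxnResponse message. The point is that proto3 does not serialize a bool whose value is false, so "field absent" and "field false" are the same bytes, and a decoder has no basis for telling the two kinds of transaction apart.

```java
import java.util.Optional;

/** Hypothetical helper: scans a serialized protobuf message for one bool field. */
final class SucceededFieldProbe {
    // Assumption: `succeeded` is field number 2 in etcd's TxnResponse.
    static final int SUCCEEDED_FIELD = 2;

    /** Returns the bool stored in `fieldNumber` if it is on the wire, else empty. */
    static Optional<Boolean> readBoolField(byte[] msg, int fieldNumber) {
        int[] pos = {0};
        Optional<Boolean> result = Optional.empty();
        while (pos[0] < msg.length) {
            long tag = readVarint(msg, pos);
            int field = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            switch (wireType) {
                case 0: { // varint: bools are encoded this way
                    long value = readVarint(msg, pos);
                    if (field == fieldNumber) {
                        result = Optional.of(value != 0);
                    }
                    break;
                }
                case 1: // fixed 64-bit: skip
                    pos[0] += 8;
                    break;
                case 2: { // length-delimited (nested messages, strings): skip
                    long length = readVarint(msg, pos);
                    pos[0] += (int) length;
                    break;
                }
                case 5: // fixed 32-bit: skip
                    pos[0] += 4;
                    break;
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return result;
    }

    /** Decodes one base-128 varint starting at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] msg, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            if (pos[0] >= msg.length || shift > 63) {
                throw new IllegalArgumentException("malformed varint");
            }
            byte b = msg[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    /** What a proto3 client reports: an absent bool field reads as false. */
    static boolean isSucceeded(byte[] msg) {
        return readBoolField(msg, SUCCEEDED_FIELD).orElse(false);
    }
}
```

Under these assumptions, a response consisting only of a header (field 1) decodes to isSucceeded = false, exactly as a response that explicitly failed its compare would.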

To Reproduce

Clone https://github.com/jepsen-io/etcd at a1bf380a1c09d62bf6bf2e7b97bd02a35902ed36, and run:

lein run test-all -w append --concurrency 2n --time-limit 1000 --rate 1000 --test-count 5 --nemesis kill

Expected behavior

I expect that transactions which return TxnResponse.isSucceeded() = false would not, in fact, appear to execute their success branches.
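The expectation above can be phrased as a check over a recorded history. The following is a simplified sketch, not the checker the Jepsen test actually uses. The class AbortedReadCheck and its Txn type are hypothetical, and it assumes that every appended value is unique across the whole history, so a value seen in a read can be traced back to exactly one transaction. It also considers only transactions that definitely returned succeeded = false; operations with indeterminate outcomes, such as timeouts, should be left out of the input.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical check: finds values read back from transactions that reported failure. */
final class AbortedReadCheck {
    /** One completed transaction as the client observed it. */
    static final class Txn {
        final boolean succeeded;      // what TxnResponse.isSucceeded() returned
        final List<Integer> appended; // values its success branch would write

        Txn(boolean succeeded, List<Integer> appended) {
            this.succeeded = succeeded;
            this.appended = appended;
        }
    }

    /**
     * Returns every value that appears in some observed read even though the
     * transaction that would have written it reported succeeded = false.
     */
    static Set<Integer> abortedReads(List<Txn> txns, List<List<Integer>> observedReads) {
        Set<Integer> writesOfFailedTxns = new HashSet<>();
        for (Txn txn : txns) {
            if (!txn.succeeded) {
                writesOfFailedTxns.addAll(txn.appended);
            }
        }
        Set<Integer> aborted = new TreeSet<>();
        for (List<Integer> read : observedReads) {
            for (Integer value : read) {
                if (writesOfFailedTxns.contains(value)) {
                    aborted.add(value);
                }
            }
        }
        return aborted;
    }
}
```

An empty result means the history is consistent with the expectation; any value in the result is a write from a "failed" transaction that a later read nonetheless observed.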

Additional context


lburgazzoli · 2022-06-14T13:50:07Z

Honestly, it's been a long time since I looked at the transaction code, so it is quite difficult for me to give any hints at this stage. I'd be happy to include any patch in case it is an issue in jetcd.

github-actions · 2022-08-14T01:58:36Z

This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Aug 14, 2022

github-actions bot closed this as completed Aug 22, 2022
