Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Avoid deadlock between competing pushers #5710

Merged
merged 2 commits into from
Mar 31, 2016

Conversation

tbg
Copy link
Member

@tbg tbg commented Mar 30, 2016

When two transactions try to push each other, one of them must be able to win
in order to avoid deadlock. This was (seemingly) achieved by, in that order,
using priority, timestamp and transaction ID as criterions.
In particular, when priorities were equal, we chose to let the transaction with
the lower timestamp win based on seniority.
The issue with that lies in the asynchronicity between the active KV client and
the persisted record. While the record is the source of truth, the client may
actually operate at a higher timestamp and will supply that higher timestamp
with its Pushes. On the other hand, when being pushed, the timestamp used is
that of the record, which may be smaller.

Consequently, the following situation could occur:

  • Transactions A and B end up with the same priority, B has the larger ID. This
  • can easily happen if they both lose against a higher-priority transaction[1].
  • Based on their persisted record's timestamps, B should lose: it has the higher
    timestamp.
  • A operates at a timestamp which is higher than that of its persisted record,
  • and also higher than that of B.
  • B operates at its true timestamp (for simplicity).

Consequently, since priorities are equal,

  • A can't push B: it uses its client timestamp, which is higher than that of
    B's record.
  • B can't push A: it uses its client=real timestamp, but that can't win against
    A's persisted record (which is lower than B's).

A simple alternative fix would have been to let the larger timestamp win, but
this is also dangerous (though it should ultimately work): The persisted
timestamp can also be larger than the active timestamp; this commonly happens
when a transaction gets pushed but does not know it yet. We update the
transaction entry after a failed Push, so this scenario would resolve in
practice, but it demonstrates that the lack of synchronicity leads to
surprising situations which are best avoided.

Fixes #5685 (though low throughput still happens, I'll investigate this more).

[1]: Independently from this change, we may want to reevaluate this strategy at
some point. If A at priority 1 and B at priority 2^16 both lose against
priority 2^25, it does not seem fair that both should end up at priority
2^25-1. Instead, it would seem more reasonable that their priority be increased
as a fraction of the difference, for instance bumping A by (2^25-1)/2 and B by
(2^25-2^16)/2, which preserves the relative ordering of A and B but still lets
them approach the priority of their mutual enemy.


This change is Reviewable

@tbg tbg changed the title Txn order storage: Avoid deadlock between competing pushers storage: Avoid deadlock between competing pushers Mar 30, 2016
@vivekmenezes
Copy link
Contributor

Perhaps this is unrelated but using the notion of priority seems wrong.
I've often wondered why we don't replace it with 1. start-timestamp and 2.
user defined priority. I like the use of those two variables because they
better describe what a user cares about. 1. Preventing starvation and 2.
Getting important work done. I know we already have user defined priority.
With this model when a transaction gets pushed, its start time keeps moving
back from the current time and that allows us to schedule it before other
fresher transactions.

Thanks

On Wed, Mar 30, 2016 at 10:15 AM Tobias Schottdorf [email protected]
wrote:

When two transactions try to push each other, one of them must be able to
win
in order to avoid deadlock. This was (seemingly) achieved by, in that
order,
using priority, timestamp and transaction ID as criterions.
In particular, when priorities were equal, we chose to let the transaction
with
the lower timestamp win based on seniority.
The issue with that lies in the asynchronicity between the active KV
client and
the persisted record. While the record is the source of truth, the client
may
actually operate at a higher timestamp and will supply that higher
timestamp
with its Pushes. On the other hand, when being pushed, the timestamp used
is
that of the record, which may be smaller.

Consequently, the following situation could occur:

  • Transactions A and B end up with the same priority, B has the larger
    ID. This
  • can easily happen if they both lose against a higher-priority
    transaction[1].
  • Based on their persisted record's timestamps, B should lose: it has
    the higher timestamp.
  • A operates at a timestamp which is higher than that of its persisted
    record,
  • and also higher than that of B.
  • B operates at its true timestamp (for simplicity).

Consequently, since priorities are equal,

  • A can't push B: it uses its client timestamp, which is higher than
    that of B's record.
  • B can't push A: it uses its real timestamp, which is higher than
    that of B's record.

A simple alternative fix would have been to let the larger timestamp win,
but
this is also dangerous (though it should ultimately work): The persisted
timestamp can also be larger than the active timestamp; this commonly
happens
when a transaction gets pushed but does not know it yet. We update the
transaction entry after a failed Push, so this scenario would resolve in
practice, but it demonstrates that the lack of synchronicity leads to
surprising situations which are best avoided.

Fixes #5685 #5685
(though low throughput still happens, I'll investigate this more).

[1]: Independently from this change, we may want to reevaluate this
strategy at
some point. If A at priority 1 and B at priority 2^16 both lose against
priority 2^25, it does not seem fair that both should end up at priority
2^25-1. Instead, it would seem more reasonable that their priority be
increased
as a fraction of the difference, for instance bumping A by (2^25-1)/2 and
B by
(2^25-2^16)/2, which preserves the relative ordering of A and B but still
lets

them approach the priority of their mutual enemy.

You can view, comment on, or merge this pull request online at:

#5710
Commit Summary

  • storage: Avoid deadlock between competing pushers
  • storage: Stop keeping Pushees alive

File Changes

Patch Links:


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#5710

@tamird
Copy link
Contributor

tamird commented Mar 30, 2016

"timestamp and transaction ID" missing comma?

"* can easily happen if t..." shouldn't have a bullet on this line

"* and also higher than that of B." shouldn't have a bullet on this line

and the first commit message doesn't describe the actual fix =/

:lgtm:


Reviewed 3 of 3 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 6 unresolved discussions.


storage/replica_command.go, line 1090 [r1] (raw file):
interesting style


storage/replica_command.go, line 1091 [r1] (raw file):
this thing might read nicer as a switch


storage/replica_command.go, line 1124 [r1] (raw file):
different tenses? should be "pushed" or "fails to push" below.


storage/replica_command.go, line 996 [r2] (raw file):
it would be good to include some context in this message (i.e. args or a fragment of it)


storage/replica_command.go, line 1083 [r2] (raw file):
is there any reason we can't init LastHeartbeat to OrigTimestamp when we first write the record? seems really weird for it to be nullable.


util/uuid/uuid.go, line 36 [r1] (raw file):
all this logic is already on Transaction.String() - i think you can remove all this


Comments from the review on Reviewable.io

@tamird
Copy link
Contributor

tamird commented Mar 30, 2016

Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


util/uuid/uuid.go, line 36 [r1] (raw file):
err Transaction.Short()


Comments from the review on Reviewable.io

@bdarnell
Copy link
Contributor

LGTM


Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


storage/replica_command.go, line 1080 [r2] (raw file):
Why name the return values here? In most cases it would be clearer to use a string literal on the return line like return true, "pushee is expired" (especially for cases like the PUSH_TOUCH check, where it's not obvious whether reason is empty or has inherited a value from earlier)


Comments from the review on Reviewable.io

@tbg
Copy link
Member Author

tbg commented Mar 30, 2016

@vivekmenezes maybe. Wanna file the above and include the footnote from my commit message as well?
@tamird: commit message updated.


Review status: 2 of 3 files reviewed at latest revision, 7 unresolved discussions.


storage/replica_command.go, line 1091 [r1] (raw file):
Done. Feels a bit more brittle, but looks better. Preferable overall, I think.


storage/replica_command.go, line 1124 [r1] (raw file):
Changed to simple past.


storage/replica_command.go, line 996 [r2] (raw file):
Is that useful? After all, the caller gets the error, so it should log it. This path amounts to a bug, and Pushes only happen internally, so it should be trivial to track down should it ever happen.


storage/replica_command.go, line 1083 [r2] (raw file):
I thought about that too, but that's a migration, so I won't touch it in this PR (which doesn't need a migration path).


util/uuid/uuid.go, line 36 [r1] (raw file):
I have no transaction where I call this (only a TxnMeta), and I want to make sure that printing an empty UUID doesn't crash (which it would before this change).


Comments from the review on Reviewable.io

@tbg
Copy link
Member Author

tbg commented Mar 30, 2016

Review status: 2 of 3 files reviewed at latest revision, 6 unresolved discussions.


storage/replica_command.go, line 1080 [r2] (raw file):
This is now a switch.


Comments from the review on Reviewable.io

@tbg
Copy link
Member Author

tbg commented Mar 30, 2016

Review status: 2 of 3 files reviewed at latest revision, 4 unresolved discussions.


storage/replica_command.go, line 1083 [r2] (raw file):
Ah, it's probably not a migration. Anyway, best done separately.


Comments from the review on Reviewable.io

@tamird
Copy link
Contributor

tamird commented Mar 30, 2016

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 3 unresolved discussions.


storage/replica_command.go, line 996 [r2] (raw file):
This error looks like it's on the way out to the RPC system, where things get lost and become hard to match back to a call. I don't thinking logging something self-contained would hurt.


storage/replica_command.go, line 1083 [r2] (raw file):
I don't think it's a migration - nullable/not nullable doesn't affect serialization.


util/uuid/uuid.go, line 36 [r1] (raw file):
Transaction embeds TxnMeta, so you should move Short to TxnMeta instead of adding it here. As for printing a nil UUID - this is also handled in Short().


Comments from the review on Reviewable.io

@tamird
Copy link
Contributor

tamird commented Mar 30, 2016

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_command.go, line 996 [r2] (raw file):
looks like this is done elsewhere as well. fine as is.


Comments from the review on Reviewable.io

@tbg
Copy link
Member Author

tbg commented Mar 30, 2016

Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


util/uuid/uuid.go, line 36 [r1] (raw file):
Did some moving around (tldr: only use uuid.Short()), see new commit.


Comments from the review on Reviewable.io

@vivekmenezes
Copy link
Contributor

#5727

@tamird
Copy link
Contributor

tamird commented Mar 30, 2016

Reviewed 1 of 1 files at r4, 6 of 6 files at r5.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from the review on Reviewable.io

@tamird
Copy link
Contributor

tamird commented Mar 30, 2016

Reviewed 6 of 6 files at r6, 6 of 6 files at r7.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from the review on Reviewable.io

tbg added 2 commits March 30, 2016 21:47
When two transactions try to push each other, one of them must be able to win
in order to avoid deadlock. This was (seemingly) achieved by, in that order,
using priority, timestamp, and transaction ID as criterions.
In particular, when priorities were equal, we chose to let the transaction with
the lower timestamp win based on seniority.
The issue with that lies in the asynchronicity between the active KV client and
the persisted record. While the record is the source of truth, the client may
actually operate at a higher timestamp and will supply that higher timestamp
with its Pushes. On the other hand, when being pushed, the timestamp used is
that of the record, which may be smaller.

Consequently, the following situation could occur:

* Transactions A and B end up with the same priority, B has the larger ID. This
can easily happen if they both lose against a higher-priority transaction[1].
* Based on their persisted record's timestamps, B should lose: it has the higher
timestamp.
* A operates at a timestamp which is higher than that of its persisted record,
and also higher than that of B.
* B operates at its true timestamp (for simplicity).

Consequently, since priorities are equal,
* A can't push B: it uses its client timestamp, which is higher than that of
B's record.
* B can't push A: it uses its client=real timestamp, but that can't win against
A's persisted record (which is lower than B's).

This changes fixes this by removing the timestamp from the equation. The larger
priority wins, and the greater transaction ID otherwise. There is priority skew
as well, but we have already broken this by letting the Txn update from its
persisted state when it fails to Push in a prior change.

A simple alternative fix would have been to let the larger timestamp win, but
this is also dangerous (though it should ultimately work): The persisted
timestamp can also be larger than the active timestamp; this commonly happens
when a transaction gets pushed but does not know it yet. We update the
transaction entry after a failed Push, so this scenario would resolve in
practice, but it demonstrates that the lack of synchronicity leads to
surprising situations which are best avoided.

Fixes cockroachdb#5685 (though low throughput still happens, I'll investigate this more).
Also contains a drive-by fix: Use OrigTimestamp as the basis of the heartbeat
timeout computation. An abandoned transaction which gets pushed continuously
could otherwise be kept alive erroneously (since its timestamp would keep
increasing). The fact that this timestamp was persisted to disk exacerbated
this: the GC queue would not be able to tell that this was an abandoned record
as well. This has not happened in practice.

[1]: Independently from this change, we may want to reevaluate this strategy at
some point. If A at priority 1 and B at priority 2^16 both lose against
priority 2^25, it does not seem fair that both should end up at priority
2^25-1. Instead, it would seem more reasonable that their priority be increased
as a fraction of the difference, for instance bumping A by (2^25-1)/2 and B by
(2^25-2^16)/2, which preserves the relative ordering of A and B but still lets
them approach the priority of their mutual enemy.
Removed txn.Short() (for uuid.Short()) and txn.GetMeta().
@tbg tbg merged commit 78ec639 into cockroachdb:master Mar 31, 2016
@tbg tbg deleted the txn_order branch March 31, 2016 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stability: photos ops/sec drops to and stays at 0
4 participants