fix: Fix deadlock in the transaction code. #956

Merged on Mar 17, 2023 (1 commit)

romange (Collaborator) commented Mar 17, 2023

The deadlock happened during the BRPOP flow, where we access shard_data.local_data from both the coordinator and shard threads. Originally, shard_data.local_data was not designed for concurrent access, so I used the ARMED bit to deduplicate callback runs for each shard. The problem is that with BRPOP, ExecuteAsync could apply "|= ARMED" while, in parallel, NotifySuspended applied "|= AWAKED" in the shard thread. Both are non-atomic read-modify-write operations on the same mask, so they could corrupt each other's updates.

Therefore, I have now completely separated the shard-local local_data mask from the is_armed boolean. Moreover, since we now use atomics for is_armed, I increased the PerShardData size to 64 bytes to avoid false sharing between PerShardData objects.
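
For readers without the diff at hand, here is a minimal sketch of the separation described above. It is not the actual Dragonfly code: `PerShardDataSketch`, `local_mask`, `kAwaked`, `Arm`, `ConsumeArmed`, `MarkAwaked` and `DemoArmOnce` are hypothetical names, and the memory orderings are one reasonable choice rather than what the patch necessarily uses. The idea is that the coordinator only touches an atomic flag, the shard thread alone owns the plain mask, and each entry is aligned to 64 bytes (assumed here to be the cache-line size).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-shard state; names and layout are
// illustrative only, not Dragonfly's actual definitions.
struct alignas(64) PerShardDataSketch {
  // Modified only by the owning shard thread, so a plain |= is safe.
  uint16_t local_mask = 0;

  // Written by the coordinator thread, consumed by the shard thread.
  std::atomic_bool is_armed{false};
};

// alignas(64) rounds the size up to 64 bytes, so adjacent array elements
// never share a 64-byte line.
static_assert(sizeof(PerShardDataSketch) == 64, "one 64-byte slot per shard");

constexpr uint16_t kAwaked = 1 << 0;  // hypothetical flag value

// Coordinator side: arm the shard without touching local_mask.
inline void Arm(PerShardDataSketch& sd) {
  sd.is_armed.store(true, std::memory_order_release);
}

// Shard side: returns true for exactly one call per arming, which is what
// deduplicates callback runs.
inline bool ConsumeArmed(PerShardDataSketch& sd) {
  return sd.is_armed.exchange(false, std::memory_order_acq_rel);
}

// Shard side: flag updates stay on the shard thread.
inline void MarkAwaked(PerShardDataSketch& sd) {
  sd.local_mask |= kAwaked;
}

// Sequential walk-through of the intended call order.
inline bool DemoArmOnce() {
  PerShardDataSketch sd;
  Arm(sd);         // coordinator
  MarkAwaked(sd);  // shard thread, e.g. on wake-up
  bool first = ConsumeArmed(sd);
  bool second = ConsumeArmed(sd);
  assert(first && !second);
  return (sd.local_mask & kAwaked) != 0;
}
```

Because `Arm` and `MarkAwaked` now modify different objects, a concurrent call to both can no longer lose either update, which is the failure mode the shared mask had.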

Fixes #945

@romange romange requested a review from dranikpg March 17, 2023 10:36
@romange romange force-pushed the brpop-crash-repro branch 3 times, most recently from f529348 to ff36512 Compare March 17, 2023 15:51
dranikpg previously approved these changes Mar 17, 2023
The deadlock happened during the BRPOP flow, where we access
shard_data.local_data from both the coordinator and shard threads.
Originally, shard_data.local_data was not designed for concurrent access,
and I used the ARMED bit to deduplicate callback runs for each shard.
The problem is that within the BRPOP flow,
ExecuteAsync would apply "|= ARMED" and in parallel NotifySuspended would apply
"|= AWAKED" in the shard thread, and the two R/M/W operations would corrupt each other.

Therefore, I have now completely separated the shard-local local_data mask from the is_armed boolean.
Moreover, since we now use atomics for is_armed, I increased the PerShardData size to 64 bytes
to avoid false sharing between PerShardData objects.

Fixes #945

Signed-off-by: Roman Gershman <[email protected]>
Development

Successfully merging this pull request may close these issues.

brpop does not work as expected