Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Fix light client deadlock #9385

Merged
merged 2 commits into from
Sep 4, 2018
Merged

Conversation

cheme
Copy link
Contributor

@cheme cheme commented Aug 21, 2018

This PR is fixing deadlock for #8918

It avoids some recursive calls on light_sync by making state check optional for Informant.

The current behavior is to display the information when informant checks if block is major version.
This change a bit the informant behavior, but not on most cases.

To remember where and how this kind of deadlock are likely to happen (not seen with Parkinglot deadlock detection because it uses std condvar), I am adding a description of the deadlock.
Also, for the reviewers there may be better solution than modifying the informant.

Thread1

  • ethcore/sync/light_sync/mod.rs

A call to the light handler through any Io (having a loop of rpc query running on like client makes the dead lock way more likely).
At the end of those calls we systematically call maintain_sync method.

Here maintain_sync locks state (it is the deadlock cause), with a write purpose

maintain_sync -> begin_search with the state locked open

begin_search -> lightcliennt flush_queue method

  • ethcore/light/src/client/mod.rs

flush_queue -> flush on queue (HeaderQueue aka VerificationQueue of headers)

  • ethcore/src/verification/queue/mod.rs

Condition there is some unverified or verifying content

flush wait on a condvar until the queue is empty. The only way to unlock condvar is that worker is empty and unlock it (so thread 2 is Verification worker).

Thread2

A verification worker at the end of a verify loop (new block).

  • ethcore/src/verification/queue/mod.rs

thread loops on verify method.

End of loop condition is_ready -> Import the block immediately

calls set_sync on QueueSignal which send a BlockVerified ClientIoMessage in inner channel (IoChannel of ClientIoMessage) using send_sync

  • util/io/src/service_mio.rs

IoChannel send_sync method calls all handlers with message method; one of the handlers is ImportBlocks IoHandler (with a single inner Client service field)

  • ethcore/light/src/client/service.rs

message trigger inner method import_verified

  • core/light/src/client/mod.rs

import_verified at the very end notify the listeners of a new_headers, one of the listeners is Informant listener method

  • parity/informant.rs

newHeaders run up to call to is_major_importing on its target (again clinet)

  • ethcore/sync/src/light_sync/mod.rs

Here is_major_importing tries to get state lock (read purpose only) but cannot because of previous state lock, thus deadlock

@parity-cla-bot
Copy link

It looks like @cheme signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@cheme cheme requested a review from andresilva August 21, 2018 08:08
@cheme cheme added the A0-pleasereview 🤓 Pull request needs code review. label Aug 21, 2018
@andresilva andresilva added the M4-core ⛓ Core client code / Rust. label Aug 21, 2018
Copy link
Contributor

@andresilva andresilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job on finding this. For the Informant case it's ok to just return false from is_major_importing since it will possibly just delay the printing of the import message.

But the underlying issue is still present and could happen again in the future through other listeners. I think we should either have more fine-grained locking on state, or make sure the calls to the various ChainNotify subscribers are handled outside of the state lock (this doesn't seem too easy to do).

I'm not very familiar with this code so it's possible my suggestions don't make much sense. I'm also OK with merging this right now since it's an easy fix.

@@ -644,7 +644,14 @@ pub trait SyncInfo {
fn start_block(&self) -> u64;

/// Whether major sync is underway.
fn is_major_importing(&self) -> bool;
#[inline]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rustc probably handles this already. I don't have a strong opinion on it though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me neither, I am just used to add those (I never really checked). I believe there is not a lot in parity code so I will remove it (it feels reasonable to rely on the compiler :)


/// Whether major sync is underway.
/// When not waiting on the state we return false by default
fn is_major_importing_do_wait(&self, wait: bool) -> bool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to add this method to the trait since it exposes an implementation detail of LightSync. Just adding it to LightSync should be fine since that's also the type we use in Informant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not like to add this method in the trait, but in fact Informant (the light one) uses the trait, seems to be a design choice to only have read_only method accessible to the informant (alas read_only does not mean no deadlock). I will rename the trait method to make it a bit less technical.

}

/// Whether major sync is underway.
/// When not waiting on the state we return false by default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When acquiring the state lock would block we return false.?

@cheme
Copy link
Contributor Author

cheme commented Aug 21, 2018

I concur with your analysis, the underlying issue is still present, it is not easy to get a global view over all the lock used; especially with listeners and io_handlers (the sync nature of their calls makes it very likely to trigger others deadlock with future change). Ideally we should ensure there is no circular reference, and maybe use asynch queue to avoid some. Issue like #9114 (circular reference) is definitely present in the light client at some places.

@5chdn 5chdn added this to the 2.1 milestone Aug 23, 2018
@andresilva andresilva merged commit c1aed4a into openethereum:master Sep 4, 2018
@andresilva andresilva added A8-looksgood 🦄 Pull request is reviewed well. B0-patchthis 🕷 and removed A0-pleasereview 🤓 Pull request needs code review. labels Sep 4, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A8-looksgood 🦄 Pull request is reviewed well. M4-core ⛓ Core client code / Rust.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants