-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Conversation
It looks like @cheme signed our Contributor License Agreement. 👍 Many thanks, Parity Technologies CLA Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job on finding this. For the Informant case it's ok to just return false
from is_major_importing
since it will possibly just delay the printing of the import message.
But the underlying issue is still present and could happen again in the future through other listeners. I think we should either have more fine-grained locking on state, or make sure the calls to the various ChainNotify
subscribers are handled outside of the state lock (this doesn't seem too easy to do).
I'm not very familiar with this code so it's possible my suggestions don't make much sense. I'm also OK with merging this right now since it's an easy fix.
ethcore/sync/src/light_sync/mod.rs
Outdated
@@ -644,7 +644,14 @@ pub trait SyncInfo { | |||
fn start_block(&self) -> u64; | |||
|
|||
/// Whether major sync is underway. | |||
fn is_major_importing(&self) -> bool; | |||
#[inline] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rustc
probably handles this already. I don't have a strong opinion on it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me neither, I am just used to add those (I never really checked). I believe there is not a lot in parity code so I will remove it (it feels reasonable to rely on the compiler :)
ethcore/sync/src/light_sync/mod.rs
Outdated
|
||
/// Whether major sync is underway. | ||
/// When not waiting on the state we return false by default | ||
fn is_major_importing_do_wait(&self, wait: bool) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want to add this method to the trait since it exposes an implementation detail of LightSync
. Just adding it to LightSync
should be fine since that's also the type we use in Informant
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not like to add this method in the trait, but in fact Informant (the light one) uses the trait, seems to be a design choice to only have read_only method accessible to the informant (alas read_only does not mean no deadlock). I will rename the trait method to make it a bit less technical.
ethcore/sync/src/light_sync/mod.rs
Outdated
} | ||
|
||
/// Whether major sync is underway. | ||
/// When not waiting on the state we return false by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When acquiring the state lock would block we return false.
?
I concur with your analysis, the underlying issue is still present, it is not easy to get a global view over all the lock used; especially with listeners and io_handlers (the sync nature of their calls makes it very likely to trigger others deadlock with future change). Ideally we should ensure there is no circular reference, and maybe use asynch queue to avoid some. Issue like #9114 (circular reference) is definitely present in the light client at some places. |
This PR is fixing deadlock for #8918
It avoids some recursive calls on light_sync by making state check optional for Informant.
The current behavior is to display the information when informant checks if block is major version.
This change a bit the informant behavior, but not on most cases.
To remember where and how this kind of deadlock are likely to happen (not seen with Parkinglot deadlock detection because it uses std condvar), I am adding a description of the deadlock.
Also, for the reviewers there may be better solution than modifying the informant.
Thread1
A call to the light handler through any Io (having a loop of rpc query running on like client makes the dead lock way more likely).
At the end of those calls we systematically call
maintain_sync
method.Here maintain_sync locks
state
(it is the deadlock cause), with a write purposemaintain_sync
->begin_search
with the state locked openbegin_search
-> lightclienntflush_queue
methodflush_queue
->flush
on queue (HeaderQueue aka VerificationQueue of headers)Condition there is some unverified or verifying content
flush
wait on a condvar until the queue is empty. The only way to unlock condvar is that worker is empty and unlock it (so thread 2 is Verification worker).Thread2
A verification worker at the end of a verify loop (new block).
thread loops on
verify
method.End of loop condition is_ready -> Import the block immediately
calls
set_sync
on QueueSignal which send a BlockVerified ClientIoMessage in inner channel (IoChannel of ClientIoMessage) usingsend_sync
IoChannel
send_sync
method calls all handlers withmessage
method; one of the handlers is ImportBlocks IoHandler (with a single inner Client service field)message
trigger inner methodimport_verified
import_verified
at the very end notify the listeners of a new_headers, one of the listeners is Informantlistener
methodnewHeaders
run up to call tois_major_importing
on its target (again clinet)Here
is_major_importing
tries to get state lock (read purpose only) but cannot because of previous state lock, thus deadlock