-
Notifications
You must be signed in to change notification settings - Fork 629
[CBR-424] Fix deadlock triggered by update/shutdown #3731
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the requested comment, LGTM
@@ -335,7 +336,20 @@ streamBlocks logTrace smM logic streamWindow enqueue nodeId tipHeader checkpoint | |||
processBlocks batchSize n' (block : blocks) blockChan | |||
|
|||
writeStreamEnd :: Conc.TBQueue StreamEntry -> IO () | |||
writeStreamEnd blockChan = atomically $ Conc.writeTBQueue blockChan StreamEnd | |||
writeStreamEnd blockChan = writeBlock 1024 blockChan StreamEnd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just add a comment that a reader might stop and the queue might fill so we are dealing with this case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting bug. I wonder, why can't GHC detect the deadlock and kill with a blocked idefinitely on transaction exception? Do we still keep all of our mutable state things in a reader context? I'll bet if we didn't, GHC would detect the deadlock.
writeStreamEnd blockChan = writeBlock 1024 blockChan StreamEnd | ||
|
||
writeBlock :: Int -> Conc.TBQueue StreamEntry -> StreamEntry -> IO () | ||
writeBlock delay _ _ | delay >= 4000000 = do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little opaque. Why 4000000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4000000 corresponds to a 4 seconds delay, we need a limit for when we give up, and the time we sleep doubles after every time we try so a 4 second limit seemed reasonable.
@@ -348,7 +362,7 @@ streamBlocks logTrace smM logic streamWindow enqueue nodeId tipHeader checkpoint | |||
requestBlocks :: Conc.TVar Bool -> Conc.TBQueue StreamEntry -> IO (Conc.TVar (OQ.PacketStatus ())) | |||
requestBlocks fallBack blockChan = do | |||
convMap <- enqueue (MsgRequestBlocks (S.singleton nodeId)) | |||
(\_ _ -> (Conversation $ \it -> requestBlocksConversation blockChan it `finally` writeStreamEnd blockChan) :| | |||
(\_ _ -> (Conversation $ \it -> requestBlocksConversation blockChan it `onException` writeStreamEnd blockChan) :| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's essential to use onException
rather than finally
, then we should put a comment explaining it. Why can requestBatch
still use finally
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requestBatch
can get away with finally
because the only thing it will write is StreamEnd
for which there will always be room. I will add a comment.
A new update will cause the block retrival worker to exit which means that the streaming thread could end up blocking for ever when attempting to write to the full block queue. Update so that we limit the time that we the streaming thread will wait on the block queue incase of an exception.
3270849
to
b319a2a
Compare
A new update will cause the block retrival worker to exit
which means that the streaming thread could end up blocking
for ever when attempting to write to the full block queue.
Update so that we limit the time that we the streaming thread will
wait on the block queue incase of an exception.
Linked issue
https://iohk.myjetbrains.com/youtrack/issue/CBR-424
Type of change
Developer checklist
Testing checklist
QA Steps
How to reproduce the fault, and test the fix is provided in the linked issue above. The fix is only verifiable with
--legacy-wallet
since the new wallet doesn't handle update/shutdown.Screenshots (if available)