This repository has been archived by the owner on Aug 18, 2020. It is now read-only.

[CBR-424] Fix deadlock triggered by update/shutdown #3731

Merged
merged 1 commit into develop from karknu/cbr-424_update on Oct 11, 2018

Conversation

Contributor

@karknu karknu commented Oct 9, 2018

A new update will cause the block retrieval worker to exit,
which means that the streaming thread could end up blocking
forever when attempting to write to the full block queue.

Update so that we limit the time that the streaming thread will
wait on the block queue in case of an exception.
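For illustration, a minimal self-contained sketch of that failure mode (not code from this PR; the queue size, element type, and messages are made up). A bounded TBQueue is drained by a worker that exits early, leaving the writer stuck in writeTBQueue. In this toy the GHC RTS will usually detect the stall and throw BlockedIndefinitelyOnSTM, whereas in the node the thread simply hung (see the review discussion below).

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Exception (SomeException, try)

main :: IO ()
main = do
    q <- newTBQueueIO 2                  -- small bounded "block queue"
    -- Stand-in for the block retrieval worker: consume one entry,
    -- then exit, the way the worker does when an update arrives.
    _ <- forkIO (atomically (readTBQueue q) >> pure ())
    -- Stand-in for the streaming thread: fills the queue and then
    -- blocks in writeTBQueue with nobody left to drain it.
    r <- try (mapM_ (atomically . writeTBQueue q) [1 .. 10 :: Int])
             :: IO (Either SomeException ())
    either (\e -> putStrLn ("writer stopped: " ++ show e)) pure r
```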

Linked issue

https://iohk.myjetbrains.com/youtrack/issue/CBR-424

Type of change

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🛠 New feature (non-breaking change which adds functionality)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🏭 Refactoring that does not change existing functionality but does improve things like code readability, structure etc
  • 🔨 New or improved tests for existing code
  • ⛑ git-flow chore (backport, hotfix, etc)

Developer checklist

  • I have read the style guide document, and my code follows the code style of this project.
  • If my code deals with exceptions, it follows the guidelines.
  • [~] I have updated any documentation accordingly, if needed. Documentation changes can be reflected by opening a PR on cardanodocs.com, amending the inline Haddock comments, updating any relevant README file, or editing one of the documents listed in the docs directory.
  • [~] CHANGELOG entry has been added and is linked to the correct PR on GitHub.

Testing checklist

  • I have added tests to cover my changes.
  • All new and existing tests passed.

QA Steps

Steps to reproduce the fault and to verify the fix are provided in the linked issue above. The fix is only verifiable with --legacy-wallet, since the new wallet doesn't handle update/shutdown.

Screenshots (if available)

@karknu karknu requested review from avieth, coot and cleverca22 October 9, 2018 15:12
Contributor

@coot coot left a comment


Other than the requested comment, LGTM

@@ -335,7 +336,20 @@ streamBlocks logTrace smM logic streamWindow enqueue nodeId tipHeader checkpoint
     processBlocks batchSize n' (block : blocks) blockChan

     writeStreamEnd :: Conc.TBQueue StreamEntry -> IO ()
-    writeStreamEnd blockChan = atomically $ Conc.writeTBQueue blockChan StreamEnd
+    writeStreamEnd blockChan = writeBlock 1024 blockChan StreamEnd
Contributor

Maybe just add a comment noting that a reader might stop and the queue might fill up, which is the case we are dealing with here.

Contributor Author


I will add a comment.

Contributor

@avieth avieth left a comment


Interesting bug. I wonder, why can't GHC detect the deadlock and kill the thread with a blocked-indefinitely-on-STM-transaction exception? Do we still keep all of our mutable state in a reader context? I'll bet if we didn't, GHC would detect the deadlock.
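For anyone following along, a sketch of (my understanding of) the RTS heuristic being discussed; this is illustrative code, not from the PR. GHC raises BlockedIndefinitelyOnSTM only when the TVars a blocked thread is waiting on are unreachable from every other live thread, so a long-lived reference elsewhere, such as state kept in a reader environment, suppresses the detection:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

-- Detected: the full queue is reachable only from the blocked
-- thread, so the RTS throws BlockedIndefinitelyOnSTM.
detected :: IO ()
detected = do
    q <- newTBQueueIO 1
    atomically (writeTBQueue q ())   -- queue is now full
    atomically (writeTBQueue q ())   -- blocks; RTS raises the exception

-- Not detected: another live thread still holds a reference to the
-- queue (compare mutable state parked in a reader environment), so
-- the RTS must assume a reader may yet drain it and stays silent.
undetected :: IO ()
undetected = do
    q <- newTBQueueIO 1
    atomically (writeTBQueue q ())
    _ <- forkIO (threadDelay maxBound >> atomically (readTBQueue q))
    atomically (writeTBQueue q ())   -- blocks forever, no exception

main :: IO ()
main = detected   -- swap in 'undetected' to see the silent hang
```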

+    writeStreamEnd blockChan = writeBlock 1024 blockChan StreamEnd
+
+    writeBlock :: Int -> Conc.TBQueue StreamEntry -> StreamEntry -> IO ()
+    writeBlock delay _ _ | delay >= 4000000 = do
Contributor

This is a little opaque. Why 4000000?

Contributor Author

@karknu karknu Oct 10, 2018


4000000 corresponds to a 4-second delay. We need a limit for when we give up, and since the time we sleep doubles after every attempt, a 4-second limit seemed reasonable.
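For readers of this thread, a minimal sketch of the scheme being described, reconstructed from the diff and this comment rather than copied from the PR (the timeout action here is a placeholder putStrLn; the real code would log or raise an error): start at 1024 µs, double after every full-queue retry, and give up once the delay parameter reaches 4000000 µs.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM

-- Simplified stand-in for the real StreamEntry type.
data StreamEntry = StreamBlock Int | StreamEnd

-- Try a non-blocking write; if the queue is full, sleep for
-- 'delay' microseconds and retry with the delay doubled, giving
-- up once the delay parameter reaches 4 seconds.
writeBlock :: Int -> TBQueue StreamEntry -> StreamEntry -> IO ()
writeBlock delay q entry
    | delay >= 4000000 = putStrLn "writeBlock: giving up after ~4s"
    | otherwise = do
        written <- atomically $ do
            full <- isFullTBQueue q
            if full
                then pure False
                else writeTBQueue q entry >> pure True
        if written
            then pure ()
            else threadDelay delay >> writeBlock (2 * delay) q entry
```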

@@ -348,7 +362,7 @@ streamBlocks logTrace smM logic streamWindow enqueue nodeId tipHeader checkpoint
     requestBlocks :: Conc.TVar Bool -> Conc.TBQueue StreamEntry -> IO (Conc.TVar (OQ.PacketStatus ()))
     requestBlocks fallBack blockChan = do
         convMap <- enqueue (MsgRequestBlocks (S.singleton nodeId))
-            (\_ _ -> (Conversation $ \it -> requestBlocksConversation blockChan it `finally` writeStreamEnd blockChan) :|
+            (\_ _ -> (Conversation $ \it -> requestBlocksConversation blockChan it `onException` writeStreamEnd blockChan) :|
Contributor

If it's essential to use onException rather than finally, then we should put a comment explaining it. Why can requestBatch still use finally?

Contributor Author


requestBatch can get away with finally because the only thing it will write is StreamEnd, for which there will always be room. I will add a comment.
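To spell out the distinction behind this exchange with a generic example (standard Control.Exception semantics, not the PR's code): finally runs its handler on both normal and exceptional exit, while onException runs it only when an exception propagates, then rethrows.

```haskell
import Control.Exception

main :: IO ()
main = do
    -- finally: the handler runs on success and on failure alike.
    putStrLn "success path" `finally` putStrLn "finally: always runs"
    -- onException: the handler runs only when the body throws,
    -- and the exception is rethrown afterwards.
    r <- try (throwIO (ErrorCall "boom")
                 `onException` putStrLn "onException: runs on failure only")
    print (r :: Either ErrorCall ())
```

Read with karknu's reply above: requestBatch's handler only ever writes the terminal StreamEnd, for which there is always room, so its unconditional finally cannot wedge.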

@karknu karknu force-pushed the karknu/cbr-424_update branch from 3270849 to b319a2a on October 11, 2018 08:48
@karknu karknu merged commit 69cb520 into develop Oct 11, 2018
@karknu karknu deleted the karknu/cbr-424_update branch October 11, 2018 11:04