This repository has been archived by the owner on Aug 18, 2020. It is now read-only.

[CBR-424] Fix deadlock triggered by update/shutdown #3731

Merged
merged 1 commit into develop from karknu/cbr-424_update on Oct 11, 2018

Conversation

Contributor

@karknu karknu commented Oct 9, 2018

A new update will cause the block retrieval worker to exit,
which means that the streaming thread could end up blocking
forever when attempting to write to the full block queue.

Update so that we limit the time that the streaming thread will
wait on the block queue in case of an exception.
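For illustration, a minimal self-contained sketch of that failure mode (not code from this PR; the queue size, element type, and messages are made up). A bounded TBQueue is drained by a worker that exits early, leaving the writer stuck in writeTBQueue. In this toy the GHC RTS will usually detect the stall and throw BlockedIndefinitelyOnSTM, whereas in the node the thread simply hung (see the review discussion below).

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Exception (SomeException, try)

main :: IO ()
main = do
    q <- newTBQueueIO 2                  -- small bounded "block queue"
    -- Stand-in for the block retrieval worker: consume one entry,
    -- then exit, the way the worker does when an update arrives.
    _ <- forkIO (atomically (readTBQueue q) >> pure ())
    -- Stand-in for the streaming thread: fills the queue and then
    -- blocks in writeTBQueue with nobody left to drain it.
    r <- try (mapM_ (atomically . writeTBQueue q) [1 .. 10 :: Int])
             :: IO (Either SomeException ())
    either (\e -> putStrLn ("writer stopped: " ++ show e)) pure r
```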

Linked issue

https://iohk.myjetbrains.com/youtrack/issue/CBR-424

Type of change

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🛠 New feature (non-breaking change which adds functionality)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🏭 Refactoring that does not change existing functionality but does improve things like code readability, structure etc
  • 🔨 New or improved tests for existing code
  • ⛑ git-flow chore (backport, hotfix, etc)

Developer checklist

  • I have read the style guide document, and my code follows the code style of this project.
  • If my code deals with exceptions, it follows the guidelines.
  • [~] I have updated any documentation accordingly, if needed. Documentation changes can be reflected by opening a PR on cardanodocs.com, amending the inline Haddock comments, updating any relevant README file, or editing one of the documents listed in the docs directory.
  • [~] CHANGELOG entry has been added and is linked to the correct PR on GitHub.

Testing checklist

  • I have added tests to cover my changes.
  • All new and existing tests passed.

QA Steps

Steps to reproduce the fault and to verify the fix are provided in the linked issue above. The fix is only verifiable with --legacy-wallet, since the new wallet doesn't handle update/shutdown.

Screenshots (if available)

@karknu karknu requested review from avieth, coot and cleverca22 October 9, 2018 15:12
Contributor

@coot coot left a comment


Other than the requested comment, LGTM

@@ -335,7 +336,20 @@ streamBlocks logTrace smM logic streamWindow enqueue nodeId tipHeader checkpoint
     processBlocks batchSize n' (block : blocks) blockChan

     writeStreamEnd :: Conc.TBQueue StreamEntry -> IO ()
-    writeStreamEnd blockChan = atomically $ Conc.writeTBQueue blockChan StreamEnd
+    writeStreamEnd blockChan = writeBlock 1024 blockChan StreamEnd
Contributor

Maybe just add a comment noting that a reader might stop and the queue might fill up, which is the case we are dealing with here.

Contributor Author


I will add a comment.

Contributor

@avieth avieth left a comment


Interesting bug. I wonder, why can't GHC detect the deadlock and kill the thread with a blocked-indefinitely-on-STM-transaction exception? Do we still keep all of our mutable state in a reader context? I'll bet if we didn't, GHC would detect the deadlock.
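For anyone following along, a sketch of (my understanding of) the RTS heuristic being discussed; this is illustrative code, not from the PR. GHC raises BlockedIndefinitelyOnSTM only when the TVars a blocked thread is waiting on are unreachable from every other live thread, so a long-lived reference elsewhere, such as state kept in a reader environment, suppresses the detection:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

-- Detected: the full queue is reachable only from the blocked
-- thread, so the RTS throws BlockedIndefinitelyOnSTM.
detected :: IO ()
detected = do
    q <- newTBQueueIO 1
    atomically (writeTBQueue q ())   -- queue is now full
    atomically (writeTBQueue q ())   -- blocks; RTS raises the exception

-- Not detected: another live thread still holds a reference to the
-- queue (compare mutable state parked in a reader environment), so
-- the RTS must assume a reader may yet drain it and stays silent.
undetected :: IO ()
undetected = do
    q <- newTBQueueIO 1
    atomically (writeTBQueue q ())
    _ <- forkIO (threadDelay maxBound >> atomically (readTBQueue q))
    atomically (writeTBQueue q ())   -- blocks forever, no exception

main :: IO ()
main = detected   -- swap in 'undetected' to see the silent hang
```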

+    writeStreamEnd blockChan = writeBlock 1024 blockChan StreamEnd
+
+    writeBlock :: Int -> Conc.TBQueue StreamEntry -> StreamEntry -> IO ()
+    writeBlock delay _ _ | delay >= 4000000 = do
Contributor

This is a little opaque. Why 4000000?

Contributor Author

@karknu karknu Oct 10, 2018


4000000 corresponds to a 4-second delay. We need a limit for when we give up, and since the time we sleep doubles after every attempt, a 4-second limit seemed reasonable.
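For readers of this thread, a minimal sketch of the scheme being described, reconstructed from the diff and this comment rather than copied from the PR (the timeout action here is a placeholder putStrLn; the real code would log or raise an error): start at 1024 µs, double after every full-queue retry, and give up once the delay parameter reaches 4000000 µs.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM

-- Simplified stand-in for the real StreamEntry type.
data StreamEntry = StreamBlock Int | StreamEnd

-- Try a non-blocking write; if the queue is full, sleep for
-- 'delay' microseconds and retry with the delay doubled, giving
-- up once the delay parameter reaches 4 seconds.
writeBlock :: Int -> TBQueue StreamEntry -> StreamEntry -> IO ()
writeBlock delay q entry
    | delay >= 4000000 = putStrLn "writeBlock: giving up after ~4s"
    | otherwise = do
        written <- atomically $ do
            full <- isFullTBQueue q
            if full
                then pure False
                else writeTBQueue q entry >> pure True
        if written
            then pure ()
            else threadDelay delay >> writeBlock (2 * delay) q entry
```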

@@ -348,7 +362,7 @@ streamBlocks logTrace smM logic streamWindow enqueue nodeId tipHeader checkpoint
     requestBlocks :: Conc.TVar Bool -> Conc.TBQueue StreamEntry -> IO (Conc.TVar (OQ.PacketStatus ()))
     requestBlocks fallBack blockChan = do
         convMap <- enqueue (MsgRequestBlocks (S.singleton nodeId))
-            (\_ _ -> (Conversation $ \it -> requestBlocksConversation blockChan it `finally` writeStreamEnd blockChan) :|
+            (\_ _ -> (Conversation $ \it -> requestBlocksConversation blockChan it `onException` writeStreamEnd blockChan) :|
Contributor

If it's essential to use onException rather than finally, then we should put a comment explaining it. Why can requestBatch still use finally?

Contributor Author


requestBatch can get away with finally because the only thing it will write is StreamEnd, for which there will always be room. I will add a comment.
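To spell out the distinction behind this exchange with a generic example (standard Control.Exception semantics, not the PR's code): finally runs its handler on both normal and exceptional exit, while onException runs it only when an exception propagates, then rethrows.

```haskell
import Control.Exception

main :: IO ()
main = do
    -- finally: the handler runs on success and on failure alike.
    putStrLn "success path" `finally` putStrLn "finally: always runs"
    -- onException: the handler runs only when the body throws,
    -- and the exception is rethrown afterwards.
    r <- try (throwIO (ErrorCall "boom")
                 `onException` putStrLn "onException: runs on failure only")
    print (r :: Either ErrorCall ())
```

Read with karknu's reply above: requestBatch's handler only ever writes the terminal StreamEnd, for which there is always room, so its unconditional finally cannot wedge.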

@karknu karknu force-pushed the karknu/cbr-424_update branch from 3270849 to b319a2a on October 11, 2018 08:48
@karknu karknu merged commit 69cb520 into develop Oct 11, 2018
@karknu karknu deleted the karknu/cbr-424_update branch October 11, 2018 11:04