Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Spawn task for block proof verification (#288) #308

Merged
merged 1 commit into from
Jan 13, 2025

Conversation

aszepieniec
Copy link
Contributor

  • Wrap call to triton_vm::verify in spawn_blocking.
  • Make function async and percolate async markers.
  • Sandwich relevant call between debug messages.

All observations point to this change eliminating issue #282. However, we do not have a mechanical explanation as to what causes the apparent fix.

Verifying a block proof takes on the order of hundreds of microseconds.
It is good practice to wrap such calls inside of a `spawn_blocking`
invocation, which is what this commit does.

Besides observing good practice, it seems as though this change fixes
issue #282. The reported crash was not yet observed with this change
in place. However, we do not understand the relevant mechanics or
have an educated guess as to why this change causes the issue to go
away. For all we know, the issue might appear again even with this
change.

In order to use `spawn_blocking`, the function in question must be
made `async`. This change percolates up the call graph.

The call to `triton_vm::verify` that pinpoints issue #282, or more
precisely, the `spawn_blocking` wrapper that contains it, is sandwiched
in between log statements. So if the node hangs again, the node operator
will be able to pinpoint the cause.
@aszepieniec aszepieniec changed the base branch from master to release December 29, 2024 12:43
@aszepieniec aszepieniec mentioned this pull request Dec 29, 2024
@aszepieniec
Copy link
Contributor Author

Based on this thought from the discussion in #282 --

Here is a thought though: we cannot rule out that the peer thread/task crashed, without affecting other running threads/tasks. You expect that the main_loop() would still be running, and Ctrl-C should still respond etc. Then how would we observe that the peer thread/task crashed? One possible outcome could be never terminating connections. Maybe #307 is linked after all.

-- I tend to think that this change limits the crash to the thread/task as opposed to the whole application, but does not fix it. While I do think it is good practice to wrap long-running computations like triton_vm::verify inside of spawn_blocking calls, I think it is a good idea to delay proceeding with this PR until issue #282 is comprehensively resolved.

@aszepieniec
Copy link
Contributor Author

With #315 I think issue #282 is comprehensively resolved and so we can proceed with this PR. Any objections, @dan-da, @Sword-Smith?

@aszepieniec aszepieniec merged commit e155c6d into release Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant