produce a consensus outcome for fatal errors #508

Closed
raulk opened this issue Apr 22, 2022 · 0 comments · Fixed by #548

Context

Fatal errors are raised when we encounter system-level unexpected conditions during message execution. They are severe and usually indicate a correctness flaw. Either something is found to be broken at runtime (e.g. state tree cannot be decoded, init actor is not found, etc.), or we've hit some kind of programming error.

Fatal errors occur in the FVM itself, outside actor code. Panics in actor code are properly handled by emitting exit code USR_ASSERTION_FAILED. See the FVM error spec.

Currently, on a fatal error, the Executor fails to apply the message and returns the error to the caller (the Filecoin client). However, there is no possible course of action the caller can take.

There are several outcomes here:

  • If the failure is caused by a local condition, the node will fork off from the network.
  • If the failure is caused by a condition reproducible in a subset of nodes, the chain will fork.
  • If the failure is caused by a generalised error, the network could halt.

Proposal

The goal is to allow the chain to make progress in the presence of network-wide fatal errors.

  1. Convert fatal errors into receipts with a designated SYS_INTERNAL_ERROR exit code.
  2. Revert all state tree changes.
  3. Consume all gas; it doesn't matter at which point the error happened (it could have been at different points in different nodes), the resulting gas consumption will be identical.

This results in deterministic behaviour that the network can agree on in the presence of fatal errors.
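
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual DefaultExecutor code: `StateTree`, `Message`, `Receipt`, `ExitCode`, `ExecError` and `apply_message` are all hypothetical, simplified stand-ins, and "consume all gas" is read here as charging the message's full gas limit.

```rust
// Hypothetical, simplified types for illustration only; not the FVM's real API.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExitCode {
    Ok,
    /// Stand-in for the proposed SYS_INTERNAL_ERROR exit code.
    SysInternalError,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Receipt {
    pub exit_code: ExitCode,
    pub gas_used: u64,
}

/// Toy state tree; taking a snapshot is just a clone.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct StateTree {
    pub balances: Vec<(String, u64)>,
}

#[derive(Debug)]
pub enum ExecError {
    /// A system-level failure raised outside actor code.
    Fatal(String),
}

pub struct Message {
    pub gas_limit: u64,
}

/// Applies a message. `run` stands in for message execution and returns the
/// gas used on success, or a fatal error.
pub fn apply_message<F>(state: &mut StateTree, msg: &Message, run: F) -> Receipt
where
    F: FnOnce(&mut StateTree) -> Result<u64, ExecError>,
{
    let snapshot = state.clone();
    match run(state) {
        Ok(gas_used) => Receipt { exit_code: ExitCode::Ok, gas_used },
        Err(ExecError::Fatal(_reason)) => {
            // Step 2: revert all state tree changes.
            *state = snapshot;
            // Steps 1 and 3: emit a receipt with the internal-error exit code
            // and charge the full gas limit (assumed meaning of "all gas").
            Receipt {
                exit_code: ExitCode::SysInternalError,
                gas_used: msg.gas_limit,
            }
        }
    }
}
```

In this sketch the error reason is dropped on purpose: nothing node-specific ends up in the receipt.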

Result

  1. If the fatal error affects a single node, that node will produce the above message receipt leading to a consensus fault with the rest of the network, i.e. node strays off just like before.
  2. If the fatal error affects the entire network, the network agrees that an internal error happened during the processing of that message, and moves on without halting. (There's some chance that different nodes will observe different internal errors at different points in the execution, yet they will arrive at the same result; this is intended).
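
To illustrate the second outcome, here is a small sketch of why nodes that fail at different points still agree. The names (`FatalObservation`, `receipt_for_fatal`, `nodes_agree`) are made up for this example, and the exit code is represented as a plain string rather than its real encoding.

```rust
// Hypothetical illustration; none of these names come from the FVM codebase.

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Receipt {
    pub exit_code: &'static str,
    pub gas_used: u64,
}

/// What one node saw locally when the fatal error hit.
pub struct FatalObservation {
    pub reason: &'static str,
    pub gas_used_at_failure: u64,
}

/// The receipt depends only on the message's gas limit, never on the local
/// observation, so every node that hits a fatal error derives the same value.
pub fn receipt_for_fatal(gas_limit: u64, _obs: &FatalObservation) -> Receipt {
    Receipt {
        exit_code: "SYS_INTERNAL_ERROR",
        gas_used: gas_limit,
    }
}

/// Returns true if all nodes would produce identical receipts.
pub fn nodes_agree(gas_limit: u64, observations: &[FatalObservation]) -> bool {
    let receipts: Vec<Receipt> = observations
        .iter()
        .map(|o| receipt_for_fatal(gas_limit, o))
        .collect();
    receipts.windows(2).all(|pair| pair[0] == pair[1])
}
```

Two nodes reporting, say, "init actor not found" at 120 gas and "state tree cannot be decoded" at 870 gas would both yield the same receipt under this scheme.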

Implementation notes

This change can be entirely self-contained inside the DefaultExecutor.

This resolves the "Panics during message execution" area of investigation under #428.
