You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current error handling in the codebase represents a mixture of scattered panics (132 in the non-testing code at the time of writing) with the standard Go error handling. Having so many panics in the codebase is dangerous, and may lead to e.g. halting the chains. We recommend to refactor the code and to employ structured error handling, with a few well-justified panics at the top level.
Closing criteria
The code is refactored such that:
There are only a few well-justified panics at the top level of the codebase.
The code reachable from BeginBlockers or EndBlockers is guaranteed not to panic.
Problem details
In the present Interchain Security (ICS) codebase, there is a mixed collection of error handling patterns:
standard Golang error handling, via returning an instance of the error interface;
an abundant number of panics: 132 (in the non-testing code) at the time of writing.
To illustrate, we will take as an example the function IterateVscSendTimestamps(): the function calls two functions, types.ParseVscSendingTimestampKey and sdk.ParseTimeBytes, each returning an instance of error, and panics in both cases in case an error is indeed returned:
StopConsumerChain() itself returns an error, so panicking in IterateVscSendTimestamps() on error is unnecessary, because the error could be simply propagated up the stack.
Panicking in this context is dangerous, because a panic from EndBlock() would halt the chain.
We've heard some arguments that panics as they are applied in the ICS are valid, as they result from errors of very basic operations, like accessing the SDK store: if these operations fail then there is something fundamentally broken with the environment. While valid to a certain degree, we would like to propose a few counterarguments:
Even the most basic functionality may contain bugs; in particular we have witnessed a few bugs in the very foundational layers of Cosmos SDK. Panicking on errors creates a new attack vector, which could be exploited to e.g. halt the chain. Consider this scenario:
there is a dormant bug in one of the basic Cosmos SDK functions, which can be triggered only under a specific combination of parameters (a corner case);
an attacker sees that when a certain Cosmos SDK function is called from ICS, a panic happens in BeginBlocker or EndBlocker;
the attacker crafts function inputs in a specific way in order to trigger the bug, and thus halts the chain (e.g. Cosmos Hub).
A internal function where the error happens (like IterateVscSendTimestamps() above) can't decide whether panicking is an appropriate course of action: it misses the context. Only the top-level functions have the complete context; all others should not panic.
Properly examining all 132 sites where panicking happens is an infeasible task, and no guarantees can be provided that some of those panics can't be triggered from a BeginBlocker or an EndBlocker, and halt the chain.
Recommendations
Introduce a hierarchy of error types, inheriting from the standard Go error interface.
Refactor all functions with panic, remove them, and return the appropriate errors instead.
Examine the returned errors at the top level functions, and decide on the appropriate course of action, depending on the complete context.
Avoid having any panics in the low/intermediate level ICS code. The only place for panics should be at the top level. (There is an exception to this rule, like libraries for unbounded arithmetics, but to the best of our knowledge ICS doesn't contain such libraries)
Employ the Defer/Panic/Recover paradigm in BeginBlockers and EndBlockers: "wrap" the top level code in those functions using the defer statement, which, upon activation, should take the action appropriate for ICS purposes, but then recover.
We would like to explore the latter point in more details. However important Interchain Security might be, it should be always kept in mind that it is only one of the modules deployed on Cosmos Hub. Even if ICS can't function properly under certain circumstances, it should not "kill" Cosmos Hub via panicking in BeginBlockers or EndBlockers. The appropriate course of action here, instead of panicking, seems to be:
Issue a Cosmos SDK event with the error;
Log the error;
Disable, temporarily, either some part of ICS, or the whole of it.
The mechanism of disabling/re-enabling (parts of) ICS is outside of the scope of this issue, and will be discussed separately.
TODOs
Labels have been added for issue
Issue has been added to the ICS project
The text was updated successfully, but these errors were encountered:
Surfaced from @informalsystems security feedback for Interchain Security, at commit 0f57a24
Problem
The current error handling in the codebase represents a mixture of scattered panics (132 in the non-testing code at the time of writing) with the standard Go error handling. Having so many panics in the codebase is dangerous, and may lead to e.g. halting the chains. We recommend to refactor the code and to employ structured error handling, with a few well-justified panics at the top level.
Closing criteria
The code is refactored such that:
BeginBlockers
orEndBlockers
is guaranteed not to panic.Problem details
In the present Interchain Security (ICS) codebase, there is a mixed collection of error handling patterns:
error
interface;To illustrate, we will take as an example the function IterateVscSendTimestamps(): the function calls two functions,
types.ParseVscSendingTimestampKey
andsdk.ParseTimeBytes
, each returning an instance oferror
, and panics in both cases in case an error is indeed returned:Considering the call sites of the above function, we see the following two cases:
StopConsumerChain()
itself returns anerror
, so panicking inIterateVscSendTimestamps()
onerror
is unnecessary, because the error could be simply propagated up the stack.EndBlock()
would halt the chain.We've heard some arguments that panics as they are applied in the ICS are valid, as they result from errors of very basic operations, like accessing the SDK store: if these operations fail then there is something fundamentally broken with the environment. While valid to a certain degree, we would like to propose a few counterarguments:
BeginBlocker
orEndBlocker
;IterateVscSendTimestamps()
above) can't decide whether panicking is an appropriate course of action: it misses the context. Only the top-level functions have the complete context; all others should not panic.BeginBlocker
or anEndBlocker
, and halt the chain.Recommendations
error
interface.panic
, remove them, and return the appropriate errors instead.BeginBlockers
andEndBlockers
: "wrap" the top level code in those functions using thedefer
statement, which, upon activation, should take the action appropriate for ICS purposes, but then recover.We would like to explore the latter point in more details. However important Interchain Security might be, it should be always kept in mind that it is only one of the modules deployed on Cosmos Hub. Even if ICS can't function properly under certain circumstances, it should not "kill" Cosmos Hub via panicking in
BeginBlockers
orEndBlockers
. The appropriate course of action here, instead of panicking, seems to be:The mechanism of disabling/re-enabling (parts of) ICS is outside of the scope of this issue, and will be discussed separately.
TODOs
The text was updated successfully, but these errors were encountered: