Security feedback: Consider employing structured error handling #621

andrey-kuprianov · 2022-12-21T16:47:58Z

Surfaced from @informalsystems security feedback for Interchain Security, at commit 0f57a24

Problem

The current error handling in the codebase represents a mixture of scattered panics (132 in the non-testing code at the time of writing) with the standard Go error handling. Having so many panics in the codebase is dangerous, and may lead to e.g. halting the chains. We recommend to refactor the code and to employ structured error handling, with a few well-justified panics at the top level.

Closing criteria

The code is refactored such that:

There are only a few well-justified panics at the top level of the codebase.
The code reachable from BeginBlockers or EndBlockers is guaranteed not to panic.

Problem details

In the present Interchain Security (ICS) codebase, there is a mixed collection of error handling patterns:

standard Golang error handling, via returning an instance of the error interface;
an abundant number of panics: 132 (in the non-testing code) at the time of writing.

To illustrate, we will take as an example the function IterateVscSendTimestamps(): the function calls two functions, types.ParseVscSendingTimestampKey and sdk.ParseTimeBytes, each returning an instance of error, and panics in both cases in case an error is indeed returned:

func (k Keeper) IterateVscSendTimestamps(
	ctx sdk.Context,
	chainID string,
	cb func(vscID uint64, ts time.Time) (stop bool),
) {
	store := ctx.KVStore(k.storeKey)
	iterator := sdk.KVStorePrefixIterator(store, types.ChainIdWithLenKey(types.VscSendTimestampBytePrefix, chainID))
	defer iterator.Close()

	for ; iterator.Valid(); iterator.Next() {
		key := iterator.Key()
		_, vscID, err := types.ParseVscSendingTimestampKey(key)
		if err != nil {
			panic(fmt.Errorf("failed to parse VscSendTimestampKey: %w", err))
		}
		ts, err := sdk.ParseTimeBytes(iterator.Value())
		if err != nil {
			panic(fmt.Errorf("failed to parse timestamp value: %w", err))
		}

		stop := cb(vscID, ts)
		if stop {
			return
		}
	}
}

Considering the call sites of the above function, we see the following two cases:

In function StopConsumerChain():
- StopConsumerChain() itself returns an error, so panicking in IterateVscSendTimestamps() on error is unnecessary, because the error could be simply propagated up the stack.
In function EndBlockCCR(), which is called in turn from EndBlock():
- Panicking in this context is dangerous, because a panic from EndBlock() would halt the chain.

We've heard some arguments that panics as they are applied in the ICS are valid, as they result from errors of very basic operations, like accessing the SDK store: if these operations fail then there is something fundamentally broken with the environment. While valid to a certain degree, we would like to propose a few counterarguments:

Even the most basic functionality may contain bugs; in particular we have witnessed a few bugs in the very foundational layers of Cosmos SDK. Panicking on errors creates a new attack vector, which could be exploited to e.g. halt the chain. Consider this scenario:
- there is a dormant bug in one of the basic Cosmos SDK functions, which can be triggered only under a specific combination of parameters (a corner case);
- an attacker sees that when a certain Cosmos SDK function is called from ICS, a panic happens in BeginBlocker or EndBlocker;
- the attacker crafts function inputs in a specific way in order to trigger the bug, and thus halts the chain (e.g. Cosmos Hub).
A internal function where the error happens (like IterateVscSendTimestamps() above) can't decide whether panicking is an appropriate course of action: it misses the context. Only the top-level functions have the complete context; all others should not panic.
Properly examining all 132 sites where panicking happens is an infeasible task, and no guarantees can be provided that some of those panics can't be triggered from a BeginBlocker or an EndBlocker, and halt the chain.

Recommendations

Introduce a hierarchy of error types, inheriting from the standard Go error interface.
Refactor all functions with panic, remove them, and return the appropriate errors instead.
Wrap errors going up the call stack, as is a standard practice since Go 1.13.
Examine the returned errors at the top level functions, and decide on the appropriate course of action, depending on the complete context.
Avoid having any panics in the low/intermediate level ICS code. The only place for panics should be at the top level. (There is an exception to this rule, like libraries for unbounded arithmetics, but to the best of our knowledge ICS doesn't contain such libraries)
Employ the Defer/Panic/Recover paradigm in BeginBlockers and EndBlockers: "wrap" the top level code in those functions using the defer statement, which, upon activation, should take the action appropriate for ICS purposes, but then recover.

We would like to explore the latter point in more details. However important Interchain Security might be, it should be always kept in mind that it is only one of the modules deployed on Cosmos Hub. Even if ICS can't function properly under certain circumstances, it should not "kill" Cosmos Hub via panicking in BeginBlockers or EndBlockers. The appropriate course of action here, instead of panicking, seems to be:

Issue a Cosmos SDK event with the error;
Log the error;
Disable, temporarily, either some part of ICS, or the whole of it.

The mechanism of disabling/re-enabling (parts of) ICS is outside of the scope of this issue, and will be discussed separately.

TODOs

Labels have been added for issue
Issue has been added to the ICS project

The text was updated successfully, but these errors were encountered:

mpoke · 2023-01-10T16:24:09Z

We went over all the panic calls in the repo. The following PRs will close this issue:

The PRs either remove panic calls when possible or document the need for such calls.

shaspitz added this to Replicated Security Dec 21, 2022

shaspitz moved this to Todo in Replicated Security Dec 21, 2022

andrey-kuprianov changed the title ~~Consider employing structured error handling~~ Security feedback: Consider employing structured error handling Dec 23, 2022

andrey-kuprianov mentioned this issue Dec 23, 2022

Security feedback: Consider implementing module isolation and modular shutdown #627

Open

5 tasks

mpoke added the source: audit To indicate an issue found during an audit. label Jan 4, 2023

mpoke self-assigned this Jan 4, 2023

mpoke moved this from Todo to Next in Replicated Security Jan 4, 2023

mpoke mentioned this issue Jan 5, 2023

Handle panic in Reward Distribution #638

Merged

2 tasks

This was referenced Jan 5, 2023

Throttle related panic cleanup #640

Merged

Key assignment related panic cleanup #642

Merged

mpoke moved this from Next to In Progress in Replicated Security Jan 6, 2023

This was referenced Jan 6, 2023

Provider panic cleanup #647

Merged

Consumer panic cleanup #655

Merged

mpoke moved this from In Progress to Waiting for review in Replicated Security Jan 10, 2023

shaspitz closed this as completed in #638 Jan 11, 2023

github-project-automation bot moved this from Waiting for review to Done in Replicated Security Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security feedback: Consider employing structured error handling #621

Security feedback: Consider employing structured error handling #621

andrey-kuprianov commented Dec 21, 2022 •

edited by mpoke

Loading

mpoke commented Jan 10, 2023

Security feedback: Consider employing structured error handling #621

Security feedback: Consider employing structured error handling #621

Comments

andrey-kuprianov commented Dec 21, 2022 • edited by mpoke Loading

Problem

Closing criteria

Problem details

Recommendations

TODOs

mpoke commented Jan 10, 2023

andrey-kuprianov commented Dec 21, 2022 •

edited by mpoke

Loading