Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security feedback: Consider employing structured error handling #621

Closed
2 of 4 tasks
andrey-kuprianov opened this issue Dec 21, 2022 · 1 comment · Fixed by #638
Closed
2 of 4 tasks

Security feedback: Consider employing structured error handling #621

andrey-kuprianov opened this issue Dec 21, 2022 · 1 comment · Fixed by #638
Assignees
Labels
source: audit To indicate an issue found during an audit.

Comments

@andrey-kuprianov
Copy link

andrey-kuprianov commented Dec 21, 2022

Surfaced from @informalsystems security feedback for Interchain Security, at commit 0f57a24

Problem

The current error handling in the codebase represents a mixture of scattered panics (132 in the non-testing code at the time of writing) with the standard Go error handling. Having so many panics in the codebase is dangerous, and may lead to e.g. halting the chains. We recommend to refactor the code and to employ structured error handling, with a few well-justified panics at the top level.

Closing criteria

The code is refactored such that:

  • There are only a few well-justified panics at the top level of the codebase.
  • The code reachable from BeginBlockers or EndBlockers is guaranteed not to panic.

Problem details

In the present Interchain Security (ICS) codebase, there is a mixed collection of error handling patterns:

  • standard Golang error handling, via returning an instance of the error interface;
  • an abundant number of panics: 132 (in the non-testing code) at the time of writing.

To illustrate, we will take as an example the function IterateVscSendTimestamps(): the function calls two functions, types.ParseVscSendingTimestampKey and sdk.ParseTimeBytes, each returning an instance of error, and panics in both cases in case an error is indeed returned:

func (k Keeper) IterateVscSendTimestamps(
	ctx sdk.Context,
	chainID string,
	cb func(vscID uint64, ts time.Time) (stop bool),
) {
	store := ctx.KVStore(k.storeKey)
	iterator := sdk.KVStorePrefixIterator(store, types.ChainIdWithLenKey(types.VscSendTimestampBytePrefix, chainID))
	defer iterator.Close()

	for ; iterator.Valid(); iterator.Next() {
		key := iterator.Key()
		_, vscID, err := types.ParseVscSendingTimestampKey(key)
		if err != nil {
			panic(fmt.Errorf("failed to parse VscSendTimestampKey: %w", err))
		}
		ts, err := sdk.ParseTimeBytes(iterator.Value())
		if err != nil {
			panic(fmt.Errorf("failed to parse timestamp value: %w", err))
		}

		stop := cb(vscID, ts)
		if stop {
			return
		}
	}
}

Considering the call sites of the above function, we see the following two cases:

  1. In function StopConsumerChain():
    • StopConsumerChain() itself returns an error, so panicking in IterateVscSendTimestamps() on error is unnecessary, because the error could be simply propagated up the stack.
  2. In function EndBlockCCR(), which is called in turn from EndBlock():
    • Panicking in this context is dangerous, because a panic from EndBlock() would halt the chain.

We've heard some arguments that panics as they are applied in the ICS are valid, as they result from errors of very basic operations, like accessing the SDK store: if these operations fail then there is something fundamentally broken with the environment. While valid to a certain degree, we would like to propose a few counterarguments:

  • Even the most basic functionality may contain bugs; in particular we have witnessed a few bugs in the very foundational layers of Cosmos SDK. Panicking on errors creates a new attack vector, which could be exploited to e.g. halt the chain. Consider this scenario:
    • there is a dormant bug in one of the basic Cosmos SDK functions, which can be triggered only under a specific combination of parameters (a corner case);
    • an attacker sees that when a certain Cosmos SDK function is called from ICS, a panic happens in BeginBlocker or EndBlocker;
    • the attacker crafts function inputs in a specific way in order to trigger the bug, and thus halts the chain (e.g. Cosmos Hub).
  • A internal function where the error happens (like IterateVscSendTimestamps() above) can't decide whether panicking is an appropriate course of action: it misses the context. Only the top-level functions have the complete context; all others should not panic.
  • Properly examining all 132 sites where panicking happens is an infeasible task, and no guarantees can be provided that some of those panics can't be triggered from a BeginBlocker or an EndBlocker, and halt the chain.

Recommendations

  • Introduce a hierarchy of error types, inheriting from the standard Go error interface.
  • Refactor all functions with panic, remove them, and return the appropriate errors instead.
  • Wrap errors going up the call stack, as is a standard practice since Go 1.13.
  • Examine the returned errors at the top level functions, and decide on the appropriate course of action, depending on the complete context.
  • Avoid having any panics in the low/intermediate level ICS code. The only place for panics should be at the top level. (There is an exception to this rule, like libraries for unbounded arithmetics, but to the best of our knowledge ICS doesn't contain such libraries)
  • Employ the Defer/Panic/Recover paradigm in BeginBlockers and EndBlockers: "wrap" the top level code in those functions using the defer statement, which, upon activation, should take the action appropriate for ICS purposes, but then recover.

We would like to explore the latter point in more details. However important Interchain Security might be, it should be always kept in mind that it is only one of the modules deployed on Cosmos Hub. Even if ICS can't function properly under certain circumstances, it should not "kill" Cosmos Hub via panicking in BeginBlockers or EndBlockers. The appropriate course of action here, instead of panicking, seems to be:

  • Issue a Cosmos SDK event with the error;
  • Log the error;
  • Disable, temporarily, either some part of ICS, or the whole of it.

The mechanism of disabling/re-enabling (parts of) ICS is outside of the scope of this issue, and will be discussed separately.

TODOs

  • Labels have been added for issue
  • Issue has been added to the ICS project
@shaspitz shaspitz moved this to Todo in Replicated Security Dec 21, 2022
@andrey-kuprianov andrey-kuprianov changed the title Consider employing structured error handling Security feedback: Consider employing structured error handling Dec 23, 2022
@mpoke mpoke added the source: audit To indicate an issue found during an audit. label Jan 4, 2023
@mpoke mpoke self-assigned this Jan 4, 2023
@mpoke mpoke moved this from Todo to Next in Replicated Security Jan 4, 2023
@mpoke mpoke moved this from Next to In Progress in Replicated Security Jan 6, 2023
This was referenced Jan 6, 2023
@mpoke mpoke moved this from In Progress to Waiting for review in Replicated Security Jan 10, 2023
@mpoke
Copy link
Contributor

mpoke commented Jan 10, 2023

We went over all the panic calls in the repo. The following PRs will close this issue:

The PRs either remove panic calls when possible or document the need for such calls.

@github-project-automation github-project-automation bot moved this from Waiting for review to Done in Replicated Security Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
source: audit To indicate an issue found during an audit.
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants