Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KIP-227: Candidate and Validator Evaluation #27

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
273 changes: 273 additions & 0 deletions KIPs/kip-202.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,273 @@
---
kip: <to be assigned>
title: Candidate and Validator Evaluation
description: A framework for quantitatively assessing the performance and stability of candidates and validators
author: <>
discussions-to: <URL>
status: Draft
type: Core
created: 2024-11-XX

---


## Simple Summary

This proposal presents VRank, a framework for quantitatively assessing the performance and stability of candidates and validators in the Kaia Chain network.

## Abstract

As Kaia Chain transitions to a permissionless network structure, it is critical to hold validators more accountable. They play an important role in ensuring that the network runs smoothly, securely, and without problems. This KIP introduces VRank, a Validator Reputation Evaluation Framework that quantitatively assesses the performance and stability of both candidates and validators. VRank aims to ensure that nodes involved in the consensus mechanism are trustworthy and capable of meeting the security requirements of the permissionless Kaia Chain network.

## Introduction

It is planned for the Kaia Chain network to change from a Permissioned network to a Permissionless network. In a permissionless network, anyone can become a validator without being approved first. This makes the network more decentralized, safe, and open to everyone. This change fits with Kaia Chain's goal of making the blockchain ecosystem more open and strong. Please refer to [KGP-4: Permissionless Kaia Chain](https://govforum.kaia.io/t/kgp-4-permissionless-klaytn/135) for more thorough details for switching to a permissionless network.

## Motivation

In decentralized networks employing blockchain technology, the reliability and performance of validator nodes are crucial for maintaining network stability, security, and efficiency. Validator nodes propose and validate new blocks, ensuring ledger integrity and establishing trust among participants through consistent operation.

However, not all validator nodes operate at optimal efficiency. Certain individuals may experience frequent failures or delays, while others may exhibit malicious behavior, either deliberately or due to external pressures. These problems may lead to delays in the network, a chance of forks, and higher susceptibility to attacks, such as double-spending and censorship.

The current systems for evaluating validator performance may ineffectively penalize persistent underperformance or malicious behavior, and they fail to consistently encourage optimal performance incentives. A thorough evaluation system is essential to precisely evaluate the reliability of validator nodes, discourage inadequate performance, and improve the overall integrity of the network.

This proposal introduces an innovative evaluation framework for candidates and validators, highlighting measurable performance metrics. Our objective is to create a more resilient and fair framework for assessing node performance by defining metrics such as the Proposal Failure Score (PFS) and Message Transmission Failure Scores (Total and Consecutive). This framework aims to identify underperforming or malicious nodes, which helps preserve high standards among validators.

The framework promotes consistent uptime and reliability. Validators have an incentive to maintain the stability and responsiveness of their nodes, thereby maintaining the performance of the network.

## Specification

### Parameters

| Constant | Value/Definition |
| :---- | :---- |
| `FORK_BLOCK` | TBD |
| `CANDIDATE_READY_TIMEOUT` | 200 milliseconds (0.2 seconds) |
| `BLOCK_TIME` | 1 second per block |
| `EPOCH_LENGTH` | 86,400 blocks (approximately 1 day, assuming 1-second block time) |
| `MAX_BYZANTINE_NODES` (`F`) | Calculated as `F = (n - 1) // 3`, where `n` is the number of validators |
| `DOWNTIME_THRESHOLD` | 0.5% per day (equivalent to fewer than 432 blocks missed per day) |
| `PFS_FAILURE_THRESHOLD` | No more than one block proposal failure per day |
| `CONSECUTIVE_FAILURE_LENGTH_10_CF` | 10 consecutive failures define a 10-CF |
| `CONSECUTIVE_FAILURE_LENGTH_15_CF` | 15 consecutive failures define a 15-CF |

### Definition of a Stable Node

VRank defines a node qualified to perform the role of a validator as follows.

**Downtime**: A stable node must have downtime due to network or node issues below the `DOWNTIME_THRESHOLD` (equivalent to fewer than 432 blocks missed).

**Block Proposal Participation**: A stable node must consistently participate in block proposals with no more than `PFS_FAILURE_THRESHOLD` (one block proposal failure per day) for any reason.

### Node Models Definition

VRank categorizes nodes into four models to evaluate their performance and stability:

| Node | Performance | Impact |
| :---- | :---- | :---- |
| **Uptime \> 99.5%, No network issues** | **Excellent** | **Contribute the network stability** |
| **Uptime about 99.5% temporally unstable** | **Good** | **May delay block time** |
| **Uptime \< 99.5%** | **Not good** | **May Fail to propose a block** |
| **Halts continuously regardless uptime** | **Bad** | **May affect consensus if consist of nodes experiencing this** |
| **Uptime \> 99.5%, try to destabilize the network** | **N/A** | **Threat network integrity** |

1. **Node A: Stable Node**

**Characteristics**: Capable of performing validation duties with optimal performance and stability.

**Impact on Network**: Contributes positively to network stability and performance.



2. **Node B: Temporarily Unstable Node**

**Characteristics**: Experiences brief, frequent network disruptions that last a few seconds.

**Impact on Network**: May delay block creation if selected as a proposer but does not cause a round change.



3. **Node C: Intermittently Stopping Node**

**Characteristics**: Experiences longer network disruptions (tens of seconds).

**Impact on Network**: May fail to propose a block when selected as a proposer, resulting in round changes and significant delays.



4. **Node M: Malicious Node**

**Characteristics**: Intentionally attempts to destabilize the network through malicious actions.

**Impact on Network**: Threatens network security and integrity. VRank aims to mitigate the influence of such nodes.

### Block Header Changes

Starting from `FORK_BLOCK`, the block proposer must include a new field `vrank` in the block header.

```py
class Header:
parentHash: hash
# ... existing fields ...
extra: bytes
governance: bytes
vote: bytes
baseFee: int
randomReveal: bytes
mixHash: bytes
vrank: bytes # New field
```

**The `vrank` field comprises two subfields: `pfReport` and `mfReport`**
**`pfReport`**: A list of block proposers that caused a round change in the previous block, recorded in the order of the rounds.

**`mfReport`**: A mapping that records the candidates who submitted the `CandidateReady` message in the previous block along with their signatures, formatted as `[candidate ID, signature]`.

### Changes to Block Validation Process

Once `FORK_BLOCK` is reached, validators must validate the newly added `vrank` field in the block header. The values of the subfields (`pfReport` and `mfReport`) are used to evaluate node performance using the components of the VRank framework.

## VRank Score Components

The VRank framework evaluates node performance using three independent metrics. These metrics apply separately to validators and candidates, allowing for a more focused assessment of each role's responsibilities.

### 1\. Proposal Failure Score (PFS)

**Definition**: The Proposal Failure Score (PFS) measures the number of times a validator fails to propose a block successfully.

**Measurement Method**: If a validator fails to propose a block, resulting in a round change, the proposal failure count increases by one .

**Consensus Method**: The proposer of block `N+1` records the proposal failure information of block `N` in the form of a list of `(round number, proposer)` within the block header. Validators compare their own records of proposal failures in block `N` with the records in the header of block `N+1` to reach consensus.

### 2\. Message Transmission Failure Score (MFS)

The Message Transmission Failure Score is divided into two components:

a. **Total Message Transmission Failure Score (TMFS)**

b. **Consecutive Message Transmission Failure Score (CMFS)**

#### 2.1 Total Message Transmission Failure Score (TMFS)

**Definition**: Measures the number of times a candidate fails to transmit the expected `CandidateReady` message during a block proposal cycle, after removing the highest `F` of the failure counts to address measurement distortions.

**Measurement Method**: During the evaluation period, if the next proposer `N+1` receives a block proposal from proposer `N` and does not receive the `CandidateReady` message within the specified timeout (`CANDIDATE_READY_TIMEOUT`), the total failure count for `C` increases by 1\.



##### Example: Mitigating Measurement Distortion in TMFS Calculation

Consider a network with 10 validators and 5 candidates. The following table shows the number of message transmission failures reported by each proposer (P1 to P10) for each candidate (C1 to C5).

| Candidate \\ Proposer | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | Total Failures | Filtered Failures |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| **C1** | 14 | 12 | 15 | 34 | 12 | 32 | 20 | 8640 | 8637 | 8634 | 26050 | **139** |
| **C2** | 48 | 10 | 59 | 33 | 49 | 49 | 41 | 8640 | 8637 | 8634 | 26200 | **289** |
| **C3** | 48 | 22 | 40 | 41 | 44 | 27 | 61 | 8640 | 8637 | 8634 | 26194 | **283** |
| **C4** | 50 | 29 | 45 | 30 | 23 | 2 | 42 | 56 | 56 | 64 | 397 | **221** |
| **C5** | 71 | 34 | 62 | 5 | 11 | 20 | 18 | 30 | 19 | 13 | 283 | **116** |

**Explanation**
The high failure counts reported by P8, P9, and P10 for candidates C1 to C3 indicate potential measurement distortion by malicious validators.
To mitigate this, we exclude the highest `F` (in this case, `F=3`) failure counts for each candidate.
The **Filtered Failures** column shows the total failures after excluding the highest 3 counts.

#### 2.2 Consecutive Message Transmission Failure Score (CMFS)

**Definition**: CMFS measures extended sequences of consecutive transmission failures. This metric helps identify nodes with chronic instability.

**Measurement Method**:
If 10 consecutive proposers report that candidate `C` fails to send the `CandidateReady` message, it is recorded as 1 instance of 10-CFs, and the CMFS increases by 1 after 15 such instances.
If 15 consecutive proposers report that candidate `C` fails to send the `CandidateReady` message, it is recorded as 1 instance of 15-CFs, and the CMFS increases by 2 after 10 such instances.

##### CMFS Calculation Example

Consider the following example where candidates C1 and C2 have message transmission failures over a series of blocks:

| Block Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 3-CFs | 5-CFs |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| **C1** | x | x | x | | | x | x | x | | | x | 3 | 0 |
| **C2** | | | | x | x | x | x | x | x | x | x | 1 | 1 |

**Explanation**:
**C1** has three instances of 3 consecutive failures but no instance of 5 consecutive failures.
**C2** has one instance of both 3-CFs and 5-CFs.

##### Pseudocode for CMFS Calculation
```python
def update_cmfs(candidate, success, block_number):
state = cmfs_state[candidate]

if not success:
if state['cf_start'] == 0:
# Start of a new consecutive failure sequence
state['cf_start'] = block_number
else:
if state['cf_start'] != 0:
# End of a consecutive failure sequence
cf_length = block_number - state['cf_start']

if cf_length >= 15:
state['count_15_cf'] += 1
if state['count_15_cf'] == 10:
state['cmfs_score'] += 2
elif cf_length >= 10:
state['count_10_cf'] += 1
if state['count_10_cf'] == 15:
state['cmfs_score'] += 1
# Reset the start point
state['cf_start'] = 0
```

## Rationale

### Importance of Mitigating Malicious Behavior

**Byzantine Nodes**
In a permissionless environment, some validators may act maliciously, attempting to disrupt the network or unfairly penalize honest nodes. it is assumed that up to one-third of the validators may behave maliciously.

**Filtering Mechanisms**
To mitigate the impact of malicious validators, the highest `F` failure reports are excluded in TMFS calculations. This ensures that the actions of a few Byzantine nodes do not distort the evaluation of honest candidates.

**Robust Scoring Algorithms**
VRank's design ensures that honest nodes are not unfairly penalized due to the actions of Byzantine nodes.

### Importance of the 200ms Deadline

**Ensuring Consensus Responsiveness**
The **200ms** deadline for `CANDIDATE_READY_TIMEOUT` ensures that candidates respond promptly, supporting the network's goal of generating blocks every second.

**Balancing Network Latency**
The deadline accounts for global network conditions, allowing for network latency variations without unfairly penalizing candidates.

### Importance of Chronic Failures (CMFS)

Nodes that repeatedly fail to transmit messages consecutively pose a significant risk to network stability. By implementing a robust policy that identifies and addresses these chronic failures through CMFS, the network ensures that only reliable and stable nodes continue participating in consensus.

## Backward Compatibility

The introduction of VRank does not affect existing nodes before `FORK_BLOCK`. Nodes operating prior to `FORK_BLOCK` will continue to function as before. After `FORK_BLOCK`, the new `vrank` field and associated validation processes come into effect.

## Security Considerations

### Handling Byzantine Nodes

**Assumption of One-Third Malicious Validators**
We accept the standard Byzantine fault tolerance assumption that up to one-third of validators may behave maliciously. Kaia Chain relies on the assumption that less than one-third of participants are malicious to ensure safety and liveness. VRank's scoring mechanism is designed with this threshold in mind, allowing the network t o function correctly even in the presence of some malicious actors.

**Limitations and Contingencies**

If the number of malicious validators exceeds one-third, the network's ability to reach consensus and maintain integrity may be compromised.

**Justification for the Assumption**
While it's challenging to prevent all malicious activity, assuming that up to one-third of validators could be compromised provides a practical balance between security and network performance.

## Implementation

TBD

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

Add Comment
Loading