
[R4R] Implement BEP 130: Parallel Transaction Execution #12

Merged · 8 commits · May 26, 2022

Conversation

setunapo
Collaborator

Description

This is the implementation of BEP 130: Parallel Transaction Execution.
For the motivation and architecture design, refer to the BEP document.
To be precise, this is the implementation of Parallel 1.0; Parallel 2.0 (performance enhancements) and Parallel 3.0 (validator mode) will follow later.

With Parallel 1.0, we mainly did 4 jobs:
** Architecture design
** Implementation of the major modules and workflow
** Performance & resource: we optimized the pipeline to save CPU & memory cost
** Stability: we verified most of BSC's blocks with parallel mode enabled; corner cases and stability issues took a lot of effort

Specification

Architecture

Parallel transaction execution will only touch the execution layer, mainly state_processor.go and state_db.go, the architecture can be briefly described with 3 diagrams:

  • Module
  • Pipeline
  • Lifecycle of transaction

Module
SlotDB Reuse was implemented and then removed in Parallel 1.0; this will be explained later.

Pipeline
Take a concurrency of 8 as an example for a brief view.
Lifecycle of transaction
If there is no conflict, the lifecycle is straightforward.

If there is a conflict, a Redo is added to the lifecycle.
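The lifecycle above can be sketched as a small loop (a minimal sketch; `runLifecycle` and `executionResult` are illustrative names, not identifiers from this PR):

```go
package main

import "fmt"

// executionResult is a hypothetical stand-in for a slot's output: the
// world-state version it executed against, and whether any value it
// read has since been changed by an earlier transaction.
type executionResult struct {
	baseTxIndex int
	dirtyRead   bool
}

// runLifecycle sketches the per-transaction loop: execute, detect
// conflict, redo on conflict, merge once the result is clean.
// It returns how many redos were needed.
func runLifecycle(execute func(baseTxIndex int) executionResult, latestTxIndex int) int {
	redos := 0
	for {
		res := execute(latestTxIndex)
		if !res.dirtyRead { // no conflict: result can be merged
			return redos
		}
		redos++ // conflict: redo against the latest world state
	}
}

func main() {
	// Simulate one conflict followed by a clean run.
	calls := 0
	redos := runLifecycle(func(base int) executionResult {
		calls++
		return executionResult{baseTxIndex: base, dirtyRead: calls == 1}
	}, 5)
	fmt.Println("redos:", redos) // 1
}
```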

The Dispatcher is responsible for transaction preparation, dispatch, and result merge. The dispatch policy can impact execution performance: we try to dispatch potentially conflicting transactions to the same slot, to reduce the conflict rate.

An Execution Slot is a separate goroutine responsible for transaction execution, conflict detection, and transaction redo. The number of execution slots equals parallel.num, which can be configured with the command option --parallel.num 4. The default depends on the CPU count: if CPUNum < 10, we set it to CPUNum - 1; otherwise we default to 8. More concurrency has side effects and is not linearly beneficial; we found a concurrency of 8 slightly better than 15 with CPUNum = 16.
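The default-slot-count rule can be expressed directly (a sketch; `defaultParallelNum` is an illustrative name, not the PR's actual function):

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultParallelNum mirrors the rule described above: with fewer than
// 10 CPUs use CPUNum-1 slots, otherwise cap at 8, since more
// concurrency was not linearly beneficial in testing.
func defaultParallelNum(cpuNum int) int {
	if cpuNum < 10 {
		return cpuNum - 1
	}
	return 8
}

func main() {
	for _, cpus := range []int{4, 9, 10, 16} {
		fmt.Printf("cpus=%d -> slots=%d\n", cpus, defaultParallelNum(cpus))
	}
	_ = runtime.NumCPU() // a real node would pass runtime.NumCPU() here
}
```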

Slot DB is a new module, created for each parallel execution slot. All state changes made by the slot are recorded in this slot DB first and merged into the main StateDB once the execution result is confirmed.

Conflict Detect & Redo: transaction execution has dependencies in BSC and Ethereum, since all transactions share a single world state and are ordered within a block. Conflict detection checks whether the result of a parallel-executed transaction is valid. We use dirty reads for conflict detection: if any state read by the transaction (balance, nonce, code, KV storage...) was changed (dirty), we mark it as conflicted. Conflicted transactions are scheduled for a Redo with the latest world state.
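A dirty-read conflict check of this shape could look like the following (a simplified sketch with string keys standing in for the real StateDB records; all names are illustrative):

```go
package main

import "fmt"

// readRecord captures one state read made during parallel execution:
// which key was read (balance, nonce, code, or a storage slot) and
// the value observed at execution time.
type readRecord struct {
	key      string // e.g. "addr1.balance" or a storage slot key
	observed string // value seen when the transaction executed
}

// hasConflict reports whether any value this transaction read was
// changed (became dirty) by transactions confirmed since it ran.
func hasConflict(reads []readRecord, confirmed map[string]string) bool {
	for _, r := range reads {
		if latest, changed := confirmed[r.key]; changed && latest != r.observed {
			return true // dirty read: the observed value is stale
		}
	}
	return false
}

func main() {
	reads := []readRecord{{"addr1.balance", "100"}, {"addr2.nonce", "7"}}
	confirmed := map[string]string{"addr1.balance": "90"} // changed by an earlier tx
	fmt.Println("conflict:", hasConflict(reads, confirmed)) // true -> schedule a Redo
}
```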

The Merger is responsible for merging confirmed results into the main StateDB. It runs within the dispatcher routine, since it needs to access the main StateDB. To keep things concurrency-safe, we limit access to the main StateDB to the dispatcher only; the concurrent execution slots cannot access the main StateDB directly.

Staged Execution: applyTransaction() is split into 2 stages: the Execution Stage and the Finalize Stage. The execution stage is pure EVM execution; its result may be conflicted, and a Redo can be scheduled to get a valid result. Only once the execution stage's result is confirmed is the finalize stage executed, finalizing the result and moving dirty objects into the pending object list.
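The two-stage split can be sketched like this (a toy model with a map standing in for the StateDB; names are illustrative, not the PR's code):

```go
package main

import "fmt"

// slotResult holds the outcome of the pure-EVM execution stage; in the
// real design it lives in the slot's own SlotDB until confirmed.
type slotResult struct {
	dirtyObjects map[string]int // simplified dirty state: addr -> balance
	conflicted   bool
}

// executionStage runs the transaction against a base-state snapshot;
// its result may later be found conflicted and redone.
func executionStage(base map[string]int) slotResult {
	dirty := map[string]int{
		"sender":   base["sender"] - 10,
		"receiver": base["receiver"] + 10,
	}
	return slotResult{dirtyObjects: dirty}
}

// finalizeStage runs only after the result is confirmed: dirty objects
// are moved into the main (pending) state.
func finalizeStage(mainState map[string]int, res slotResult) {
	for addr, v := range res.dirtyObjects {
		mainState[addr] = v
	}
}

func main() {
	mainState := map[string]int{"sender": 100, "receiver": 0}
	res := executionStage(mainState)
	if !res.conflicted { // confirmed: safe to finalize
		finalizeStage(mainState, res)
	}
	fmt.Println(mainState["sender"], mainState["receiver"]) // 90 10
}
```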

SlotDB Reuse was added to reduce memory consumption, but it has been removed now that Copy-On-Write is in place; reuse adds complexity with limited gains. It could be re-enabled in Parallel 2.0 to address the GC problem.

Copy-On-Write also reduces memory consumption: a concurrent transaction does not copy a StateObject unless it tries to write it. We don't need to copy all the StateObjects of the main StateDB, since only a subset is needed.
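A minimal copy-on-write sketch, assuming a map-backed state (names and types are illustrative, not the PR's):

```go
package main

import "fmt"

// stateObject is a simplified account object; the real one also
// carries nonce, code, and storage.
type stateObject struct{ balance int }

// cowDB sketches copy-on-write: reads can go to the shared base, and
// an object is deep-copied into the slot's private set only on its
// first write.
type cowDB struct {
	base    map[string]*stateObject // shared, never mutated by this slot
	private map[string]*stateObject // this slot's copies
}

// getForWrite returns this slot's writable copy of the object,
// cloning it from the base on first access.
func (db *cowDB) getForWrite(addr string) *stateObject {
	if obj, ok := db.private[addr]; ok {
		return obj
	}
	cp := *db.base[addr] // deep copy happens only on first write
	db.private[addr] = &cp
	return &cp
}

func main() {
	base := map[string]*stateObject{"a": {balance: 100}, "b": {balance: 50}}
	db := &cowDB{base: base, private: map[string]*stateObject{}}
	db.getForWrite("a").balance -= 10 // "b" is never copied
	fmt.Println(base["a"].balance, db.private["a"].balance, len(db.private)) // 100 90 1
}
```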

Compatibility Test

Parallel execution has been verified on most BSC blocks. As shown below, the compatibility test has been running for more than 1 month on 3 test beds; although there have been many code changes since then, it still demonstrates the compatibility of the parallel implementation.

Performance Test

It is a pity that the performance result is not as good as expected. We ran the performance comparison test several times; the results are summarized as:

BSC v1.1.8 is a performance release; it uses prefetch to accelerate transaction execution.
Parallel mode is worse than v1.1.8, but better than v1.1.7.

Another test report with metrics

Why even worse

un-balanced transactions ⭐️⭐️⭐️⭐️⭐️
Parallel 1.0 will not execute the next transaction until the previous transaction is finalized; this is reasonable, but it makes the pipeline inefficient.
Parallel 2.0 will use a more aggressive policy, the No-Waiting Streaming Pipeline, to fix this.

additional costs ⭐️⭐️⭐️
There are additional costs to schedule the pipeline, and the accumulated cost cannot be ignored. A rough estimate is below:

  • Dispatch IPC: ~20us (3 IPC with no conflict, 5 IPC on conflict)
  • New SlotDB: ~50us
  • Conflict Detect: ~20us
  • Merge: ~100us
  • StateObject Copy: ~10us per StateObject (~3 deep copies on average)
    Total: ~250us

Transaction Redo ⭐️⭐️
The conflict rate has been reduced from ~40% to ~10%, so its impact has been mitigated; we mark it with 2 stars.

GC ⭐️⭐️
GC is more frequent in parallel mode: we may allocate an additional ~50KB of memory per transaction, mainly to keep its state changes. That is almost 10MB per block (~200 transactions per block). GC has a big impact on execution.

CPU Cost

Just for reference, CPU occupation is a bit higher in parallel mode.

WAGMI: We are going to make it

Although Parallel 1.0's performance result is a little disappointing, we are still confident that parallel execution is quite valuable.
Parallel 2.0 is already scheduled; it will try to solve these known performance bottlenecks and make performance much better.

setunapo and others added 8 commits February 14, 2022 15:43
Add a new interface StateProcessor.ProcessParallel(...); it is a
copy of Process(...) right now.
This patch is a placeholder; we will implement BEP-130 based on it.
** modules of init, slot executor and dispatcher
BEP 130 parallel transaction execution maintains a tx execution routine
pool, with a configured number of slots (routines) to execute transactions.

Init is executed once on startup and creates the routine pool.
The slot executor is where transactions are executed.
The dispatcher is the module that dispatches transactions to the right slot.

** workflow: Stage Apply, Conflict Detector, Slot, Gas...
  > two stages of applyTransaction
  For sequential execution, applyTransaction does transaction execution and
  result finalization.

  > Conflict detector
  We check the parallel execution result of each transaction.
  If there is a conflict, the result cannot be committed; a redo will
  be scheduled to update its StateDB and re-run.
  For parallel execution, the execution result may not be reliable (conflict); with the
  try-rerun policy, a transaction may be executed more than once to get the correct result.
  Once the result is confirmed, we finalize it to the StateDB.
  Balance, KV, account Create & Suicide... are all checked,
  and the conflict window is important for the conflict check.

  > Slot StateDB
  Each slot has a StateDB to execute transactions in the slot.
  The world-state changes are stored in this StateDB and merged into the main StateDB
  when the transaction result is confirmed. SlotState.slotdbChan is the currently executing TX's slotDB.
  Only dirty state objects are allowed to merge back; otherwise, there is a race condition
  of merging outdated state objects back.

** others
gas pool, transaction gas, gas fee reward to system address
evm instance, receipt CumulativeGasUsed & Log Index,
contract creation, slot state,
parallel routine safety:
  1. only the dispatcher can access the main stateDB
  2. slotDB is created and merged to the stateDB in the dispatch goroutine.

** workflow 2: CopyForSlot, redesigned dispatch, slot StateDB reuse & several bugfixes

  > simplify StateDB copy with CopyForSlot
  only copy dirtied state objects
  delete prefetcher

** redesigned dispatch, slot StateDB reuse...
  > dispatch enhancement
  remove atomic idle, curExec...; replaced by pendingExec for each slot.

  > slot StateDB reuse
  It tries to reuse the latest merged slotDB in the same slot.
  If reuse fails (conflict), it updates to the latest world state and redoes.
  The reused SlotDB keeps the same BaseTxIndex, since its world state was in sync when it was created based on that txIndex.
  The conflict check can skip the current slot now.
  Reusing the SlotDB is more aggressive for idle dispatch:
  not only pending Txs but also idle-dispatched Txs now try to reuse the SlotDB.

** others
state changes no longer need to store the value

add "--parallel" startup option
Parallel is not enabled by default.
To enable it, just add a simple flag to geth: --parallel
To configure the parallel execution parameters: --parallel.num 20 --parallel.queuesize 30
"--parallel.num" is the number of parallel slots to execute Txs; by default it is CPUNum-1
"--parallel.queuesize" is the max pending queue size for each slot; by default it is 10

For example:
  ./build/bin/geth --parallel
  ./build/bin/geth --parallel --parallel.num 10
  ./build/bin/geth --parallel --parallel.num 20 --parallel.queuesize 30

** several BugFixes
1. system address balance conflict
  We treat the system address as a special address, since each transaction
  pays its gas fee to it.
  Parallel execution resets its balance in the slotDB; if a transaction tries to access
  its balance, it will read 0. If the contract needs the real system address
  balance, we schedule a redo with the real system address balance.

  One transaction that accessed system address:
  https://bscscan.com/tx/0xcd69755be1d2f55af259441ff5ee2f312830b8539899e82488a21e85bc121a2a

2. fork caused by address state being changed and read in the same block
3. test case error
4. statedb.Copy should initialize parallel elements
5. do merge for snapshot
** move .Process() close to .ProcessParallel()
** InitParallelOnce & preExec & postExec for code maintenance
** MergedTxInfo -> SlotChangeList & debug conflict ratio
** use ParallelState to keep all parallel statedb states.
** enable queue to same slot

** discard state changes of reverted transactions
   and debug log refinements

** add ut for statedb
…ch for parallel

this patch has 4 changes:
1. change the default queuesize to 20, since 10 could be insufficient and cause more conflicts
2. enable slot DB trie prefetch, using the prefetch of the main state DB
3. disable transaction cache prefetch when parallel is enabled,
   since in parallel mode CPU resources could be limited, and parallel has its own piped transaction execution

4. change the dispatch policy
  ** queue based on from address
  ** queue based on to address, trying the next slot if the current one is full
  Since the from address is used for the dispatch policy,
  the pending transactions in a slot could have several different
  To addresses, so we compare the To address of every pending transaction.
** use a sync map for the stateObjects in parallel
** others
  fix a SlotDB reuse bug & enable it
  delete unnecessary parallel initialization for non-slot DBs
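The from/to-based dispatch policy described in this patch can be sketched as follows (illustrative types and policy details; the real dispatcher also handles queue sizes and idle slots):

```go
package main

import "fmt"

type tx struct{ from, to string }

type slot struct {
	pending []tx
	cap     int // max pending queue size, cf. --parallel.queuesize
}

// dispatch sketches the policy described above: prefer a slot that
// already holds a transaction sharing a from or to address (to
// co-locate likely conflicts), falling back to the least-loaded slot.
func dispatch(slots []*slot, t tx) int {
	for i, s := range slots {
		for _, p := range s.pending {
			sameAddr := p.from == t.from || p.to == t.to || p.from == t.to || p.to == t.from
			if sameAddr && len(s.pending) < s.cap {
				slots[i].pending = append(slots[i].pending, t)
				return i
			}
		}
	}
	// no address match (or the matching slot is full): least loaded
	best := 0
	for i, s := range slots {
		if len(s.pending) < len(slots[best].pending) {
			best = i
		}
	}
	slots[best].pending = append(slots[best].pending, t)
	return best
}

func main() {
	slots := []*slot{{cap: 20}, {cap: 20}}
	fmt.Println(dispatch(slots, tx{from: "A", to: "X"})) // least loaded -> slot 0
	fmt.Println(dispatch(slots, tx{from: "B", to: "Y"})) // least loaded -> slot 1
	fmt.Println(dispatch(slots, tx{from: "C", to: "X"})) // same To as slot 0's tx -> slot 0
}
```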
…t, prefetch, fork

This is a complicated patch with several fixups.

** fix MergeSlotDB
Since copy-on-write is used, a transaction does a StateObject deepCopy before it writes state.
All dirty state changes are recorded in this copy first, and ownership is
transferred to the main StateDB on merge.
There is a potential race condition: a simple ownership transfer may discard state changes
made by other concurrent transactions.
When copy-on-write is used, we should do a StateObject merge.

** fix Suicide
Suicide has an address state read operation.
It also needs copy-on-write, to avoid damaging the main StateDB's state object.

** fix conflict detect
If a state read is non-zero, conflict detection should check addr state changes first.
Do conflict detection even within the current slot: with copy-on-write and slotDB reuse,
the same slot can have a race condition that causes conflicts.

** disable prefetch on slotDB
trie prefetch should be started on the main DB on Merge

** Add/Sub zero balance, SetState
These are void operations, optimized to reduce the conflict rate.
A simple test shows the conflict rate dropped from ~25% to ~12%.

** fix a fork on block 15,338,563
It was a nonce conflict caused by the opcodes opCreate & opCreate2.
Generally, the nonce is advanced by 1 for the transaction sender;
but opCreate & opCreate2 try to create a new contract, so the
caller advances its nonce too.

This makes nonce conflict detection more complicated: since the nonce is a
fundamental part of an account, as long as it has changed, we mark
the address as StateChanged, and any concurrent access to it is considered
a conflict.
** optimize conflict for AddBalance(0)
Adding a balance of 0 does nothing, but it performs an empty() check and adds
a touch event. On transaction finalize, the touch event checks whether
the StateObject is empty and deletes it if so.

This patch treats the empty check as a state check: if the addr state has
not been changed (create, suicide, empty-delete), then the empty check is reliable.
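The idea of treating AddBalance(0) as a read rather than a write can be sketched as below (illustrative names; the real check also tracks create/suicide/empty-delete on the address):

```go
package main

import "fmt"

// accessKind distinguishes a state read from a state change when the
// conflict detector later compares transactions.
type accessKind int

const (
	readAccess accessKind = iota
	writeAccess
)

// recordAddBalance sketches the optimization: adding zero does not
// change the balance, so record it as an empty-check *read* instead
// of a write, which lowers the measured conflict rate. The empty
// check stays reliable as long as the address state (create, suicide,
// empty-delete) has not changed concurrently.
func recordAddBalance(amount int) accessKind {
	if amount == 0 {
		return readAccess // void operation: only the empty() check matters
	}
	return writeAccess
}

func main() {
	fmt.Println(recordAddBalance(0) == readAccess)  // true
	fmt.Println(recordAddBalance(5) == writeAccess) // true
}
```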

** optimize conflict for system address

** some code improvement & lint fixup & refactor for params

** remove reuse SlotDB
Reuse SlotDB was added to reduce copies of StateObjects, in order to mitigate
the Go GC problem.
COW (Copy-On-Write) addresses the GC problem too. With COW enabled,
reuse can be removed, as it now has limited benefit and adds complexity.

** fix trie prefetch on dispatcher

Trie prefetch is scheduled on object finalize.
With parallel, we should schedule trie prefetch on the dispatcher, since
the TriePrefetcher is not safe for concurrent access and is created & stopped
on the dispatcher routine.

But object.finalize on a slot cleared its dirtyStorage, which broke the later trie
prefetch on the dispatcher during MergeSlotDB.
No fundamental change; some improvements, including:
** Add a new type ParallelStateProcessor
** move Parallel Config to BlockChain
** more precise ParallelNum setting
** Add EnableParallelProcessor()
** remove panic()
** remove the useless redo flag
** change waitChan from `chan int` to `chan struct{}` and communicate by close()
** dispatch policy: queue `from` ahead of `to`
** pre-allocate allLogs
** disable the parallel processor if snapshot is not enabled
** others: renames...
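The waitChan change works because close() is a broadcast: every goroutine blocked receiving on a `chan struct{}` is released by a single close(), with zero payload, whereas sending ints wakes only one receiver per send. A minimal demonstration (illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// releaseAll starts n waiter goroutines blocked on waitChan and
// releases them all with a single close(). It returns how many
// waiters were released.
func releaseAll(n int) int32 {
	waitChan := make(chan struct{})
	var released int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-waitChan // a receive on a closed channel returns immediately
			atomic.AddInt32(&released, 1)
		}()
	}
	close(waitChan) // one close wakes every waiting goroutine
	wg.Wait()
	return released
}

func main() {
	fmt.Println("released:", releaseAll(3)) // released: 3
}
```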
@setunapo setunapo changed the title Parallel dev [R4R] Implement BEP 130: Parallel Transaction Execution May 26, 2022
@tomatoishealthy

tomatoishealthy commented Aug 29, 2023

Parallel 1.0 will not execute the next transaction until the previous transaction is finalized; it is reasonable, but it makes the pipeline inefficient

Maybe the STM model of Aptos could give some inspiration.
