[Demand Scalability] Permissionless demand load testing & validation #742

Open
21 tasks
Olshansk opened this issue Aug 16, 2024 · 8 comments
Assignees
Labels
infra (Infra or tooling related improvements, additions or fixes), scalability, tooling (Tooling - CLI, scripts, helpers, off-chain, etc...)

Comments

@Olshansk
Member

Objective

Ensure the network can manage permissionless gateways, applications, services and other types of demand.

Origin Document

Goals

  • Build & document a set of tools/processes to determine what the MainNet governance parameters need to be
  • Identify any issues in enabling permissionless demand
  • Load (not just stress) test the network to its maximum today
  • https://dev.poktroll.com/operate/testing/load_testing

Deliverables

  • A load test that involves:
    • Scale from 1 to 10,000 on-chain services
    • Scale from 1 to 10,000 on-chain applications/gateways (doesn't matter if self-signing or not)
    • 5 Suppliers per service (don't need independent data nodes)
    • Number of requests per second needs to be "high enough" to create on-chain claims & proofs; can be played around with
  • Data to evaluate (scalability)
    • Growth of on-chain data state / bloat (e.g. # of MBs)
    • OS Metrics (CPU, Memory, Disk) of all actors w/ special focus on:
      • Validators
      • Full Nodes
  • Data to track (visibility)
    • Use pocketdex (indexer) to track # of on-chain claims, proofs and related events
      • Easily query counts of historical on-chain events & types
    • Use grafana dashboards to track relay difficulty
  • Blog post that will be published to the community showing testing process & performance
  • [Optional Bonus] Evaluate the impact (if at all applicable) of session rollovers
    • Question: are there errors during these times?
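
The scale targets in the deliverables above can be turned into a staged ramp. The following is only a rough sketch: the factor-of-10 ramp, the `Stage` type and the `RampStages` helper are assumptions made for illustration and are not part of the issue or of the poktroll tooling. Only the 1 to 10,000 range and the 5 suppliers per service come from the list above.

```go
package main

import "fmt"

// suppliersPerService comes from the deliverables list above.
const suppliersPerService = 5

// Stage describes one step of a hypothetical load-test ramp.
type Stage struct {
	Services         int // on-chain services staked at this stage
	Applications     int // on-chain applications/gateways staked at this stage
	SupplierPairings int // supplier-to-service pairings needed at this stage
}

// RampStages returns stages growing by a factor of 10 from lo up to hi
// (inclusive). The factor-of-10 ramp is an assumption, not a requirement.
func RampStages(lo, hi int) []Stage {
	if lo < 1 {
		lo = 1 // guard against a non-terminating loop when lo is 0 or negative
	}
	var stages []Stage
	for n := lo; n <= hi; n *= 10 {
		stages = append(stages, Stage{
			Services:         n,
			Applications:     n,
			SupplierPairings: n * suppliersPerService,
		})
	}
	return stages
}

func main() {
	for _, s := range RampStages(1, 10_000) {
		fmt.Printf("services=%d applications=%d supplier_pairings=%d\n",
			s.Services, s.Applications, s.SupplierPairings)
	}
}
```

Recording the scalability data (state size, CPU, memory, disk) at the end of each stage would make it easier to see where growth stops being linear.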

Non-goals / Non-deliverables

  • Involving community members in the load test
  • Fixing the relevant issues identified

General deliverables

  • Comments: Add/update TODOs and comments alongside the source code so it is easier to follow.
  • Testing: Add new tests (unit and/or E2E) to the test suite.
  • Makefile: Add new targets to the Makefile to make the new functionality easier to use.
  • Documentation: Update architectural or development READMEs; use mermaid diagrams where appropriate.

Creator: @Olshansk
Co-Owners: @okdas

@Olshansk Olshansk added this to the Shannon Beta TestNet Launch milestone Aug 16, 2024
@Olshansk Olshansk added this to Shannon Aug 16, 2024
@Olshansk Olshansk added infra Infra or tooling related improvements, additions or fixes tooling Tooling - CLI, scripts, helpers, off-chain, etc... labels Aug 16, 2024
@Olshansk Olshansk moved this to 🔖 Ready in Shannon Aug 16, 2024
@Olshansk
Member Author

Update from @okdas

Hey, I wanted to share a quick update on the permissionless demand load testing effort.
- I decided to do all testing on our testnet. We've got gateway and supplier infrastructure deployed there, and it currently handles just hundreds of requests.
- I don't have any interesting visuals yet. There are some findings:
    - Validator does consume a lot of resources, but it can be a result of a large number of RPC requests to the validator endpoint.
        - I'm going to change that endpoint to the full node so the validator will only validate.
        - Also there might be some room for improvement on how gateway/relayminer queries the data. Will check.
    - Gateways crash often. It might be a resource constraint, but since we are going to have a different gateway (PATH), I'll throw more resources at them instead of performing deep troubleshooting/investigation.
    - Some of the blocks were pretty large for the amount of traffic (2.5 MiB). Will investigate and post findings tomorrow. (recent block - example https://shannon.testnet.pokt.network/poktroll/block/10297)
- Currently in the process of deploying an indexer so we can also get more insight.
- I had issues with creating a lot of services from one address. Same `account sequence mismatch, expected *, got *: incorrect account sequence` issue.
    - For some reason our CLI ignores the `--sequence=` argument.
    - Cosmos 0.51 will have unordered transactions, rendering this a non-issue in the future.
    - The current workaround is to create many addresses, fund them with multi-send, and add services from many accounts at the same time.
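
The many-accounts workaround can be pictured as a simple partitioning step. This is a minimal sketch, assuming a round-robin split; `Assignment` and `AssignServices` are hypothetical names and do not exist in the poktroll CLI. The idea is that each account only ever signs its own transactions in order, so accounts can submit in parallel without racing on a shared sequence number.

```go
package main

import "fmt"

// Assignment maps one funded account to the services it will create.
type Assignment struct {
	Account  string
	Services []string
}

// AssignServices spreads serviceIDs round-robin across accounts.
// Each account then submits its own transactions one after another,
// while different accounts can submit concurrently.
func AssignServices(accounts, serviceIDs []string) []Assignment {
	if len(accounts) == 0 {
		return nil
	}
	out := make([]Assignment, len(accounts))
	for i, a := range accounts {
		out[i].Account = a
	}
	for i, id := range serviceIDs {
		idx := i % len(accounts)
		out[idx].Services = append(out[idx].Services, id)
	}
	return out
}

func main() {
	// Placeholder account and service names, for illustration only.
	accounts := []string{"acct-0", "acct-1", "acct-2"}
	services := []string{"svc-0", "svc-1", "svc-2", "svc-3", "svc-4", "svc-5", "svc-6"}
	for _, a := range AssignServices(accounts, services) {
		fmt.Printf("%s creates %v\n", a.Account, a.Services)
	}
}
```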

@okdas
Member

okdas commented Sep 9, 2024

Performed more testing last week and ended up breaking the infrastructure around the validator's RPC.

To mitigate, I deployed and staked two more validators.
Will rerun the largest test yet with the relayminers pointed directly at a different node (without the load balancer and ingress-nginx).

@okdas okdas moved this from 🔖 Ready to 🏗 In progress in Shannon Sep 9, 2024
@okdas
Member

okdas commented Sep 30, 2024

Last time we synced on this, we decided to:

  • Focus on performing a load-test locally on our machines before doing larger tests on TestNet.
  • Bring PATH to LocalNet.

Since any somewhat large load test currently breaks the network (#841), I'll be focusing on secondary goals: observability (lots of changes in #832) and deploying PATH on TestNet.

@okdas
Member

okdas commented Oct 29, 2024

I have been running into this issue during load testing lately; will see if it is low-hanging fruit.


{"level":"info","session_end_height":30,"claim_window_open_height":32,"message":"waiting & blocking until the earliest claim commit height offset seed block height"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","message":"observed earliest claim commit height offset seed block height"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","earliest_claim_commit_height":32,"message":"waiting & blocking until the earliest claim commit height for this supplier"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","earliest_claim_commit_height":32,"message":"observed earliest claim commit height"}
{"level":"info","app_addr":"pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4","service_id":"anvil","session_id":"cb5157c91af08f0d126765b9279f2b0891ef5a56e64d50f396b2273a9464240b","supplier_operator_addr":"pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj","message":"created a new claim"}
{"level":"error","error":"with hash 0d82ff8b8e65935dae1ed423c9f4e8aa29b2036df3de78d0aea43d07b1e8a1f2: failed to execute message; message index: 0: rpc error: code = FailedPrecondition desc = current block height (37) is greater than session claim window close height (36): claim attempted outside of the session's claim window: tx timed out","message":"failed to create claims"}
{"level":"error","error":"with hash 0d82ff8b8e65935dae1ed423c9f4e8aa29b2036df3de78d0aea43d07b1e8a1f2: failed to execute message; message index: 0: rpc error: code = FailedPrecondition desc = current block height (37) is greater than session claim window close height (36): claim attempted outside of the session's claim window: tx timed out"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3c419a4]

goroutine 276 [running]:
github.com/pokt-network/poktroll/pkg/relayer/session.(*sessionTree).Delete(0x400192a6e0)
	/Users/dk/pocket/poktroll/pkg/relayer/session/sessiontree.go:285 +0x344
github.com/pokt-network/poktroll/pkg/relayer/session.(*relayerSessionsManager).deleteExpiredSessionTreesFn.func1({0x52d5290, 0x400197fd40}, {0x4000e76a00, 0x1, 0x1})
	/Users/dk/pocket/poktroll/pkg/relayer/session/session.go:478 +0x278
github.com/pokt-network/poktroll/pkg/observable/channel.ForEach[...].func1({0x4000e76a00, 0x1, 0x1})
	/Users/dk/pocket/poktroll/pkg/observable/channel/map.go:103 +0x6c
github.com/pokt-network/poktroll/pkg/observable/channel.goMapTransformNotification[...]({0x52d5290, 0x400197fd40}, {0x52ce590, 0x4000b71ec0}, 0x400013f860, 0x400013f8c0, 0x4000b9c9a0)
	/Users/dk/pocket/poktroll/pkg/observable/channel/map.go:125 +0xc4
created by github.com/pokt-network/poktroll/pkg/observable/channel.Map[...] in goroutine 1
	/Users/dk/pocket/poktroll/pkg/observable/channel/map.go:24 +0x318
[event: pod relayminer1-687547c69f-lvc5h] Container image "poktrolld:tilt-c8d80bb2e7daf0e1" already present on machine
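
The claim failure in the log above comes down to a height check: the claim window opened at height 32 and closed at 36, and the transaction landed at height 37. A minimal sketch of that check follows. `InClaimWindow` and `BlocksLeft` are hypothetical helpers, not poktroll functions, and treating both bounds as inclusive is an assumption; the log only confirms that a height greater than the close height is rejected.

```go
package main

import "fmt"

// InClaimWindow reports whether height lies in [openHeight, closeHeight].
// Inclusive bounds are an assumption based on the error message above.
func InClaimWindow(height, openHeight, closeHeight int64) bool {
	return height >= openHeight && height <= closeHeight
}

// BlocksLeft returns how many blocks, up to and including closeHeight,
// can still accept the claim, or 0 once the window has closed.
func BlocksLeft(height, closeHeight int64) int64 {
	if height > closeHeight {
		return 0
	}
	return closeHeight - height + 1
}

func main() {
	const openHeight, closeHeight = 32, 36 // values taken from the log above
	for h := int64(32); h <= 37; h++ {
		fmt.Printf("height=%d in_window=%t blocks_left=%d\n",
			h, InClaimWindow(h, openHeight, closeHeight), BlocksLeft(h, closeHeight))
	}
}
```

Under load, a relayminer could use a check like this to skip or log a claim that can no longer land in time, rather than submitting it and waiting for the transaction to time out.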

@okdas
Member

okdas commented Oct 29, 2024

Okaaay, seems like there's another issue that breaks the network that we are going to need to address before the upgrade. Looking into this as well:


12:34AM INF Timed out dur=14979.481981 height=60 module=consensus round=0 step=RoundStepNewHeight
12:34AM INF received proposal module=consensus proposal="Proposal{60/0 (E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918:1:7CE673CD6F5A, -1) 3376100465F5 @ 2024-10-29T00:34:58.805851558Z}" proposer=A6B0BAD7039843C118CFC588D5A6D38C459B9C25
12:34AM INF received complete proposal block hash=E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918 height=60 module=consensus
12:34AM INF finalizing commit of block hash=E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918 height=60 module=consensus num_txs=0 root=8CF58F38B7F1DC22E6E227E7F74885A80B061E11ED20CA106E2E513553BF7113
12:34AM INF Stored block hash at height 60 EndBlock=SessionModuleEndBlock module=x/session
12:34AM INF found 1 expiring claims at block height 60 method=SettlePendingClaims module=x/tokenomics
12:34AM INF claim does not require proof due to claimed amount (1048950upokt) being less than the threshold (20000000upokt) and random sample (0.35) being greater than probability (0.25) method=proofRequirementForClaim module=server
12:34AM INF Claim by supplier pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj IS WITHIN LIMITS of servicing application pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4. Max claimable amount >= Claim amount: 6663868upokt >= 1048950 application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 helper=ensureClaimAmountLimits method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF About to start processing TLMs for (24975) compute units, equal to (1048950upokt) claimed actual_settlement_upokt=1048950upokt application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF Starting TLM processing: "TLMRelayBurnEqualsMint" actual_settlement_upokt=1048950upokt application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF sent 1048950upokt from the supplier module to the supplier shareholder with address "pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj" method=distributeSupplierRewardsToShareHolders module=x/tokenomics
12:34AM INF distributed 1048950 uPOKT to supplier "pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj" shareholders method=distributeSupplierRewardsToShareHolders module=x/tokenomics
12:34AM ERR error processing token logic modules for claim "77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c": TLM "TLMRelayBurnEqualsMint": burning 1048950upokt from the application module account: spendable balance 958026upokt is smaller than 1048950upokt: insufficient funds [cosmos/[email protected]/x/bank/keeper/send.go:278]: failed to burn uPOKT from application module account [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155]: failed to process TLM [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155] claimed_upokt=1048950upokt module=server num_claim_compute_units=24975 num_estimated_compute_units=24975 num_relays_in_session_tree=24975 proof_requirement=NOT_REQUIRED session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator_address=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM ERR could not settle pending claims due to error TLM "TLMRelayBurnEqualsMint": burning 1048950upokt from the application module account: spendable balance 958026upokt is smaller than 1048950upokt: insufficient funds [cosmos/[email protected]/x/bank/keeper/send.go:278]: failed to burn uPOKT from application module account [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155]: failed to process TLM [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155] method=EndBlocker module=x/tokenomics
12:34AM ERR CONSENSUS FAILURE!!! err="runtime error: invalid memory address or nil pointer dereference" module=consensus stack="goroutine 180 [running]:\nruntime/debug.Stack()\n\t/opt/homebrew/Cellar/go/1.23.2/libexec/src/runtime/debug/stack.go:26 +0x64\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:801 +0x4c\npanic({0x3f299c0?, 0x713b210?})\n\t/opt/homebrew/Cellar/go/1.23.2/libexec/src/runtime/panic.go:785 +0xf0\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock.func1()\n\t/Users/dk/go/pkg/mod/github.com/cosmos/[email protected]/baseapp/abci.go:860 +0x124\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock(0x4000223208, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cosmos/[email protected]/baseapp/abci.go:892 +0x374\ngithub.com/cosmos/cosmos-sdk/server.cometABCIWrapper.FinalizeBlock({{0xffff74564168, 0x4001081308}}, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cosmos/[email protected]/server/cmt_abci.go:44 +0x54\ngithub.com/cometbft/cometbft/abci/client.(*localClient).FinalizeBlock(0x400185df20, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/abci/client/local_client.go:185 +0xf8\ngithub.com/cometbft/cometbft/proxy.(*appConnConsensus).FinalizeBlock(0x40015806a8, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/proxy/app_conn.go:104 +0x1d0\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).applyBlock(_, {{{0xb, 0x0}, {0x40013a2cb9, 0x7}}, {0x40013a2ce0, 0x8}, 0x1, 0x3b, {{0x400534e5a0, ...}, ...}, ...}, ...)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/state/execution.go:224 +0x3c0\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).ApplyVerifiedBlock(_, {{{0xb, 0x0}, {0x40013a2cb9, 0x7}}, {0x40013a2ce0, 0x8}, 0x1, 0x3b, {{0x400534e5a0, ...}, ...}, ...}, 
...)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/state/execution.go:202 +0xd8\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0x4001729188, 0x3c)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1772 +0xd50\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0x4001729188, 0x3c)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1682 +0x2c0\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1617 +0xb8\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0x4001729188, 0x3c, 0x0)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1655 +0xd90\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0x4001729188, 0x4002e89a00, {0x0, 0x0})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:2335 +0x26c0\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0x4001729188, 0x4002e89a00, {0x0, 0x0})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:2067 +0x50\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0x4001729188, {{0x529e7c0, 0x40016261d8}, {0x0, 0x0}})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:929 +0x5c0\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0x4001729188, 0x0)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:856 +0x5fc\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 1\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:398 +0x1e4\n"
12:34AM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/root/.poktroll/data/cs.wal/wal
12:34AM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/root/.poktroll/data/cs.wal/wal
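
Two pieces of arithmetic are visible in the log above: the proof requirement decision, and the burn that exceeds the application module's spendable balance. The sketch below reproduces both with the logged numbers. `ProofRequired` and `BurnShortfall` are hypothetical helpers written for illustration, not the on-chain implementation, and the behavior at the exact boundaries (equal to the threshold, equal to the probability) is an assumption.

```go
package main

import "fmt"

// ProofRequired mirrors the logged reasoning: a proof is skipped only when
// the claimed amount is below the threshold AND the random sample is above
// the proof request probability. Boundary handling is an assumption.
func ProofRequired(claimedUpokt, thresholdUpokt uint64, sample, probability float64) bool {
	return claimedUpokt >= thresholdUpokt || sample <= probability
}

// BurnShortfall returns how many uPOKT are missing for a burn of burnUpokt
// given spendableUpokt, or 0 when the balance is sufficient.
func BurnShortfall(spendableUpokt, burnUpokt uint64) uint64 {
	if burnUpokt <= spendableUpokt {
		return 0
	}
	return burnUpokt - spendableUpokt
}

func main() {
	// Values taken from the log above.
	fmt.Println(ProofRequired(1_048_950, 20_000_000, 0.35, 0.25)) // false
	fmt.Println(BurnShortfall(958_026, 1_048_950))                // 90924
}
```

With the logged values the burn is 90,924 uPOKT short. The log shows that insufficient-funds error immediately before the consensus failure.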

@Olshansk
Member Author

@red-0ne Can you soft-confirm if the last one should be solved by the PRs you have open right now?

If so:

  1. Which one?
  2. Double-checking that there's on-chain safety against this?

@okdas
Member

okdas commented Nov 18, 2024

I can confirm that the last issue has been addressed. I didn't have a chance to run a larger test last week, so I need to do it this week.

@okdas
Member

okdas commented Jan 28, 2025

I haven't posted updates here in a while. Most of the current issues are tracked in the Notion document.

Projects
Status: 🏗 In progress
Development

No branches or pull requests

2 participants