Hydra (simplified) Tail Simulation + Chain Analysis #16

Merged
merged 27 commits into master from Ktorz/hydra-tail-simulation
May 21, 2021

Conversation

KtorZ
Contributor

@KtorZ KtorZ commented May 18, 2021

(Simplified) Tail Protocol Simulation

See exe/tail.

The tail protocol simulation works in two steps: preparation and run.

Preparation

The preparation generates client events from a set of parameters. That is, one can create a simulation over a certain period of time, for a certain number of clients with a certain behavior. For example:

$ hydra-tail-simulation prepare \
  --number-of-clients 1000 \
  --duration 60 \
  --client-online-likelihood 50%100 \
  --client-submit-likelihood 10%100 \
  events.csv

PrepareOptions
    { numberOfClients = 1000
    , duration = SlotNo 60
    , clientOptions = ClientOptions
        { onlineLikelihood = 1 % 2
        , submitLikelihood = 1 % 10
        }
    }

The events can then be fed into the simulation for execution. Keeping both steps separate allows creating events from other sources (e.g. from a real network), while the prepare command can be used to establish a baseline on simple patterns. Note that the prepare command is fully deterministic: the same options yield exactly the same events.
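To illustrate that determinism, here is a minimal sketch of a seeded event generator. The mulberry32 PRNG, the fixed seed, and the event shape are assumptions for illustration only, not the tool's actual implementation; likelihoods are given as plain fractions (0.5 for 50%100):

```javascript
// mulberry32: a small seeded PRNG, so the same seed always yields the
// same sequence of "random" numbers (hence deterministic preparation).
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Sketch of `prepare`: for each slot and each client, draw whether the
// client comes online (emits a Pull) and, if online, whether it also
// submits a transaction (emits a NewTx).
function prepare({ numberOfClients, duration, onlineLikelihood, submitLikelihood }, seed = 42) {
  const rand = mulberry32(seed);
  const events = [];
  for (let slot = 0; slot < duration; slot++) {
    for (let clientId = 1; clientId <= numberOfClients; clientId++) {
      if (rand() < onlineLikelihood) {
        events.push({ slot, from: clientId, msg: "Pull" });
        if (rand() < submitLikelihood) {
          events.push({ slot, from: clientId, msg: "NewTx" });
        }
      }
    }
  }
  return events;
}
```

Running this twice with the same options produces byte-for-byte identical event lists, which is the property the real prepare command guarantees.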

Execution

To run a simulation, simply provide an events dataset with possibly some custom options for the server:

$ hydra-tail-simulation run \
  --slot-length 1s \
  --server-region LondonAWS \
  --server-read-capacity 102400 \
  --server-write-capacity 102400 \
  events.csv

RunOptions
    { slotLength = 1 s
    , serverOptions = ServerOptions
        { region = LondonAWS
        , readCapacity = 102400 KBits/s
        , writeCapacity = 102400 KBits/s
        }
    }
SimulationSummary
    { numberOfClients = 1000
    , numberOfEvents = 33567
    , lastSlot = SlotNo 60
    }
Analyze
    { realThroughput = 50.5261490853439
    , maxThroughput = 50.583333333333336
    , numberOfTransactions = 12491
    }

The simulation outputs two numbers: a maximum throughput and a real throughput. The real throughput is computed from the (simulated) time it took to run the simulation, whereas the maximum throughput is the best the server could achieve given the inputs (said differently, the actual traffic generated by all clients).
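The two numbers can be sketched as simple ratios. This is a hedged illustration: `realDuration` and `idealDuration` are assumed input names, not the simulator's actual fields:

```javascript
// Throughput as transactions per (simulated) second.
// realDuration:  how long the simulated run actually took, in seconds.
// idealDuration: the best-case duration given the input traffic alone.
function throughputs({ numberOfTransactions, realDuration, idealDuration }) {
  return {
    realThroughput: numberOfTransactions / realDuration,
    maxThroughput: numberOfTransactions / idealDuration,
  };
}
```

Since the real run can only be as fast as or slower than the ideal one, the real throughput is always bounded above by the maximum throughput, as in the sample output above (50.53 vs 50.58).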


Hydra Tail Simulation Scripts

This folder contains a few scripts that can be used to generate datasets to inject into the simulation. It works as a pipeline of Node.js streams using real blockchain data obtained from the Cardano mainnet.
Why Node.js? Because JavaScript and JSON are quite convenient to rapidly prototype something and transform data on the fly.

How to use

$ yarn install
$ yarn pipeline 1000 10

The first argument given to pipeline corresponds to the number of clients considered for generating events, whereas the second corresponds to the compression rate of the chain (10 means that only 1 slot out of every 10 is counted).
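Assuming the compression rate simply rescales slot numbers (a sketch; the actual pipeline step may differ), the effect on a slot can be pictured as:

```javascript
// With rate 10, ten consecutive chain slots collapse into one simulated
// slot, so the timeline is compressed by a factor of `rate`.
function compressSlot(slot, rate) {
  return Math.floor(slot / rate);
}
```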

NOTE (1): If you haven't downloaded the chain locally, you'll need to install and set up an Ogmios server to download blocks from the chain. The script assumes a local instance up and running with the default configuration.

NOTE (2): The entire Cardano chain since the beginning of Shelley spreads across ~1.2M blocks. The various intermediate representations are quite voluminous, but the final output is quite compact (it is a CSV file). On a decent CPU, it takes about 3 minutes to run the whole pipeline with a new set of parameters, assuming the blockchain has already been downloaded.

NOTE (3): The pipeline is single-core, but multiple pipelines can be run at once to help generate multiple datasets with different parameters. The output filenames are automatically generated from the script's arguments.

Steps Overview

  1. downloadChain (~1.2M blocks)

    Downloads the blockchain from a certain point (by default, from the first Shelley block onwards). It downloads the chain both into a file and into a readable stream passed to the rest of the pipeline, so that (a) the script runs in roughly constant memory, and (b) the pipeline produces data immediately.

    The file produced is rather voluminous (4.5GB+) and will contain line-separated JSON blocks like the following (formatted over multiple lines for readability):

    {
        "headerHash": "b51b1605cc27b0be3a1ab07dfcc2ceb0b0da5e8ab5d0cb944c16366edba92e83",
        "header": {
            "blockHeight": 4490515,
            "slot": 4492900,
            "prevHash": "23fd3b638e8f286978681567d52597b73f7567e18719cef2cbd66bba31303d98",
            "issuerVk": "5fddeedade2714d6db2f9e1104743d2d8d818ecddc306e176108db14caadd441",
            "issuerVrf": "axwYeh90N9B55BQwtqn8eymybovJxGco5VE6kwTyIm8=",
            "blockSize": 1053,
            "blockHash": "f8ffe66aeeac127f30b8672857c4f6b8cb29c9ed24267104619a985105e22ba0"
        },
        "body": [
            {
                "id": "79acf08126546b68d0464417af9530473b8c56c63b2a937bf6451e96e55cb96a",
                "body": {
                    "inputs": [
                        {
                            "txId": "397eb970e7980e6ac1eb17fcb26a8df162db4e101f776138d74bbd09ad1a9dee",
                            "index": 0
                        },
                        ...
                    ],
                    "outputs": [
                        {
                            "address": "addr1qx2kd28nq8ac5prwg32hhvudlwggpgfp8utlyqxu6wqgz62f79qsdmm5dsknt9ecr5w468r9ey0fxwkdrwh08ly3tu9sy0f4qd",
                            "value": 402999781127
                        },
                        ...
                    ],
                    "certificates": [],
                    "withdrawals": {},
                    "fee": 218873,
                    "timeToLive": 4500080,
                    "update": null
                },
                "metadata": {
                    "hash": null,
                    "body": null
                }
            }
        ]
    }
  2. viewViaStakeKeys (~5M transactions & ~700K wallets)

    Extracts transactions from each block and transforms them so that inputs and outputs are directly associated with their corresponding stake keys. Since the beginning of the Shelley era,
    most wallets in Cardano use full delegation addresses containing both a payment part and a delegation part, but use a single stake key per wallet. Thus, by looking at stake key hashes from
    addresses, it is possible to track down (Shelley) wallets with fairly good accuracy. This second step does exactly that, while also trimming out information that isn't useful for the simulation. This stream transformer produces chunks of line-separated JSON "transactions" which look like the following (formatted over multiple lines for readability):

    {
        "ref": "79acf08126546b68d0464417af9530473b8c56c63b2a937bf6451e96e55cb96a",
        "size": 1443,
        "inputs": [
            null,
            null,
            null,
            null,
            null
        ],
        "outputs": [
            {
                "wallet": "f79qsdmm5dsknt9ecr5w468r9ey0fxwkdrwh08ly3tu9s",
                "value": 402999781127
            },
            {
                "wallet": "f79qsdmm5dsknt9ecr5w468r9ey0fxwkdrwh08ly3tu9s",
                "value": 39825492736
            },
            {
                "wallet": "f79qsdmm5dsknt9ecr5w468r9ey0fxwkdrwh08ly3tu9s",
                "value": 1999822602
            },
            {
                "wallet": "f79qsdmm5dsknt9ecr5w468r9ey0fxwkdrwh08ly3tu9s",
                "value": 1000000
            },
            {
                "wallet": "f79qsdmm5dsknt9ecr5w468r9ey0fxwkdrwh08ly3tu9s",
                "value": 100000000000
            }
        ],
        "slot": 4492900
    }

    Inputs or outputs marked as null correspond either to Byron addresses or to Shelley addresses with no stake part whatsoever.
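    The wallet ids in the example above are, in effect, the stretch of bech32 characters encoding the address's delegation part. A hedged sketch of the extraction (a character-level approximation for standard base addresses; a real implementation would bech32-decode the address and extract the 28-byte stake key hash):

```javascript
// Map an address to its (approximate) wallet, or null when no stake
// part can be recovered.
function walletOf(address) {
  // Byron addresses don't use bech32 "addr1..."; enterprise addresses
  // (no delegation part) are shorter than full base addresses.
  if (!address.startsWith("addr1") || address.length < 103) return null;
  // ASSUMPTION: slice out the bech32 characters covering the stake
  // credential, skipping the prefix/payment part and the 6-char checksum.
  return address.slice(52, address.length - 6);
}
```

Applied to the sample output above, this yields the same 45-character wallet id shown for each output.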

  3. createEvents (~ XXX client events)

This step consumes the stream of transactions created by viewViaStakeKeys and creates Hydra Tail Simulation client events by assigning client ids to each stake key.
The number of client ids is limited, however, and rotates: the effect is a long stream of transactions spread across a vastly smaller set of wallets / clients (the
main chain has about ~700,000 wallets identified by stake keys, and this pipeline step compresses them down to ~1,000). It also gets rid of unknown inputs / outputs and
keeps transactions even simpler.

It generates line-separated JSON events such as the following (note that there's always a 'Pull' event added for every 'NewTx'):

{"slot":0,"from":986,"msg":"Pull"}
{"slot":0,"from":986,"msg":{"NewTx":{"ref":"f746a18d6a17acf111109ff9a35a8c4bd130f73697188edd2d367cea5efe98a2","size":297,"recipients":[987],"amount":1002000000}}}
{"slot":1,"from":732,"msg":"Pull"}
{"slot":1,"from":732,"msg":{"NewTx":{"ref":"9e383d78de88fed8e222480f2f24766aa919038e3d238afb40d383e3e5069675","size":297,"recipients":[733],"amount":10000000}}}
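The rotating client-id assignment can be sketched as follows. The wrap-around scheme is an assumption; the real step may assign ids differently:

```javascript
// Map ~700K stake keys down to a fixed pool of client ids. Each newly
// seen wallet takes the next id, wrapping around, so many wallets end
// up sharing each client id while a given wallet keeps a stable id.
function makeClientIdAssigner(numberOfClients) {
  const known = new Map();
  let next = 0;
  return function clientIdOf(wallet) {
    if (!known.has(wallet)) {
      known.set(wallet, next % numberOfClients);
      next += 1;
    }
    return known.get(wallet);
  };
}
```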
  4. lineSeparatedFile

This final step formats events as CSV, one event per line, in a rather compact format. Note that it also drops the transaction reference to save space, since a unique
identifier can be derived from a simple counter / the line number of the corresponding event.

The final format looks like the following:

slot  , clientId , event  , size , amount      , recipients
63025 , 28       , pull   ,      ,             ,
63025 , 28       , new-tx , 297  , 2000000     , 632
63031 , 156      , pull   ,      ,             ,
63031 , 156      , new-tx , 5212 , 1391209719  ,
63031 , 157      , pull   ,      ,             ,
63031 , 157      , new-tx , 232  , 148834411   , 158
63034 , 942      , pull   ,      ,             ,
63034 , 942      , new-tx , 320  , 23000000000 ,
63037 , 772      , pull   ,      ,             ,
63037 , 772      , new-tx , 287  , 5455000055  ,

The last column, recipients, contains a space-separated list of recipients (or subscribers) for a particular transaction. It may contain zero, one, or many elements.
The size, amount and recipients columns are always empty for pull events.
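The serialization step can be sketched from the column layout above (the helper name and the JSON event shape are taken from the earlier createEvents examples; this is an illustration, not the script's actual code):

```javascript
// One event per CSV line. Pull events leave size/amount/recipients
// empty; the transaction reference is dropped, since a counter or the
// line number can serve as a unique identifier.
function toCsvLine(event) {
  if (event.msg === "Pull") {
    return [event.slot, event.from, "pull", "", "", ""].join(",");
  }
  const { size, amount, recipients } = event.msg.NewTx;
  return [event.slot, event.from, "new-tx", size, amount, recipients.join(" ")].join(",");
}
```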

KtorZ added 21 commits May 7, 2021 15:32
  This doesn't yet impact the server behavior much because the server multiplexer isn't aware of whether a client is online or offline. So we need to make sure that the server can notice a client disconnection and act accordingly. We could perhaps model a client disconnection / reconnection with a 0-size network message?
  - Now go offline _immediately_ after waking up
  - Now have a non-zero probability of not sending any transaction when they go online
…t at the right time

  This speeds up the pipeline from ~21 hours down to ~3 minutes. I also generated new datasets with higher compression rates (1:10000 & 1:100000)
@KtorZ KtorZ requested a review from kantp May 18, 2021 13:57
@KtorZ KtorZ self-assigned this May 18, 2021
Contributor

@kantp kantp left a comment
We did a call to go through the code and review it.

Great work, thank you Matthias!

KtorZ added 4 commits May 19, 2021 10:34
…nect server handlers.

  Also renamed 'subscribers' to 'recipients' to make it a bit clearer.
  There was initially a concept of subscriptions understood from the
  draft Tail paper, but this is still a blurry concept and we ended up
  associating transactions with their recipients directly.
@KtorZ
Contributor Author

KtorZ commented May 19, 2021

Corrected a few points as discussed in this morning's review & in yesterday's call with the research team:

  • 40a0cd9
    📍 Add missing simulated lookup computations in Pull, Connect and Disconnect server handlers.
    Also renamed 'subscribers' to 'recipients' to make it a bit clearer.
    There was initially a concept of subscriptions understood from the
    draft Tail paper, but this is still a blurry concept and we ended up
    associating transactions with their recipients directly.

  • 82d9f14
    📍 Plot transaction volume in USD using conversion rates from CoinGecko

  • e2274b3
    📍 Fix off-by-one error on time calculation.

  • 35b5f7c
    📍 discard self made or byron transactions from the datasets.

  - Add a '--concurrency' option to the command-line
  - Measure actual network usage (read / write) in the simulation analysis
  - Some renaming (real -> actual) + moved numberOfTransactions from the analysis to the simulation summary
@KtorZ KtorZ merged commit 07058a9 into master May 21, 2021
@KtorZ KtorZ deleted the Ktorz/hydra-tail-simulation branch May 21, 2021 07:24