Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gaia 3.0 takes a long time to start on migrated hub3 state #7682

Closed
4 tasks
zmanian opened this issue Oct 26, 2020 · 32 comments · Fixed by #7764
Closed
4 tasks

Gaia 3.0 takes a long time to start on migrated hub3 state #7682

zmanian opened this issue Oct 26, 2020 · 32 comments · Fixed by #7764
Assignees
Labels

Comments

@zmanian
Copy link
Member

zmanian commented Oct 26, 2020

We are using the https://github.com/cosmos/gaia/tree/jack/gaiav3.0 branch

This is a snapshot of block 3557667 of the Cosmoshub. This export is available here: https://storage.googleapis.com/stargate-genesis/3557667.cosmos_hub_3.json

There is a full copy of a cosmos hub full node here: https://storage.googleapis.com/stargate-genesis/snapshot.tgz

Using Gaia 2.0 and this cosmos node above, gaiad export > 3557667.cosmos_hub_3.json

Using the migrate public keys from the Stargate repo:

build/gaiad migrate ~/3557667.cosmos_hub_3.json --chain-id=cosmoshub-test-stargate --initial-height 3557668 

This genesis file can be downloaded from https://storage.googleapis.com/stargate-genesis/cosmoshub-test-stargate.json

gaiad start takes about 2 hours for InitChain to run.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@zmanian zmanian added the T:Bug label Oct 26, 2020
@aaronc
Copy link
Member

aaronc commented Oct 26, 2020

Looks like we need to do some profiling.

@clevinson clevinson added this to the v0.40 [Stargate] milestone Oct 26, 2020
@robert-zaremba
Copy link
Collaborator

I will try to look at it.
I guess the full copy of a cosmos hub full node (100GB+) is not needed?

This genesis file can be downloaded from https://storage.googleapis.com/stargate-genesis/cosmoshub-test-stargate.json

@zmanian What do you mean with a genesis file? Is this a migrated snapshot?

@zmanian
Copy link
Member Author

zmanian commented Oct 27, 2020

Yes that file is a migrated snapshot.

@aaronc
Copy link
Member

aaronc commented Oct 27, 2020

I will try to look at it.

I guess the full copy of a cosmos hub full node (100GB+) is not needed?

No, just this migrates genesis file.

@zmanian
Copy link
Member Author

zmanian commented Oct 27, 2020

One reason for the snapshot is that for if someone wants to redo the export as part of the testing process.

This export was not a "zero height" export. There is more staking state to process as result and that might be a factor in the long start up time.

@robert-zaremba
Copy link
Collaborator

I started looking at it today:

  1. simapp from master fails to load the JSON file
  2. gaia takes several minutes to parse the JSON, and lot of time to finish the state sync (I killed it after 3+h)
  3. I started porting the gaia into a branch of cosmos-sdk, to allow easy testing.

What's the expected time to finish the state sync normally (what was the state import time on the previous update) ? This process consumes 2CPU threads all the time.

@robert-zaremba
Copy link
Collaborator

It's also worth to mention that the 2 CPU threads I mentioned above are mostly used by: gnome-keyring-daemon and gaiad.
Shouldn't the key be loaded into the runtime and freeing the key manager? Probably this depends on the key manager. Some key managers will not give any access to its private keys.
Also, it's strange, because when doing a sync, we don't need a private key.... so we shouldn't need to access a key mananger.

@robert-zaremba
Copy link
Collaborator

UPDATE, I didn't a second run and killed the keyring daemon. It was started by the simapp and for some reason was keeping 100% busy. (I was running simapp and gaia in parallel).

@zmanian
Copy link
Member Author

zmanian commented Oct 28, 2020

Typically what we've done to debug these regressions is enable cpu profiling in the binary.

https://blog.golang.org/pprof

@alexanderbez
Copy link
Contributor

All we need is a simple pprof here to see where the bottleneck is. Loading genesis state into memory shouldn't take too long (we've done this on previous upgrades).

@robert-zaremba
Copy link
Collaborator

Note about profiling options

Cosmos-SDK app has an option to enable cpu profile through --cpu-profile. But it doesn't help here because we stuck before the profiling starts.

Preliminaries

The loading get stuck during the Tendermint Node creation -> when it calls the app to initialize the genesis. This starts in server/startInProcess. The reason I'm writing about it is because it proceeds other routines to start, like cpuProfile or rpc.pprof HTTP handler. Because of this, we can't use the builtin profiling options.
Instead, I had to add a new go routine and export the pprof profiles.

I linked the gaiav3 code in the cosmos-sdk module to easily modify the gaia or cosmos-sdk code and build it together.

Following that I added a new goroutine to create cpu pprofile.

Action Points:

I will create a PR to change an order of some of the calls to enable at lest CPU profile before we will starting the TM node.

Profiling

As noted above, the InitChain process is the crux. In attachment you can find 2 SVG files depicting a CPU profile. Unfortunately GitHub doesn't support attaching SVG files, so I had to pack them in a ZIP.
As you can see most of the work in this part is spent in x/bank/keeper InitGenesis . There are essentially 2 calls: ValidateBalance and SetBalances. It's important to note that we call these 2 functions ONLY on all accounts from the snapshot file. We can get rid of the first call, because the second call is doing k.ClearBalances(ctx, addr) on all accounts from the snapshot. This already drops the time by 50%.

Top called functions

Stats for 2min run

Showing top 15 nodes out of 127
      flat  flat%   sum%        cum   cum%
    34.54s 26.59% 26.59%     34.54s 26.59%  aeshashbody
    18.17s 13.99% 40.58%     18.17s 13.99%  memeqbody
    14.99s 11.54% 52.12%     65.20s 50.19%  runtime.mapaccess1_faststr
     5.66s  4.36% 56.47%     13.74s 10.58%  runtime.mapiternext
     5.22s  4.02% 60.49%     12.69s  9.77%  runtime.mallocgc
     4.07s  3.13% 63.63%    104.78s 80.66%  github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
     3.10s  2.39% 66.01%      3.10s  2.39%  cmpbody
     2.93s  2.26% 68.27%      7.37s  5.67%  runtime.scanobject
     2.24s  1.72% 69.99%      2.24s  1.72%  runtime.memclrNoHeapPointers
     2.13s  1.64% 71.63%      2.47s  1.90%  runtime.findObject
     1.88s  1.45% 73.08%      1.89s  1.45%  github.com/enigmampc/btcutil/bech32.bech32Polymod
     1.73s  1.33% 74.41%      5.87s  4.52%  github.com/tendermint/tm-db.IsKeyInDomain
     1.61s  1.24% 75.65%     15.85s 12.20%  runtime.stringtoslicebyte

Stats for 10min run

Showing top 15 nodes out of 84
      flat  flat%   sum%        cum   cum%
   200.77s 31.38% 31.38%    200.77s 31.38%  aeshashbody
    98.02s 15.32% 46.69%     98.02s 15.32%  memeqbody
    91.30s 14.27% 60.96%    384.03s 60.02%  runtime.mapaccess1_faststr
    27.81s  4.35% 65.31%     67.29s 10.52%  runtime.mapiternext
    22.53s  3.52% 68.83%     52.17s  8.15%  runtime.mallocgc
    22.26s  3.48% 72.31%    574.32s 89.75%  github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
    14.99s  2.34% 74.65%     14.99s  2.34%  runtime.evacuated (inline)
    14.54s  2.27% 76.92%     14.54s  2.27%  cmpbody
    11.09s  1.73% 78.66%     31.56s  4.93%  runtime.scanobject
    10.73s  1.68% 80.33%     11.78s  1.84%  runtime.findObject
     7.79s  1.22% 81.55%      7.79s  1.22%  runtime.memclrNoHeapPointers
     7.66s  1.20% 82.75%      7.66s  1.20%  runtime.nextFreeFast
     7.58s  1.18% 83.93%     75.39s 11.78%  runtime.stringtoslicebyte
     6.90s  1.08% 85.01%     61.88s  9.67%  runtime.rawbyteslice
     6.40s  1.00% 86.01%     25.14s  3.93%  github.com/tendermint/tm-db.IsKeyInDomain

Findings:

  • Most of the time is spent on 89% github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
    I didn't check yet what this is doing. Sounds like cache management
  • Lot of time is spent in map function (the builtin map). Looks like we do lot of serialization into maps.
    • aeshashbody is a primary function use in go hash maps (maps in go).
    • I'm curious in particular about this one: 60.02% runtime.mapaccess1_faststr. Lot of map access (probably getting an item from a map).
  • Significant amount of time is spent on memory allocatoin: 9% it's really a lot!
  • 11% when converting string to []byte -- alos a lot!

Next steps

I removed a ClearBalances call and now running a long profiling to find something about other modules. For the moment I have a feeling that there is something wrong in IAVL.

Attachments

cpu-pprof-graphs.zip

@robert-zaremba
Copy link
Collaborator

I modified slightly the code, and it seams that there is something wrong with IAVL iterator (see the new profiler graph). Looks that setting a new balances for each account takes only a fraction of time of removing the balances. Both operations update the merkle paths, so I'm wondering if there were any related changes in IAVL. Is anyone aware about it?

Should we inspect it further here?

cpu-35min-mod.pdf

@zmanian
Copy link
Member Author

zmanian commented Oct 29, 2020

I'm wondering if we should hitting the cacheStore at all during initChain?

@robert-zaremba
Copy link
Collaborator

@alexanderbez here is the latest pprof.
@erikgrinaker - could you advise here if there are any IAVL changes which could be related here?

cpu-35min.zip

@robert-zaremba
Copy link
Collaborator

robert-zaremba commented Oct 29, 2020

@zmanian - so what is the loading time we should anticipate? what is the "baseline"?

I will do a PR with the x/bank optimization.

@zmanian
Copy link
Member Author

zmanian commented Oct 29, 2020

Well a 2 hour node start up time is too much of a usability issue for the hub to upgrade to stargate.

I think we need to complete InitChain < 10 min.

@erikgrinaker
Copy link
Contributor

@erikgrinaker - could you advise here if there are any IAVL changes which could be related here?

Well, there have been a lot of changes since the version the Hub is currently running, so it's hard to say. But we know that IAVL has pretty poor performance in general, so it's not entirely unexpected that importing the state is taking a long time. I'll dig into this a bit more tomorrow and see if there's any low-hanging fruit.

I'll also mention that we now have export/import APIs in IAVL, which should be much faster at importing data than creating nodes one-by-one, but this was introduced in more recent versions than the Hub is running.

@alexanderbez
Copy link
Contributor

@alexanderbez here is the latest pprof.
@erikgrinaker - could you advise here if there are any IAVL changes which could be related here?

cpu-35min.zip

So a bulk of this time is coming from x/crisis invariant assertion, which operators are not obligated to run. Given that it takes so long to run, it is very worrisome. Thankfully it's not serialization related like it was in the past with Amino, but now something to do with IAVL most likely.

We should run a profile @robert-zaremba with invariant checking off. I believe you can pass this as a CLI flag or config? If not, just hard-code it to zero during app construction.

@alexanderbez
Copy link
Contributor

cpu-35min.zip

@zmanian
Copy link
Member Author

zmanian commented Oct 29, 2020

If we could make invariant checking optional during node startup we might be okay

@robert-zaremba
Copy link
Collaborator

We should run a profile @robert-zaremba with invariant checking off. I believe you can pass this as a CLI flag or config? If not, just hard-code it to zero during app construction.

I already did that earlier today. copy-pasting the response from discord:
Without invariant checking it's much faster. 10-15min. Most of the work is is done by an IAVL iterator in both the bank and crisis
Invariants check - on my computer it's about 30min

@zmanian - currently it's not configurable to disable the invariant checks. I will add that option tomorrow.

@robert-zaremba
Copy link
Collaborator

@erikgrinaker - I don't know if directly loading to IAVL is possible. We are loading a snapshot using JSON to the modules. That JSON is converted upfront - the object serialization in 0.40.x changed.

@erikgrinaker
Copy link
Contributor

@erikgrinaker - I don't know if directly loading to IAVL is possible. We are loading a snapshot using JSON to the modules. That JSON is converted upfront - the object serialization in 0.40.x changed.

Ah, right, good point. Direct IAVL import would basically require the data to remain unchanged, since the exported data contains the internal IAVL tree structure. It might be possible to change the values as long as keys don't change, but I wouldn't really recommend modifying it.

@zmanian
Copy link
Member Author

zmanian commented Oct 29, 2020

okay. I think 10-15 min is survivable.

@zmanian
Copy link
Member Author

zmanian commented Oct 29, 2020

thanks guys!

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Oct 30, 2020

I had a closer look, and as far as I can tell this isn't related to IAVL at all, but rather the cachekv package in the SDK. There is some CPU usage of tm-db.IsKeyInDomain, but interestingly this function is not used at all in tm-db or IAVL (not sure why it's even there), it is only used in the SDK's cachekv package.

Let me know if I've missed anything, or you'd like a hand with optimizing cachekv.

@robert-zaremba
Copy link
Collaborator

Thanks @erikgrinaker . I will see what's in cachekv. Is new for me.

@robert-zaremba
Copy link
Collaborator

@zmanian , @jackzampolin - I've pushed the update both on the gaia and cosmos-sdk branch to make the crisis invariants check customizable. This is done in the app constructor - NewGaiaApp:

crisis.NewAppModule(&app.CrisisKeeper, false),

Let me know if this is OK, or if you want to make it customizable through a CLI parameter.
With invariants disabled, the genesis loading takes ~10min.

@alexanderbez
Copy link
Contributor

Interesting. I'll take a look at cachekv. But I don't recall any changes to the core kv store logic, so this is concerning.

@robert-zaremba
Copy link
Collaborator

Here is a PR for the Cosmos-SDK (new branch, previous one was causing conflicts to merge to master): #7764

@robert-zaremba
Copy link
Collaborator

Here is the Gaia-3 PR: cosmos/gaia#487

@robert-zaremba
Copy link
Collaborator

Here is the followup task: #7766

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment