Gaia 3.0 takes a long time to start on migrated hub3 state #7682
Looks like we need to do some profiling. |
I will try to look at it.
@zmanian What do you mean by a genesis file? Is this a migrated snapshot? |
Yes that file is a migrated snapshot. |
No, just this migrated genesis file. |
One reason for providing the snapshot is so that someone can redo the export as part of the testing process. This export was not a "zero height" export, so there is more staking state to process, and that might be a factor in the long start-up time. |
I started looking at it today:
What's the expected time to finish the state sync normally (what was the state import time on the previous update)? This process keeps 2 CPU threads busy all the time. |
It's also worth mentioning that the 2 CPU threads I mentioned above are mostly used by: |
UPDATE: I did a second run and killed the keyring daemon. It was started by the simapp and for some reason was staying 100% busy. (I was running simapp and gaia in parallel.) |
Typically, what we've done to debug these regressions is enable CPU profiling in the binary. |
All we need is a simple pprof here to see where the bottleneck is. Loading genesis state into memory shouldn't take too long (we've done this on previous upgrades). |
Note about profiling options
Cosmos-SDK app has an option to enable cpu profile through
Preliminaries
The loading gets stuck during the Tendermint Node creation, when it calls the app to initialize the genesis. This starts in server/startInProcess. The reason I'm writing about it is that it precedes the start of other routines, like
I linked the gaiav3 code in the cosmos-sdk module to easily modify the gaia or cosmos-sdk code and build it together. Following that, I added a new goroutine to create a CPU profile.
Action Points:
I will create a PR to change the order of some of the calls so that at least the CPU profile is enabled before we start the TM node.
Profiling
As noted above, the InitChain process is the crux. In the attachment you can find 2 SVG files depicting a CPU profile. Unfortunately GitHub doesn't support attaching SVG files, so I had to pack them in a ZIP.
Top called functions
Stats for 2min run
Stats for 10min run
Findings:
Next steps
I removed a
Attachments |
I slightly modified the code, and it seems that something is wrong with the IAVL iterator (see the new profiler graph). It looks like setting new balances for each account takes only a fraction of the time spent removing the balances. Both operations update the merkle paths, so I'm wondering if there were any related changes in IAVL. Is anyone aware of any? Should we inspect it further here? |
I'm wondering whether we should be hitting the cacheStore at all during InitChain. |
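To illustrate the observation, here is a toy sketch of why a "remove existing balances, then set new ones" routine can cost more than the writes themselves. It is not the x/bank code: `countingStore`, `setBalancesClearFirst` and `setBalancesDirect` are hypothetical names, and the sketch assumes the store is empty when genesis is loaded, so the clearing step finds nothing to delete but still opens one prefix iterator per account.

```go
package main

import (
	"fmt"
	"strings"
)

// countingStore is a toy in-memory key-value store that counts operations.
// It only illustrates operation counts; it is not an IAVL store.
type countingStore struct {
	data      map[string]int64
	iterators int
	removes   int
	sets      int
}

func newCountingStore() *countingStore {
	return &countingStore{data: map[string]int64{}}
}

// keysWithPrefix simulates opening a prefix iterator over the store.
func (s *countingStore) keysWithPrefix(prefix string) []string {
	s.iterators++
	var keys []string
	for k := range s.data {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	return keys
}

func (s *countingStore) set(k string, v int64) {
	s.sets++
	s.data[k] = v
}

func (s *countingStore) remove(k string) {
	s.removes++
	delete(s.data, k)
}

type balance struct {
	denom  string
	amount int64
}

// setBalancesDirect writes the balances without touching existing keys.
func setBalancesDirect(s *countingStore, addr string, coins []balance) {
	for _, c := range coins {
		s.set(addr+"/"+c.denom, c.amount)
	}
}

// setBalancesClearFirst removes every existing balance of addr before
// writing the new ones, which needs one prefix iteration per account.
func setBalancesClearFirst(s *countingStore, addr string, coins []balance) {
	for _, k := range s.keysWithPrefix(addr + "/") {
		s.remove(k)
	}
	setBalancesDirect(s, addr, coins)
}

func main() {
	clearFirst, direct := newCountingStore(), newCountingStore()
	for i := 0; i < 1000; i++ {
		addr := fmt.Sprintf("addr%d", i)
		coins := []balance{{"uatom", int64(i)}}
		setBalancesClearFirst(clearFirst, addr, coins)
		setBalancesDirect(direct, addr, coins)
	}
	fmt.Printf("clear-first: iterators=%d removes=%d sets=%d\n",
		clearFirst.iterators, clearFirst.removes, clearFirst.sets)
	fmt.Printf("direct:      iterators=%d removes=%d sets=%d\n",
		direct.iterators, direct.removes, direct.sets)
}
```

Both variants end with the same stored data when starting from an empty store. Whether skipping the clear step is safe in a real module depends on whether that routine is also used outside genesis, which this sketch does not model.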
@alexanderbez here is the latest pprof. |
@zmanian - so what loading time should we anticipate? What is the "baseline"? I will do a PR with the x/bank optimization. |
Well a 2 hour node start up time is too much of a usability issue for the hub to upgrade to stargate. I think we need to complete InitChain < 10 min. |
Well, there have been a lot of changes since the version the Hub is currently running, so it's hard to say. But we know that IAVL has pretty poor performance in general, so it's not entirely unexpected that importing the state is taking a long time. I'll dig into this a bit more tomorrow and see if there's any low-hanging fruit. I'll also mention that we now have export/import APIs in IAVL, which should be much faster at importing data than creating nodes one-by-one, but this was introduced in more recent versions than the Hub is running. |
So the bulk of this time is coming from
We should run a profile, @robert-zaremba, with invariant checking off. I believe you can pass this as a CLI flag or config? If not, just hard-code it to zero during app construction. |
If we could make invariant checking optional during node startup we might be okay |
I already did that earlier today. Copy-pasting the response from Discord: @zmanian - currently it's not configurable to disable the invariant checks. I will add that option tomorrow. |
@erikgrinaker - I don't know if loading directly into IAVL is possible. We load the snapshot into the modules as JSON. That JSON is converted up front, because the object serialization changed in 0.40.x. |
Ah, right, good point. Direct IAVL import would basically require the data to remain unchanged, since the exported data contains the internal IAVL tree structure. It might be possible to change the values as long as keys don't change, but I wouldn't really recommend modifying it. |
okay. I think 10-15 min is survivable. |
thanks guys! |
I had a closer look, and as far as I can tell this isn't related to IAVL at all, but rather the
Let me know if I've missed anything, or you'd like a hand with optimizing |
Thanks @erikgrinaker . I will see what's in |
@zmanian , @jackzampolin - I've pushed the update both on the gaia and cosmos-sdk branch to make the crisis invariants check customizable. This is done in the app constructor -
Let me know if this is OK, or if you want to make it customizable through a CLI parameter. |
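As a rough picture of what "customizable in the app constructor" can mean, here is a self-contained toy sketch. It is not the code from the linked PRs: `InvariantChecker`, `NewInvariantChecker` and the `skipAtGenesis` parameter are hypothetical names, and `Invariant` is a simplified stand-in for a module invariant.

```go
package main

import "fmt"

// Invariant is a simplified stand-in for a module invariant: it returns a
// description and whether the invariant is broken.
type Invariant func() (msg string, broken bool)

// InvariantChecker is a hypothetical, stripped-down analogue of a component
// that asserts invariants while genesis state is being initialized.
type InvariantChecker struct {
	invariants    []Invariant
	skipAtGenesis bool
}

// NewInvariantChecker takes the switch at construction time, which is the
// "done in the app constructor" idea from the comment above.
func NewInvariantChecker(skipAtGenesis bool) *InvariantChecker {
	return &InvariantChecker{skipAtGenesis: skipAtGenesis}
}

// Register adds an invariant to be asserted at genesis.
func (c *InvariantChecker) Register(inv Invariant) {
	c.invariants = append(c.invariants, inv)
}

// InitGenesis evaluates the registered invariants unless skipping was
// requested. It returns how many invariants were evaluated and stops at the
// first broken one.
func (c *InvariantChecker) InitGenesis() (int, error) {
	if c.skipAtGenesis {
		return 0, nil
	}
	for i, inv := range c.invariants {
		if msg, broken := inv(); broken {
			return i + 1, fmt.Errorf("invariant broken: %s", msg)
		}
	}
	return len(c.invariants), nil
}

func main() {
	supplyOK := func() (string, bool) {
		return "total supply matches balances", false
	}

	for _, skip := range []bool{false, true} {
		c := NewInvariantChecker(skip)
		c.Register(supplyOK)
		n, err := c.InitGenesis()
		fmt.Printf("skip=%v checked=%d err=%v\n", skip, n, err)
	}
}
```

A CLI parameter would only change where the boolean comes from; the constructor signature could stay the same.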
Interesting. I'll take a look at |
Here is a PR for the Cosmos-SDK (on a new branch, since the previous one had conflicts when merging to master): #7764 |
Here is the Gaia-3 PR: cosmos/gaia#487 |
Here is the followup task: #7766 |
We are using the https://github.com/cosmos/gaia/tree/jack/gaiav3.0 branch
This is a snapshot of block 3557667 of the Cosmoshub. This export is available here: https://storage.googleapis.com/stargate-genesis/3557667.cosmos_hub_3.json
There is a full copy of a cosmos hub full node here: https://storage.googleapis.com/stargate-genesis/snapshot.tgz
Using Gaia 2.0 and this cosmos node above,
gaiad export > 3557667.cosmos_hub_3.json
Using the migrate public keys from the Stargate repo:
This genesis file can be downloaded from https://storage.googleapis.com/stargate-genesis/cosmoshub-test-stargate.json
gaiad start
takes about 2 hours for InitChain to run.