Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic migrations #9616

Closed
3 of 4 tasks
robert-zaremba opened this issue Jun 30, 2021 · 20 comments
Closed
3 of 4 tasks

Atomic migrations #9616

robert-zaremba opened this issue Jun 30, 2021 · 20 comments
Labels
C:Cosmovisor Issues and PR related to Cosmovisor C:Store C:x/upgrade

Comments

@robert-zaremba
Copy link
Collaborator

robert-zaremba commented Jun 30, 2021

Summary

Today, migrations are not always atomic. If a migration fails, we don't have an easy solution to rollback the changes.

Problem Definition

In-place migration are great addition to the SDK. However if not tested carefully, they can cause extensive problems to a node admin.
When running a migration, changes are written to the disk.

  • if we add a new store, then it's committed immediately
  • if someone will not use cache wrapped store with open and commit phase (eg because it would be too operational memory consuming).

If a migration fails with an operation listed above, we are leaving a node in a corrupted, possibly unmanageable state. The only way to recover is to sync from another healthy node, or restore a backup.

Proposal

We need to find a more friendly mechanism to handle backup.

ADR-40 will make it easy because we use a DB level checkpoints. But that won't be implemented in 0.44.

Few proposals:

  1. flag to cosmovisor to copy the DB (that will basically double the disc requirements).
  2. add more options to migration
  3. filesystem level rollback based on journal
  4. implement ADR-40 based checkpoint mechanism in store - that would limit the number of supported databases and would require DB migration for many nodes.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@robert-zaremba
Copy link
Collaborator Author

When thinking more about it, the option (4.) is the best. The option (1) is the easiest.

@robert-zaremba
Copy link
Collaborator Author

@zmanian , @jackzampolin -- I guess in the past this was not a problem because we were always exporting the state (genesis) and it took a time anyway.

@robert-zaremba robert-zaremba self-assigned this Jun 30, 2021
@alexanderbez
Copy link
Contributor

You could do (1) and (4), doing (1) first and if you think you could do (4) in a reasonable amount of time. If not, then at least you can fallback on (1).

@aaronc
Copy link
Member

aaronc commented Jun 30, 2021

I think it is helpful to clarify that they are only non atomic if KVStores are being added, renamed or deleted. That will only happen if modules are added or removed due to the multi store design. If there are no new modules and just migration code, the migrations will be atomic.

@zmanian
Copy link
Member

zmanian commented Jul 2, 2021

yeah the way this was achieved in the past was via state exports. This is a good and important observation. We definitely need a checkpoint and recovery mechanism.

Can we get some kind of primitive snapshot functionality into cosmovisor before the cosmoshub upgrade?

@alexanderbez
Copy link
Contributor

@zmanian I think the easiest approach for now is for operators to just take a backup of their data directory prior to upgrading.

@zmanian
Copy link
Member

zmanian commented Jul 2, 2021

As a practical matter this would mean telling people not to use cosmovisor during the upgrade and I've been asked to tell people to use cosmovisor.

@alexanderbez
Copy link
Contributor

alexanderbez commented Jul 2, 2021

I think they could/should still use it. In the case of an error, they'd have to wipe the resulting data directory and restart cosmovisor?

@anilcse
Copy link
Collaborator

anilcse commented Jul 2, 2021

I think they could/should still use it. In the case of an error, they'd have to wipe the resulting data directory and restart cosmovisor?

Yes, there's no straight way to rollback in case of upgrade failures. So need to wipe off the data and use backup/or resync.

may be, one quick fix we can add to the cosmovisor is, take a backup if the disk space is available or via a ENV setting, defaults to true. If it's set to false, cosmovisor can just try upgrading and in case of failure, node operators can fix it manually later. In other case, where backup is available, in case of failure, cosmovisor will use skip-upgrade and restart with old binary.

@anilcse
Copy link
Collaborator

anilcse commented Jul 2, 2021

I think they could/should still use it. In the case of an error, they'd have to wipe the resulting data directory and restart cosmovisor?

Yes, there's no straight way to rollback in case of upgrade failures. So need to wipe off the data and use backup/or resync.

may be, one quick fix we can add to the cosmovisor is, take a backup if the disk space is available or via a ENV setting, defaults to true. If it's set to false, cosmovisor can just try upgrading and in case of failure, node operators can fix it manually later. In other case, where backup is available, in case of failure, cosmovisor will use skip-upgrade and restart with old binary.

Or may be simply,
UNSAFE_UPGRADE:

  • If true, the cosmovisor will try to upgrade without any backup
  • if false (default), cosmovisor will try to take backup and then upgrade. In case of failure while taking backup, it will just halt the process there and won't try the upgrade.

@alexanderbez
Copy link
Contributor

ACK

@aaronc
Copy link
Member

aaronc commented Jul 2, 2021

@ethanfrey I wonder if you have any insights on some way we might be able to provide a clean rollback?

I do agree though that the safest thing is always to do a backup so having that as the default like @anilcse is suggesting seems wise.

@aaronc aaronc changed the title Atomic migrations Atomic Cosmovisor migrations Jul 2, 2021
@aaronc
Copy link
Member

aaronc commented Jul 2, 2021

Also, are we rolling back the panic on empty minimum gas prices which caused the upgrade failures @anilcse @robert-zaremba ? Seems like we should just emit a warning until the next release.

@amaury1093
Copy link
Contributor

Also, are we rolling back the panic on empty minimum gas prices which caused the upgrade failures @anilcse @robert-zaremba ? Seems like we should just emit a warning until the next release.

this just got merged

@robert-zaremba
Copy link
Collaborator Author

Let's start with option 1. - full backup controlled by a flag.
And later we will use snapshot mechanism.

@robert-zaremba
Copy link
Collaborator Author

BTW: some upgrades can take a lot of memory (eg if we were to update all bank managed storage) if we keep the cash wrapper.

@ryanchristo ryanchristo added this to the cosmovisor v1.0 milestone Jul 7, 2021
@robert-zaremba robert-zaremba changed the title Atomic Cosmovisor migrations Atomic migrations Jul 30, 2021
@robert-zaremba robert-zaremba removed this from the cosmovisor v0.1 milestone Jul 30, 2021
@robert-zaremba robert-zaremba added this to the v0.44 milestone Jul 30, 2021
@robert-zaremba
Copy link
Collaborator Author

robert-zaremba commented Jul 30, 2021

Oh, @aaronc - I've just noticed that you added "Cosmovisor" to the title and label. I think it's not fully related to the cosmovisor, however it will be partially solved with:

And fully solved with ADR-40 snapshots.

@aaronc
Copy link
Member

aaronc commented Jul 30, 2021

Right, sorry about that.

mergify bot pushed a commit that referenced this issue Aug 5, 2021
<!--
The default pull request template is for types feat, fix, or refactor.
For other templates, add one of the following parameters to the url:
- template=docs.md
- template=other.md
-->

## Description

Ref: #9616 (comment)

depends: #8590

<!-- Add a description of the changes that this PR introduces and the files that
are the most critical to review. -->

This PR adds a full backup option for cosmovisor. 
`UNSAFE_SKIP_BACKUP` is an `env` setting introduced newly.
- if `false` (default, **recommended**), cosmovisor will try to take backup and then upgrade. In case of failure while taking backup, it will just halt the process there and won't try the upgrade.
- If `true`, the cosmovisor will try to upgrade without any backup. This setting makes it hard to recover from a failed upgrade. Node operators either need to sync from a healthy node or use a snapshot from others.

---

### Author Checklist

*All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.*

I have...

- [x] included the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [ ] added `!` to the type prefix if API or client breaking change
- [ ] targeted the correct branch (see [PR Targeting](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [x] provided a link to the relevant issue or specification
- [ ] followed the guidelines for [building modules](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules)
- [ ] included the necessary unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [ ] added a changelog entry to `CHANGELOG.md`
- [x] included comments for [documenting Go code](https://blog.golang.org/godoc)
- [x] updated the relevant documentation or specification
- [x] reviewed "Files changed" and left comments if necessary
- [ ] confirmed all CI checks have passed

### Reviewers Checklist

*All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.*

I have...

- [ ] confirmed the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [ ] confirmed `!` in the type prefix if API or client breaking change
- [ ] confirmed all author checklist items have been addressed 
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)
@github-actions github-actions bot added the stale label Sep 14, 2021
@amaury1093 amaury1093 added pinned and removed stale labels Sep 21, 2021
@amaury1093 amaury1093 reopened this Sep 21, 2021
@tac0turtle tac0turtle added R:0.46 and removed pinned labels May 9, 2022
@tac0turtle tac0turtle moved this to 📝 Todo in Cosmos-SDK May 12, 2022
@tac0turtle tac0turtle removed the R:0.46 label May 27, 2022
@tac0turtle tac0turtle removed this from the v0.46 milestone May 27, 2022
@tac0turtle
Copy link
Member

closing as this is touched on in the current store discussions

Repository owner moved this from 📝 Todo to 👏 Done in Cosmos-SDK Oct 21, 2022
@robert-zaremba
Copy link
Collaborator Author

The issue is still present, so I think we should keep it open until resolved.

@tac0turtle tac0turtle removed this from Cosmos-SDK Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:Cosmovisor Issues and PR related to Cosmovisor C:Store C:x/upgrade
Projects
No open projects
Development

No branches or pull requests

8 participants