Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set completion mark in correct timing #229

Merged
merged 2 commits into from
Jan 26, 2025

Conversation

hyeonLewis
Copy link
Contributor

@hyeonLewis hyeonLewis commented Jan 24, 2025

Proposed changes

When state migration has been finished, it removes old DB and dbm for state migration. But setting the completion mark inMigration = false is done after dbm.dbs[StateTrieMigrationDB] = nil in separate go-routine (no lock held), which leads to potential nil dereference due to timing issue while opening a new trie (since it accesses to dbm.dbs[StateTrieMigrationDB] if inMigration = true).

I strongly suspect it's the root cause of the random test failure of TestMigration_StartMigrationByMiscDBOnRestart in CI.

Types of changes

Please put an x in the boxes related to your change.

  • Bugfix
  • New feature or enhancement
  • Others

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING GUIDELINES doc
  • I have read the CLA and signed by comment I have read the CLA Document and I hereby sign the CLA in first time contribute
  • Lint and unit tests pass locally with my changes ($ make test)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Related issues

The CI failure always happens in #225.

Further comments

@hyeonLewis hyeonLewis self-assigned this Jan 24, 2025
@hyeonLewis hyeonLewis requested a review from hyunsooda January 24, 2025 00:26
@hyeonLewis hyeonLewis changed the title Mark completion mark when delete db Set completion mark in correct timing Jan 24, 2025
Copy link
Collaborator

@ian0371 ian0371 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on PR description, before accessing dbm.dbs[StateTrieMigrationDB], isn't the read lock checked? Could you show me which line?
Plus, this PR seems correct because the lock is not held in removeOldDB.

ian0371
ian0371 previously approved these changes Jan 24, 2025
@hyeonLewis
Copy link
Contributor Author

@ian0371 https://github.com/kaiachain/kaia/blob/dev/storage/database/db_manager.go#L1841. And the lock is not held in removeOldDB.

@hyunsooda
Copy link
Contributor

hyunsooda commented Jan 24, 2025

If the completion mark is forwarded back, previous bad ux may remain (klaytn/klaytn#1412). I think these three lines https://github.com/kaiachain/kaia/pull/229/files#diff-3f4e98143300d776da515f7e8784f500929242d2fd1705aa237cdb76aa84d638R836-R838 should be invoked in one operation. What do you think if the other two lines are moved together in L851?

@hyeonLewis
Copy link
Contributor Author

@hyunsooda Thanks for supplying contexts, and I think moving all three lines into defer in removeOldDB also works. @ian0371 What do you think?

@ian0371
Copy link
Collaborator

ian0371 commented Jan 24, 2025

We should figure out the dependencies of each resource, and figure out the deallocation order.

  1. Invalidate memory dbm.inMigration; dbm.inMigration = false
  2. Invalidate memory; dbm.dbs[StateTrieMigrationDB] = nil. Ensure that it is not accessed when dbm.inMigration is false.
  3. Remove DB; removeOldDB()
  4. Invalidate MigrationDB path; dbm.setDBDir(StateTrieMigrationDB, ""). Ensure that it is not accessed when dbm.inMigration is false.
  5. Write migration status; dbm.setStateTrieMigrationStatus(0)

So the invariant is that MigrationDB can be accessed only when dbm.inMigration is true.

If the program halts during step 3, resume from 3 on restart. Maybe we need to add "ShouldRemoveOldDB" to indicate that removing DB is not finished.

@hyeonLewis
Copy link
Contributor Author

@ian0371 @hyunsooda Currently, the node restarts the state migration if the program exits/halts before deleting the old DB. And if we call setStateTrieMigrationStatus(0) early, the issue lake solved occurred again.

The fundamental solution would be storing deletion info into DB as Ian's suggestion, and it can be handled in separate PR. I'll remain setDBDir to maintain the same behavior when restarting the node.

Before that, I patched to make dbs nil after inMigration = false. PTAL.

@hyunsooda
Copy link
Contributor

hyunsooda commented Jan 24, 2025

Yes the migration restarts if setStateTrieMigrationStatus(0) has not been called (even though I have a doubt old database will be eventually removed). The old fix seems to guarantee to show isMigration field to false before delete operation is done.

I belive that this commit 95afbd6 will solve the issue because the delete operation will be called at restart again.
Now I got another view. Reading the state trie from the StateTrieMigrationDB may be another inconsistency problem. Do you mean this problem?

@hyeonLewis
Copy link
Contributor Author

@hyunsooda Reading state trie from StateTrieMigrationDB must be prevented after isMigration = false. Could you provide which code has an inconsistency issue?

Copy link
Contributor

@hyunsooda hyunsooda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, got the point. Following PR expects that recorgnize the end of migration work and directly jump to remove database if it has not been deleted.

@ian0371
Copy link
Collaborator

ian0371 commented Jan 24, 2025

I think reverting file to pre-#88 is correct (for both FinishStateMigration and removeOldDB), and make removeOldDB goroutine continue on restart. (which was 3d22583 but need to fix more)

https://github.com/klaytn/klaytn/blob/5b050ee5243962da863b8bd627401f3d72d47a8f/storage/database/db_manager.go#L749-L751

If inMigration is true, reading and writing trie node happens on two DBs, trieDB and migrateDB.
At FinishStateMigration(), it means that migrateDB is ready to use and can replace trieDB. Thus, after calling FinishStateMigration(), there's no reason for inMigration flag to be true. Since #88, setting inMigration=false has been delayed, and that's why nil dereference took place in the first place.

@hyeonLewis
Copy link
Contributor Author

hyeonLewis commented Jan 24, 2025

@ian0371 In pre-#88, it doesn't continue on restart since migrationStatusKey: 0, which prevents the node from restarting migration at startup. https://github.com/kaiachain/kaia/blob/dev/storage/database/db_manager.go#L595

So the call flow of pre-88 is:

  1. Set migrationStatusKey: 0 before deleting old db
  2. (Somehow) Node stopped while deleting old statetrie DB
  3. Restart the node
  4. inMigration = false since migrationStatusKey: 0
  5. restartStateMigration becomes no-op.

I also think it's more appropriate to be pre-88 except for the restart issue, so I'll try to introduce solution (add flag to continue deleting DB) in a separate PR.

@hyeonLewis hyeonLewis merged commit b959f9d into kaiachain:dev Jan 26, 2025
11 checks passed
@hyeonLewis hyeonLewis deleted the fix-state-migration branch January 26, 2025 01:49
@github-actions github-actions bot locked and limited conversation to collaborators Jan 26, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants