Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cached state will not exist for persisted check after a reboot #414

Open
flotter opened this issue Apr 28, 2024 · 3 comments
Open

Cached state will not exist for persisted check after a reboot #414

flotter opened this issue Apr 28, 2024 · 3 comments
Assignees
Labels
25.04 Planned for the 25.04 cycle

Comments

@flotter
Copy link
Contributor

flotter commented Apr 28, 2024

[unrecovered-panic] runtime.fatalpanic() .snap/go/10584/src/runtime/panic.go:1188 (hits goroutine(1):1 total:1) (PC: 0x43df84)
Warning: debugging optimized function
	runtime.curg._panic.arg: interface {}(string) "interface conversion: interface {} is nil, not *plan.Check"
(dlv) bt
 0  0x000000000043df84 in runtime.fatalpanic
    at .snap/go/10584/src/runtime/panic.go:1188
 1  0x000000000043d759 in runtime.gopanic
    at .snap/go/10584/src/runtime/panic.go:1017
 2  0x0000000000ac2d7c in github.com/canonical/pebble/internals/cli.Run.func1
    at .root/go/pkg/mod/github.com/canonical/[email protected]/internals/cli/cli.go:298
 3  0x000000000043d610 in runtime.gopanic
    at .snap/go/10584/src/runtime/panic.go:920
 4  0x000000000040c685 in runtime.panicdottypeE
    at .snap/go/10584/src/runtime/iface.go:263
 5  0x0000000000994ce5 in github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged
    at .root/go/pkg/mod/github.com/canonical/[email protected]/internals/overlord/checkstate/manager.go:125
 6  0x00000000009e980f in github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged-fm
    at <autogenerated>:1
 7  0x00000000009ce9fc in github.com/canonical/pebble/internals/overlord/planstate.(*PlanManager).planChanged
    at .root/go/pkg/mod/github.com/canonical/[email protected]/internals/overlord/planstate/manager.go:87
 8  0x00000000009ce690 in github.com/canonical/pebble/internals/overlord/planstate.(*PlanManager).Load
    at .root/go/pkg/mod/github.com/canonical/[email protected]/internals/overlord/planstate/manager.go:67
 9  0x00000000009e5949 in github.com/canonical/pebble/internals/overlord.New
    at .root/go/pkg/mod/github.com/canonical/[email protected]/internals/overlord/overlord.go:203
10  0x0000000000a0b4c8 in github.com/canonical/pebble/internals/daemon.New
    at .root/go/pkg/mod/github.com/canonical/[email protected]/internals/daemon/daemon.go:786
:
13  0x00000000005866d2 in github.com/canonical/go-flags.(*Parser).ParseArgs
    at .root/go/pkg/mod/github.com/canonical/[email protected]/parser.go:335
14  0x00000000005854c5 in github.com/canonical/go-flags.(*Parser).Parse
    at .root/go/pkg/mod/github.com/canonical/[email protected]/parser.go:191
15  0x0000000000a9f78b in github.com/canonical/pebble/internals/cli.Run
    at .root/go/pkg/mod/github.com/canonical/[email protected]/internals/cli/cli.go:327
16  0x0000000000acd97f in main.main
    at .home/flotter/XXXXX/cmd/XXXXX/main.go:90
17  0x0000000000440327 in runtime.main
    at .snap/go/10584/src/runtime/proc.go:267
18  0x00000000004717a1 in runtime.goexit
    at .snap/go/10584/src/runtime/asm_amd64.s:1650
   120:				if change.Kind() == performCheckKind {
   121:					configKey = performConfigKey{change.ID()}
   122:				} else {
   123:					configKey = recoverConfigKey{change.ID()}
   124:				}
=> 125:				oldConfig := m.state.Cached(configKey).(*plan.Check) // panic if key not present (always should be)
   126:				existingChecks[oldConfig.Name] = true
   127:	
   128:				newConfig, inNew := newPlan.Checks[details.Name]
   129:				if inNew {
   130:					merged := mergeServiceContext(newPlan, newConfig)
@flotter
Copy link
Contributor Author

flotter commented Apr 28, 2024

Discussed with @benhoyt and will propose a couple of PRs soon.

  1. Plan Manager is notifying all managers of a plan change before the state engine has started (overlord loop). This means managers have no opportunity to remove leftover persisted tasks during manager StartUp before the manager's change handler starts processing the new plan.

  2. Gracefully handle missing Cached lookups, as this almost certainly means a leftover check from the state engine started running after a reboot. In this case we likely do not have any undo tasks associated with the handlers in question, so we could simply call Abort on them.

@flotter
Copy link
Contributor Author

flotter commented Apr 29, 2024

This could be fix for this issue identified above, but I would still like to look at improving the Plan Manager so that it does not propagate a plan change before managers can really cope with it (before the state engine is ready).

#415

benhoyt pushed a commit that referenced this issue May 2, 2024
Addresses #414.

Running checks have changes that get persisted by the state-engine. This
means that following a reboot, changes and tasks that are not complete
(not ready) will be resumed. In the case of some managers, such as
checkstate, carryover changes and tasks are an unwanted side effect.

Currently the plan manager will perform the first plan load very early
during startup (before the state-engine is ready, and before StartUp
hooks have been called). The result is that a PlanChanged propagation
will take place at plan load, and force checkstate to inspect the
running checks prematurely.

Checkstate discovers a running change from a previous boot context, and
tries to load its data from cached state, which does not exist.

Gracefully ignore changes for which cached state does not exist, and use
this to identify changes that should be aborted on the first ensure
pass.

Test:

```
<reboot>
59   Hold    today at 23:29 SAST  today at 23:37 SAST  Recover exec check "internet-online"
60   Error   today at 23:37 SAST  today at 23:37 SAST  Perform exec check "internet-online"
:
66   Doing   today at 23:37 SAST  -                    Recover exec check "internet-online"
<reboot>
:
66   Hold    today at 23:37 SAST  today at 23:41 SAST  Recover exec check "internet-online"
:
69   Error   today at 23:41 SAST  today at 23:42 SAST  Perform exec check "internet-online"
:
73   Doing   today at 23:42 SAST  -                    Recover exec check "internet-online"
```
As the test demonstrates, following a reboot, the change (66) is
aborted, and as a result the status is now ```Hold```. A new change is
created which starts as a ```Perform```, and then fails (as expected)
after a while and changes to a ```Recover```.
@benhoyt benhoyt self-assigned this May 17, 2024
@benhoyt benhoyt added the 24.10 label May 17, 2024
@benhoyt benhoyt added 25.04 Planned for the 25.04 cycle and removed 24.10 labels Nov 7, 2024
@benhoyt
Copy link
Contributor

benhoyt commented Nov 7, 2024

Just copying from my personal notes. For a more holistic fix:

  • maybe move config to task (but delete it when tasks are done)
  • think about hold/continue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
25.04 Planned for the 25.04 cycle
Projects
None yet
Development

No branches or pull requests

2 participants