Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove named checkpointing #3891

Closed
hjoliver opened this issue Oct 26, 2020 · 16 comments · Fixed by #3906
Closed

Remove named checkpointing #3891

hjoliver opened this issue Oct 26, 2020 · 16 comments · Fixed by #3906
Assignees
Milestone

Comments

@hjoliver
Copy link
Member

Supersedes #3864

Long story short:

  • no one uses it
  • Cylc 8 reflow achieves the same result better
  • it complicates the DB and code logic somewhat

So: let's get rid of it.

@MetRonnie - assigning you as you've run into this somewhat in #3863, but someone else can take it on if need be...

MetRonnie added a commit to MetRonnie/cylc-flow that referenced this issue Oct 26, 2020
@MetRonnie
Copy link
Member

On the Discourse discussion, someone has mentioned named vs automatic checkpoints:

To confirm, this is just for the named checkpoints? Not the checkpointing automatic/recovery capability? We very much use the automatic checkpointing for recovery but not the named checkpoints.

I think I can find 3 types of automatic checkpoints:

  1. restart
    # Take checkpoint and commit immediately so that checkpoint can be
    # copied to the public database.
    pri_dao.take_checkpoints("restart")
    pri_dao.execute_queued_items()
  2. reload-init
    self.suite_db_mgr.checkpoint("reload-init")
  3. reload-done
    if self.pool.do_reload:
    self.pool.reload_taskdefs()
    self.suite_db_mgr.checkpoint("reload-done")

What is to happen to those? Technically they are named checkpoints. If their functionality is to be kept intact, does that mean we can't removed named checkpoints?

@MetRonnie MetRonnie added this to the cylc-8.0a3 milestone Oct 27, 2020
@MetRonnie
Copy link
Member

MetRonnie commented Oct 27, 2020

(Adding this to the 8.0a3 milestone, unless that was not the intent)

@hjoliver
Copy link
Member Author

hjoliver commented Oct 27, 2020

Drop the two automatic reload checkpoints (2 and 3) as well as arbitrary checkpoints taken with cylc checkpoint.

Keep the restart checkpoint, since without that we couldn't restart from the latest state. This gets overwritten whenever anything changes, so there's only ever one of them. Reflow can't replace this (e.g. a proper restart knows about tasks that were running the scheduler went down).

As I recall the "tatest" (restart) checkpoint appears in the checkpoint_id table, but after this it will be the only entry so there's no point in keeping the table. Further, if you look in rundb.py "latest" is always treated as a special case. It points to to the task_pool table rather than task_pool_checkpoints; and suite_params rather than suite_params_checkpoints etc.

So the checkpoint_id table, and the three blah_checkpoints tables, cylc checkpoint, (and all associated code), can be removed.

@hjoliver
Copy link
Member Author

(Arrgh, lost control of keyboard! - reopening!)

@hjoliver hjoliver reopened this Oct 27, 2020
@hjoliver
Copy link
Member Author

@oliver-sanders - I wonder if we should rename the "latest checkpoint" (/"latest_state"/"latest snapshot") tables consistently, to make it obvious what they are and distinguish from the ever-growing run history tables?

Currently:

  • broadcast_states
  • suite_params
  • task_pool

New?

  • broadcast_latest
  • suite_param_latest
  • task_pool_latest

@hjoliver
Copy link
Member Author

Hmm, one thing I forgot: we also record a named checkpoint at every restart.

sqlite> select * from checkpoint_id;
0|2020-10-28T11:42:38+13:00|latest
1|2020-10-28T11:42:19+13:00|restart
2|2020-10-28T11:42:26+13:00|restart

I've never heard of anyone wanting to restart from a previous (not latest) restart state with all the broadcasts that happened to be extant at the time. So I suppose it's fine to ditch that. (However, we should warn users about broadcasts in the context of reflows: beware of any reflowed tasks that expect input via broadcasts from other pre-reflow tasks - mind you, the same is already true for warm starts. We should advise to use broadcast sparingly and to know where they are in the flow).

However, suite_params_checkpoints serves as a historical record of global parameters that only change during restart or reload:

sqlite> select * from suite_params;
uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
cylc_version|8.0a3.dev
UTC_mode|0
cycle_point_tz|+1300

sqlite> select * from suite_params_checkpoints;
1|uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
1|cylc_version|8.0a3.dev
1|UTC_mode|0
1|cycle_point_tz|+1300
2|uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
2|cylc_version|8.0a3.dev
2|UTC_mode|0
2|cycle_point_tz|+1300

Perhaps we should replace this with an ever-growing suite_params history table?

On the other hand, the DB does not track changes to everything in the flow config, so maybe it doesn't matter.

@oliver-sanders
Copy link
Member

It would be nice from a provenance perspective to maintain this information, however, I don't think it's a biggie, can tackle later on if required.

MetRonnie added a commit to MetRonnie/cylc-flow that referenced this issue Oct 28, 2020
@MetRonnie
Copy link
Member

MetRonnie commented Oct 29, 2020

However, suite_params_checkpoints serves as a historical record of global parameters that only change during restart or reload

Am I right in thinking only the cylc_version can change on a restart, and that actually none of them could be different between reloads?

Update: Ah wait, the initial cycle point etc could change

@oliver-sanders
Copy link
Member

Yep, also the user can specify different suite variables (e.g -s FOO=BAR)

@MetRonnie
Copy link
Member

I've put up these changes: https://github.com/cylc/cylc-flow/pull/3906/files/8b1f2956de934b1d2996e3e2310c7274818c18da..0eb2b03cff3fc53822ba4293bfe4fbcbc582e9a5

So far I haven't removed the checkpoint_id and suite_params_checkpoints tables - was it decided to remove them or not?

@hjoliver
Copy link
Member Author

Yes remove them. The suite_params aren’t very useful as-is (particularly without corresponding state checkpoints) and as Oliver noted we can come back to a better provenance solution later if needed.

@MetRonnie
Copy link
Member

What about keeping track of the number of restarts? Could we just have a table that consists of 1 cell that keeps the count?

@hjoliver
Copy link
Member Author

Do we need to keep track of that?

Maybe (but not on this PR) a new “suite events” table that keeps a history of all restarts and interventions ...

@MetRonnie
Copy link
Member

The number of the restart gets printed to the log on every restart. Can either comment that out or get rid of it

@hjoliver
Copy link
Member Author

hjoliver commented Oct 30, 2020

Ok, I forgot about that. Seems to me we could get rid of it now, in lieu of a better provenance system in future. But maybe do it as a clean commit that can be backed out if others disagree! (Although I can’t think of any real use for that info at the moment)

@oliver-sanders
Copy link
Member

It's quite nice when you are trawling through the logs of some massive suite, or one where the user has repeatably restarted it in the hope that they might get a different outcome.

Could record the re(start) number as a suite parameter?

@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants