Remove named checkpointing #3891

hjoliver · 2020-10-26T20:10:50Z

Supersedes #3864

Long story short:

no one uses it
Cylc 8 reflow achieves the same result better
it complicates the DB and code logic somewhat

So: let's get rid of it.

@MetRonnie - assigning you as you've run into this somewhat in #3863, but someone else can take it on if need be...


MetRonnie · 2020-10-27T17:57:32Z

On the Discourse discussion, someone has mentioned named vs automatic checkpoints:

To confirm, this is just for the named checkpoints? Not the checkpointing automatic/recovery capability? We very much use the automatic checkpointing for recovery but not the named checkpoints.

I think I can find 3 types of automatic checkpoints:

restart

cylc-flow/cylc/flow/scheduler.py

Lines 426 to 429 in 5073409

    
    # Take checkpoint and commit immediately so that checkpoint can be
    # copied to the public database.
    pri_dao.take_checkpoints("restart")
    pri_dao.execute_queued_items()

reload-init

cylc-flow/cylc/flow/scheduler.py

Line 986 in 5073409

    self.suite_db_mgr.checkpoint("reload-init")

reload-done

cylc-flow/cylc/flow/scheduler.py

Lines 1443 to 1445 in 5073409

    
    if self.pool.do_reload:
        self.pool.reload_taskdefs()
        self.suite_db_mgr.checkpoint("reload-done")

What is to happen to those? Technically they are named checkpoints. If their functionality is to be kept intact, does that mean we can't remove named checkpoints?

MetRonnie · 2020-10-27T18:40:28Z

(Adding this to the 8.0a3 milestone, unless that was not the intent)

hjoliver · 2020-10-27T22:31:43Z

Drop the two automatic reload checkpoints (2 and 3) as well as arbitrary checkpoints taken with cylc checkpoint.

Keep the restart checkpoint, since without that we couldn't restart from the latest state. This gets overwritten whenever anything changes, so there's only ever one of them. Reflow can't replace this (e.g. a proper restart knows about tasks that were running when the scheduler went down).

As I recall the "latest" (restart) checkpoint appears in the checkpoint_id table, but after this it will be the only entry so there's no point in keeping the table. Further, if you look in rundb.py "latest" is always treated as a special case. It points to the task_pool table rather than task_pool_checkpoints, and to suite_params rather than suite_params_checkpoints, etc.

So the checkpoint_id table, and the three blah_checkpoints tables, cylc checkpoint, (and all associated code), can be removed.
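As a rough illustration of the database side of this removal, here is a minimal sketch that drops the checkpoint tables from an SQLite run database while leaving the "latest" tables alone. It is not the code from rundb.py: the function names and toy schemas are hypothetical, and `broadcast_states_checkpoints` is an assumed name inferred from the "blah_checkpoints" pattern.

```python
import sqlite3

# Table names taken from the discussion above. "broadcast_states_checkpoints"
# is an assumption, inferred from the "blah_checkpoints" naming pattern.
CHECKPOINT_TABLES = [
    "checkpoint_id",
    "task_pool_checkpoints",
    "suite_params_checkpoints",
    "broadcast_states_checkpoints",
]

# The "latest" tables that must survive (needed for restart).
LATEST_TABLES = ["task_pool", "suite_params", "broadcast_states"]


def drop_checkpoint_tables(conn):
    """Drop the checkpoint tables if present; return the remaining table names."""
    for name in CHECKPOINT_TABLES:
        # Names come from the fixed list above, so formatting them in is safe.
        conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.commit()
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    """Build a toy in-memory DB (not the real Cylc schema) and clean it up."""
    conn = sqlite3.connect(":memory:")
    for name in LATEST_TABLES + CHECKPOINT_TABLES:
        conn.execute(f"CREATE TABLE {name} (id INTEGER)")
    return drop_checkpoint_tables(conn)
```

Using `DROP TABLE IF EXISTS` keeps the sketch safe to run against a database that never had these tables.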

hjoliver · 2020-10-27T22:32:18Z

(Arrgh, lost control of keyboard! - reopening!)

hjoliver · 2020-10-27T22:46:00Z

@oliver-sanders - I wonder if we should rename the "latest checkpoint" (/"latest_state"/"latest snapshot") tables consistently, to make it obvious what they are and distinguish from the ever-growing run history tables?

Currently:

broadcast_states
suite_params
task_pool

New?

broadcast_latest
suite_param_latest
task_pool_latest
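If the rename went ahead, an existing database could in principle be migrated in place. The sketch below assumes plain SQLite `ALTER TABLE ... RENAME TO` is sufficient and uses the proposed names exactly as listed above; the `rename_latest_tables` helper is hypothetical, not part of Cylc.

```python
import sqlite3

# Current -> proposed table names, as listed in the comment above.
RENAMES = {
    "broadcast_states": "broadcast_latest",
    "suite_params": "suite_param_latest",
    "task_pool": "task_pool_latest",
}


def list_tables(conn):
    """Return the sorted names of all tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return sorted(row[0] for row in rows)


def rename_latest_tables(conn):
    """Rename the 'latest state' tables; skip any that were already renamed."""
    existing = set(list_tables(conn))
    for old, new in RENAMES.items():
        if old in existing and new not in existing:
            # Names come from the fixed mapping above, not from user input.
            conn.execute(f"ALTER TABLE {old} RENAME TO {new}")
    conn.commit()
    return list_tables(conn)
```

The existence check makes the migration idempotent, so running it twice against the same database is harmless.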

hjoliver · 2020-10-27T23:04:47Z

Hmm, one thing I forgot: we also record a named checkpoint at every restart.

sqlite> select * from checkpoint_id;
0|2020-10-28T11:42:38+13:00|latest
1|2020-10-28T11:42:19+13:00|restart
2|2020-10-28T11:42:26+13:00|restart

I've never heard of anyone wanting to restart from a previous (not latest) restart state with all the broadcasts that happened to be extant at the time. So I suppose it's fine to ditch that. (However, we should warn users about broadcasts in the context of reflows: beware of any reflowed tasks that expect input via broadcasts from other pre-reflow tasks. Mind you, the same is already true for warm starts. We should advise users to use broadcasts sparingly and to know where they are in the flow.)

However, suite_params_checkpoints serves as a historical record of global parameters that only change during restart or reload:

sqlite> select * from suite_params;
uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
cylc_version|8.0a3.dev
UTC_mode|0
cycle_point_tz|+1300

sqlite> select * from suite_params_checkpoints;
1|uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
1|cylc_version|8.0a3.dev
1|UTC_mode|0
1|cycle_point_tz|+1300
2|uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
2|cylc_version|8.0a3.dev
2|UTC_mode|0
2|cycle_point_tz|+1300

Perhaps we should replace this with an ever-growing suite_params history table?

On the other hand, the DB does not track changes to everything in the flow config, so maybe it doesn't matter.
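For what it's worth, an ever-growing history table could be quite small. The following is only a sketch of the idea: the `suite_params_history` table, its columns, and the `record_params` helper are all hypothetical, and each restart or reload simply appends a timestamped copy of the current parameters.

```python
import sqlite3
from datetime import datetime, timezone


def record_params(conn, params, event):
    """Append the current suite parameters to a history table.

    `event` is a free-form label such as "restart" or "reload".
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS suite_params_history "
        "(time TEXT, event TEXT, key TEXT, value TEXT)"
    )
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO suite_params_history VALUES (?, ?, ?, ?)",
        [(now, event, key, str(value)) for key, value in params.items()],
    )
    conn.commit()


def history_for(conn, key):
    """Return (event, value) pairs for one parameter, oldest first."""
    return conn.execute(
        "SELECT event, value FROM suite_params_history "
        "WHERE key = ? ORDER BY rowid",
        (key,),
    ).fetchall()
```

Unlike suite_params_checkpoints, rows here are keyed by time and event rather than by checkpoint ID, so the table does not depend on checkpoint_id existing.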

oliver-sanders · 2020-10-28T09:48:00Z

It would be nice from a provenance perspective to maintain this information, however, I don't think it's a biggie, can tackle later on if required.


MetRonnie · 2020-10-29T17:32:25Z

However, suite_params_checkpoints serves as a historical record of global parameters that only change during restart or reload

Am I right in thinking only the cylc_version can change on a restart, and that actually none of them could be different between reloads?

Update: Ah wait, the initial cycle point etc could change

oliver-sanders · 2020-10-29T17:55:20Z

Yep, also the user can specify different suite variables (e.g. -s FOO=BAR)

MetRonnie · 2020-10-30T17:27:02Z

I've put up these changes: https://github.com/cylc/cylc-flow/pull/3906/files/8b1f2956de934b1d2996e3e2310c7274818c18da..0eb2b03cff3fc53822ba4293bfe4fbcbc582e9a5

So far I haven't removed the checkpoint_id and suite_params_checkpoints tables - was it decided to remove them or not?

hjoliver · 2020-10-30T20:05:07Z

Yes remove them. The suite_params aren’t very useful as-is (particularly without corresponding state checkpoints) and as Oliver noted we can come back to a better provenance solution later if needed.

MetRonnie · 2020-10-30T20:22:36Z

What about keeping track of the number of restarts? Could we just have a table that consists of 1 cell that keeps the count?

hjoliver · 2020-10-30T20:28:38Z

Do we need to keep track of that?

Maybe (but not on this PR) a new “suite events” table that keeps a history of all restarts and interventions ...

MetRonnie · 2020-10-30T20:33:59Z

The number of the restart gets printed to the log on every restart. Can either comment that out or get rid of it

hjoliver · 2020-10-30T20:39:03Z

Ok, I forgot about that. Seems to me we could get rid of it now, in lieu of a better provenance system in future. But maybe do it as a clean commit that can be backed out if others disagree! (Although I can’t think of any real use for that info at the moment)

oliver-sanders · 2020-11-02T09:24:49Z

It's quite nice when you are trawling through the logs of some massive suite, or one where the user has repeatedly restarted it in the hope that they might get a different outcome.

Could record the (re)start number as a suite parameter?
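A minimal sketch of that suggestion, assuming suite_params is a simple key/value table as the sqlite output earlier in the thread suggests. The `n_restart` key and the `bump_restart_number` helper are hypothetical names used only for illustration.

```python
import sqlite3


def bump_restart_number(conn):
    """Store and return the (re)start number: 0 on first start, then 1, 2, ..."""
    # Toy schema: the real suite_params table may be defined differently.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS suite_params "
        "(key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM suite_params WHERE key = 'n_restart'"
    ).fetchone()
    count = int(row[0]) + 1 if row else 0
    conn.execute(
        "INSERT OR REPLACE INTO suite_params VALUES ('n_restart', ?)",
        (str(count),),
    )
    conn.commit()
    return count
```

The scheduler could call this once at start-up and print the returned number to the log, which would keep the existing log message without needing the checkpoint_id table.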

hjoliver assigned MetRonnie Oct 26, 2020

MetRonnie added a commit to MetRonnie/cylc-flow that referenced this issue Oct 26, 2020

Disable part of checkpoint test

eccaf72

To be fully removed - cylc#3891

oliver-sanders mentioned this issue Oct 27, 2020

cli: tidy #3893

Closed

MetRonnie added this to the cylc-8.0a3 milestone Oct 27, 2020

hjoliver closed this as completed Oct 27, 2020

hjoliver reopened this Oct 27, 2020

MetRonnie added a commit to MetRonnie/cylc-flow that referenced this issue Oct 28, 2020

Disable part of checkpoint test

8b1f295

To be fully removed - cylc#3891

oliver-sanders mentioned this issue Oct 29, 2020

Rationalize the CLI command set. #1249

Closed

9 tasks

MetRonnie mentioned this issue Oct 30, 2020

Remove checkpointing #3906

Merged

8 tasks

hjoliver mentioned this issue Nov 5, 2020

7-to-8 migration guide cylc/cylc-doc#170

Closed

20 tasks

oliver-sanders closed this as completed in #3906 Nov 5, 2020

hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
