
53 save model state customization #100

Merged: 20 commits merged into development on Aug 5, 2024

Conversation

@vanlankveldthijs

Resolves #53, #66 and #68

Adds option to also save model state at pre-specified time steps (milestones) and to restart from a stored milestone.

Adds tests for saving milestone and restarting from milestone.

Also has some related minor implementations:

  • moved the savestate param into the config and renamed it checkpoint_period
  • stored the config by step count as well (for tracking policy changes)
  • added a code version ID (a specified commit hash) to save states
  • moved the restart param from model init to the run method
  • changed the restart param to accept a boolean (restart from the last checkpoint), an int (restart from a milestone), or a tuple of ints (restart from a milestone at a duplicated step); see the usage sketch below
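
A hedged usage sketch of the three restart forms (the model and method names follow this PR; the step values are illustrative):

model.run(restart=True)      # resume from the last checkpoint
model.run(restart=500)       # resume from the milestone stored at step 500
model.run(restart=(500, 2))  # resume from a duplicated milestone: step 500, instance 2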

thijsvl and others added 14 commits July 25, 2024 15:56
When initializing a model from a checkpoint or milestone, the file
with step count will store updated config parameters.

The file without step count will always store the latest config
parameters.
The version ID is the git commit hash.

This script can either be set to run after every checkout,
or (more likely) on every merge to the development branch.
On loading a model, the stored code version is compared to the
current code version; if they differ, a warning is logged.
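
A minimal sketch of that load-time check (the stored-state key is hypothetical; logger and self.version appear elsewhere in this PR):

stored_version = saved_state["version"]  # hypothetical key holding the stored commit hash
if stored_version != self.version:
    logger.warning(f"Model was saved with code version {stored_version}, "
                   f"but the current code version is {self.version}.")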
@vmgaribay

Hey, @vanlankveldthijs! (1) Do I understand that, with this update, to turn the model state saving feature off, you set the checkpoint period variable to 0? (2) Or will doing that cause problems with self.step_count % self.checkpoint_period?

If the answer to 1 is no and/or 2 is yes:
I have so far been setting savestate to a number higher than the step target on occasions when I don't want to save the whole model. It would be really nice if this new param checkpoint_period could be made Optional[int]=10 and there was a case to handle None.
else: cool!

@vanlankveldthijs (Author)


Correct: if checkpoint_period is set to any non-positive value (0 or a negative value), no checkpoints are saved.

If you prefer, I could still make the parameter optional, so None would have the same effect as a non-positive value.

However, I would not recommend actually running without ever saving a checkpoint when running the simulation for any extended period of time. If the simulation ever crashes for whatever reason, you wouldn't want to have to redo hours of processing if you could've saved every 100 or 1000 steps.
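
A minimal sketch of the guard (the helper name is hypothetical), which would also cover the proposed Optional[int] variant:

from typing import Optional

def should_checkpoint(step_count: int, checkpoint_period: Optional[int]) -> bool:
    # None or a non-positive period disables checkpointing, which also
    # avoids the step_count % checkpoint_period division-by-zero concern.
    if checkpoint_period is None or checkpoint_period <= 0:
        return False
    return step_count % checkpoint_period == 0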

@vanlankveldthijs (Author)

Some of the 'minor implementations' also resolve #52

@vanlankveldthijs force-pushed the 53_save_model_state_customization branch from 3faa497 to 994192c on August 5, 2024 12:16
@vmgaribay merged commit 81c6574 into development on Aug 5, 2024
4 checks passed
@vanlankveldthijs (Author)

@vmgaribay why did you merge this into development? It has not been reviewed at all.

Comment on lines +223 to +228
# save updated config to yaml files
cfg_filename = parent_dir / f'{self._model_identifier}.yaml'
cfg.to_yaml(cfg_filename)
logger.warning(f'The model parameters are saved to {cfg_filename}.')
cfg_filename_step = parent_dir / f'{self._model_identifier}_{self.step_count}.yaml'
cfg.to_yaml(cfg_filename_step)
logger.warning(f'The model parameters are saved to {cfg_filename} and {cfg_filename_step}.')
Member

It is not clear why the configurations (cfg) are stored twice, i.e. cfg_filename and cfg_filename_step. The function set_model_parameters is usually run once at the beginning when step_count is 0. Also, tests are not added in test_model.py::test_set_model_parameters to verify these changes.

Author

Based on our discussion, I will

  • remove the part where it stores in cfg_filename, only storing in cfg_filename_step.
  • when continuing from a stored model, find the config file with the highest step count that does not exceed the step where the process will continue (see the sketch below).
  • add tests for all of this functionality.
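
A hedged sketch of the lookup in the second bullet (the {identifier}_{step}.yaml naming follows the diff above; the helper itself is hypothetical):

import re
from pathlib import Path

def find_config_for_step(parent_dir: Path, identifier: str, restart_step: int) -> Path:
    # Among the stored configs, pick the one with the highest step count
    # that does not exceed the step where the run will continue.
    pattern = re.compile(rf"{re.escape(identifier)}_(\d+)\.yaml$")
    candidates = {}
    for f in parent_dir.glob(f"{identifier}_*.yaml"):
        m = pattern.match(f.name)
        if m and int(m.group(1)) <= restart_step:
            candidates[int(m.group(1))] = f
    return candidates[max(candidates)]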

Author

I will also see if I can migrate the config parameters into a member of the PovertyTrapModel, as opposed to copying the values of the config.
Copying the values means we cannot just save the model parameters at will, because we might have lost some (non-copied) parameters. Migrating to a Config member of PovertyTrapModel would mean we can save the complete config at will (and I can move saving the config into a separate method).
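
A rough sketch of that migration (the class and method names follow this PR; the structure is illustrative):

class PovertyTrapModel:
    def __init__(self, cfg):
        # Keep the parsed Config as a member instead of copying its values,
        # so no (non-copied) parameters can be lost along the way.
        self.cfg = cfg

    def save_config(self, path):
        # Saving the complete config at will becomes its own method.
        self.cfg.to_yaml(path)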

Author

As of commit bfbb990 these features have been implemented, together with some tests.

Author

TODO: add 'make unique' functionality for when restarting multiple runs from the same step.

elif isinstance(restart, int):
    self.inputs = _load_model(f'./{self._model_identifier}/milestone_{restart}')
elif isinstance(restart, tuple):
    self.inputs = _load_model(f'./{self._model_identifier}/milestone_{restart[0]}_{restart[1]}')
Member

I couldn't find where in the code the model state is saved in the directory f'./{self._model_identifier}/milestone_{restart[0]}_{restart[1]}'. This is also not tested.

Author

Based on our discussion, I will

  • add an mkdir command to _save_model so we have control over where any (milestone) state save creates a new directory.
  • add a test for the milestone_X_Y case

Author

As of commit e98f547 this has been implemented.

Comment on lines +403 to +404
if an int, the model is run from the first milestone at that step,
if a pair of ints, the model is run from that milestone at that step.
Member

It seems that the first case is a special instance of the second case, where the first item is 1, i.e. restart = (1, int). Can these two cases be merged to simplify the code?

Author

Based on our discussion, I will merge the int and tuple cases. The argument can then be a boolean (for running from a checkpoint) or a tuple (for running from a milestone).

@vanlankveldthijs (Author), Aug 8, 2024

Thinking about it again, it is slightly unclear what to do with the second value of the tuple. We could allow setting this to None, but then the desired milestone to load might either be

  • milestone 0 at the specified step count
  • the most recent milestone at the specified step count. I'm leaning towards this option.

Author

To add some context: the use case where a milestone with both step and increment would be important is when we want to compare different policy choices that run until the same step.
Imagine we have some starting state (say at step 500) and we want to compare policy choices A, B, and C after applying each for the same amount of time (say 500 steps). We would restart the model three times from the milestone at step 500 (each with different config parameters representing the different policies), run each until step 1000, and store the three final states. In this case, we must be able to store multiple milestones at the same time step (1000).
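
A hedged sketch of that workflow (the policy configuration step is hypothetical; the restart tuple follows the (step, instance) convention clarified further down):

for policy in ("policy_A", "policy_B", "policy_C"):
    model = PovertyTrapModel()  # configured with this policy's parameters
    model.step_target = 1000
    # resume from the shared milestone at step 500; each run then stores its
    # own milestone at step 1000, so those milestones must be distinguishable
    model.run(restart=(500, 1))  # instance numbering is illustrative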

Author

TODO: add logging message to state the milestone that will be used.
TODO: only keep the tuple case; remove the int case.

run the model for each step until the step_target is reached.

param: restart: boolean or int or a pair of ints, optional.
If True, the model is run from last checkpoint,
Member

We should be able to identify the "last checkpoint" by a "milestone" and "step" as (milestone, step), correct? This can be merged with the third case to simplify the code.

Author

Based on our discussion, there was some confusion here. The tuple case would be (step, instance), not (milestone, step). I'll try to clarify this in the doc string.
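
A possible docstring along those lines (the wording is illustrative):

def run(self, restart=False):
    """Run the model for each step until the step_target is reached.

    :param restart: False to start a fresh run, True to resume from the
        last checkpoint, or a (step, instance) tuple to resume from the
        milestone saved at `step` (`instance` distinguishes multiple
        milestones stored at the same step).
    """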

_save_model(f'./{self._model_identifier}', self.inputs)
if save_milestone:
    milestone_path = _make_path_unique(f'./{self._model_identifier}/milestone_{self.step_count}')
    _save_model(milestone_path, self.inputs)
Member

I couldn't find in the code where this directory milestone_path is actually created, and how the permission issue is handled.

Author

As stated above, I will add an mkdir command to _save_model so we have control over where any (milestone) state save creates a new directory.
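
A minimal sketch of that change (the body of the save step is elided; the mkdir is the point):

from pathlib import Path

def _save_model(path, inputs):
    # Create the target directory here, so every (milestone) state save
    # controls where its directory is created.
    Path(path).mkdir(parents=True, exist_ok=True)
    # ... write `inputs` (the model state) into files under `path` ...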

Comment on lines +450 to +456
def _make_path_unique(path):
    if Path(path).exists():
        incr = 1
        def add_incr(path, incr): return f'{path}_{incr}'
        while Path(add_incr(path, incr)).exists(): incr += 1
        path = add_incr(path, incr)
    return path
Member

Adding a number to the path is not informative. I suggest generating unique directories using a timestamp, as:

from datetime import datetime
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
unique_path = f"{path}_{timestamp}"

@vanlankveldthijs (Author), Aug 8, 2024

While this would make it more straightforward to make a unique path, I think this will also make it harder to identify the milestone when trying to load it. For example, if you're trying to continue from 'the second milestone' you would have to know (or look up) which timestamp this milestone has.

This might still be the best approach, or we could combine the concepts by storing milestones as {path}_{step}_{timestamp}.

Note that if we have any method for creating unique paths, we can always get the time at which that path was created from the OS.

Author

TODO: keep instance numbering instead of the timestamp
TODO: add docstring to _make_path_unique
TODO: rename incr to instance.

Author

TODO: add a (str) description field to the config file. Not to be used by the program, just for the researcher to describe the run or policy being tested. Add a comment describing this value.

Comment on lines +170 to +171
# Code version.
self.version = Path('version.md').read_text().splitlines()[0]
Member

I like the idea of saving the software version along with the model output, but how practical is it to use the git commit hash? It is not clear how this works in practice. For example, who runs the regen_version.sh file to store the commit hash, and how often is it run? Do the users use git to run experiments? Instead, we could use the software version specified in pyproject.toml, as version = "0.1.0".

Author

Based on our discussion, I will add some guidance for this in the CONTRIBUTING.md file.

In short, the version is stored in a tracked file. This makes it independent of the repository: the version file can be shipped with the built code (outside the repo) and the code would still work. However, tracking the file means it cannot use the 'current commit hash', because committing this version file (with the current hash) will create a newer hash.
Instead, we should run the regen_version.sh script before any release (or other 'important code version'), for example as part of the last commit before tagging for release. It may be possible to do this using GitHub Actions, or we may have to do it manually, in which case it should be very clear from the CONTRIBUTING.md.

I understand this may not be the ideal approach, but I could not find a better approach :(

Member

I also found that the relative path of Path('version.md') causes problems. Can you please make it absolute?

Author

I changed code version to process version to reflect that this captures other aspects of the process as well.

Tbh, I don't really know how best to get the path of the package root (as opposed to the model save directory). Will discuss.

Member

parent_path = Path(__file__).resolve().parents — parents is a sequence of the path's ancestor directories, see the pathlib doc.
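
For example, a hedged sketch of resolving version.md against the package root instead of the working directory (parents[1] is an assumption; pick the level that matches the package layout):

from pathlib import Path

PACKAGE_ROOT = Path(__file__).resolve().parents[1]
version = (PACKAGE_ROOT / 'version.md').read_text().splitlines()[0]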

model.step_target = 3 # only run the model till step 3
model.run()
expected_generator_state = set(model.inputs["generator_state"].tolist())

Member

Can we add some checks here before restarting the model?

Author

Which checks do you think are needed?

Author

TODO: before the second run, check that the correct config file is there and that the model has been saved.

@@ -0,0 +1,3 @@
#!/bin/bash
Member

Can we add some documentation in this file, or in an md file in a docs folder?

Author

The current md file in the docs folder is about running on Snellius. I think it's better to add some documentation to the CONTRIBUTING.md file.

Comment on lines +216 to +217
checkpoint_period: int = 10
milestones: Optional[List[PositiveInt]] = None
Member

Can we add documentation/definitions for these two parameters somewhere?
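
For instance, hedged definitions along these lines (the wording is illustrative; the field names and types come from the diff above, whose PositiveInt suggests a pydantic model):

from typing import List, Optional
from pydantic import PositiveInt  # assumption: PositiveInt comes from pydantic

checkpoint_period: int = 10
"""Save a checkpoint every `checkpoint_period` steps; non-positive values disable checkpointing."""
milestones: Optional[List[PositiveInt]] = None
"""Steps at which to store a restartable milestone, or None to store no milestones."""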

assert model.inputs["step_count"] == 1
assert model.step_count == 5
assert stored_generator_state == expected_generator_state

Member

Tests are missing for the case isinstance(restart, tuple) e.g. model.run(restart=(2, 3)).
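
A skeletal version of such a test (the fixture and the assertion are illustrative; the tuple semantics follow the (step, instance) convention above):

def test_restart_from_milestone_instance(model):
    # resume from the milestone saved at step 2, instance 3
    model.run(restart=(2, 3))
    # illustrative check that the restored state reflects the milestone step
    assert model.inputs["step_count"] == 2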

@SarahAlidoost (Member) left a comment

@vanlankveldthijs nice work on adding the functionality 👍 I added some comments on the changed files, mainly about tests. Perhaps the tests can be added in a follow-up pull request.

@vanlankveldthijs (Author)

All comments should now be processed. Pull request can be merged when reviewers agree.

@meiertgrootes (Member) left a comment

Nothing to add beyond our discussion
