
53 save model state customization #100

Merged: 20 commits merged into development on Aug 5, 2024

Conversation

@vanlankveldthijs

Resolves #53, #66 and #68

Adds option to also save model state at pre-specified time steps (milestones) and to restart from a stored milestone.

Adds tests for saving milestone and restarting from milestone.

Also has some related minor implementations:

  • moved the savestate param into the config and renamed it checkpoint_period
  • stored the config by step count as well (for tracking policy changes)
  • added a code version ID (a specified commit hash) to save states
  • moved the restart param from model init to the run method
  • changed the restart param to accept a boolean (restart from the last checkpoint), an int (restart from a milestone), or a tuple of ints (restart from a milestone at a duplicated step); see the usage sketch below
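
A hedged usage sketch of the three restart forms (the model and method names follow this PR; the step values are illustrative):

model.run(restart=True)      # resume from the last checkpoint
model.run(restart=500)       # resume from the milestone stored at step 500
model.run(restart=(500, 2))  # resume from a duplicated milestone: step 500, instance 2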

thijsvl and others added 14 commits July 25, 2024 15:56
When initializing a model from a checkpoint or milestone, the file
with step count will store updated config parameters.

The file without step count will always store the latest config
parameters.
The version ID is the git commit hash.

This script can either be set to run after every checkout,
or (more likely) on every merge to the development branch.
On loading a model, the stored code version is compared to the
current code version; if they differ, a warning is logged.
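
A minimal sketch of that load-time check (the stored-state key is hypothetical; logger and self.version appear elsewhere in this PR):

stored_version = saved_state["version"]  # hypothetical key holding the stored commit hash
if stored_version != self.version:
    logger.warning(f"Model was saved with code version {stored_version}, "
                   f"but the current code version is {self.version}.")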
@vmgaribay

Hey, @vanlankveldthijs! (1) Do I understand that, with this update, to turn the model state saving feature off, you set the checkpoint period variable to 0? (2) Or will doing that cause problems with self.step_count % self.checkpoint_period?

If the answer to 1 is no and/or 2 is yes:
I have so far been setting savestate to a number higher than the step target on occasions when I don't want to save the whole model. It would be really nice if this new param checkpoint_period could be made Optional[int]=10 and there was a case to handle None.
else: cool!

@vanlankveldthijs (Author)


Correct: if checkpoint_period is set to any non-positive value (0 or a negative value), no checkpoints are saved.

If you prefer, I could still make the parameter optional, so None would have the same effect as a non-positive value.

However, I would not recommend actually running without ever saving a checkpoint when running the simulation for any extended period of time. If the simulation ever crashes for whatever reason, you wouldn't want to have to redo hours of processing if you could've saved every 100 or 1000 steps.
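
A minimal sketch of the guard (the helper name is hypothetical), which would also cover the proposed Optional[int] variant:

from typing import Optional

def should_checkpoint(step_count: int, checkpoint_period: Optional[int]) -> bool:
    # None or a non-positive period disables checkpointing, which also
    # avoids the step_count % checkpoint_period division-by-zero concern.
    if checkpoint_period is None or checkpoint_period <= 0:
        return False
    return step_count % checkpoint_period == 0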

@vanlankveldthijs (Author)

Some of the 'minor implementations' also resolve #52

@vanlankveldthijs force-pushed the 53_save_model_state_customization branch from 3faa497 to 994192c on August 5, 2024 12:16
@vmgaribay merged commit 81c6574 into development on Aug 5, 2024
4 checks passed
@vanlankveldthijs (Author)

@vmgaribay why did you merge this into development? It has not been reviewed at all.

Comment on lines +223 to +228
# save updated config to yaml files
cfg_filename = parent_dir / f'{self._model_identifier}.yaml'
cfg.to_yaml(cfg_filename)
logger.warning(f'The model parameters are saved to {cfg_filename}.')
cfg_filename_step = parent_dir / f'{self._model_identifier}_{self.step_count}.yaml'
cfg.to_yaml(cfg_filename_step)
logger.warning(f'The model parameters are saved to {cfg_filename} and {cfg_filename_step}.')
Member

It is not clear why the configurations (cfg) are stored twice, i.e. cfg_filename and cfg_filename_step. The function set_model_parameters is usually run once at the beginning when step_count is 0. Also, tests are not added in test_model.py::test_set_model_parameters to verify these changes.

Author

Based on our discussion, I will

  • remove the part where it stores in cfg_filename, only storing in cfg_filename_step.
  • when continuing from a stored model, find the config file with the highest step count that does not exceed the step where the process will continue (see the sketch below).
  • add tests for all of this functionality.
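
A hedged sketch of the lookup in the second bullet (the {identifier}_{step}.yaml naming follows the diff above; the helper itself is hypothetical):

import re
from pathlib import Path

def find_config_for_step(parent_dir: Path, identifier: str, restart_step: int) -> Path:
    # Among the stored configs, pick the one with the highest step count
    # that does not exceed the step where the run will continue.
    pattern = re.compile(rf"{re.escape(identifier)}_(\d+)\.yaml$")
    candidates = {}
    for f in parent_dir.glob(f"{identifier}_*.yaml"):
        m = pattern.match(f.name)
        if m and int(m.group(1)) <= restart_step:
            candidates[int(m.group(1))] = f
    return candidates[max(candidates)]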

Author

I will also see if I can migrate the config parameters into a member of the PovertyTrapModel, as opposed to copying the values of the config.
Copying the values means we cannot just save the model parameters at will, because we might have lost some (non-copied) parameters. Migrating to a Config member of PovertyTrapModel would mean we can save the complete config at will (and I can move saving the config into a separate method).
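
A rough sketch of that migration (the class and method names follow this PR; the structure is illustrative):

class PovertyTrapModel:
    def __init__(self, cfg):
        # Keep the parsed Config as a member instead of copying its values,
        # so no (non-copied) parameters can be lost along the way.
        self.cfg = cfg

    def save_config(self, path):
        # Saving the complete config at will becomes its own method.
        self.cfg.to_yaml(path)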

Author

As of commit bfbb990 these features have been implemented, together with some tests.

Author

TODO: add 'make unique' functionality for when restarting multiple runs from the same step.

elif isinstance(restart, int):
    self.inputs = _load_model(f'./{self._model_identifier}/milestone_{restart}')
elif isinstance(restart, tuple):
    self.inputs = _load_model(f'./{self._model_identifier}/milestone_{restart[0]}_{restart[1]}')
Member

I couldn't find where in the code the model state is saved in the directory f'./{self._model_identifier}/milestone_{restart[0]}_{restart[1]}'. This is also not tested.

Author

Based on our discussion, I will

  • add an mkdir command to _save_model so we have control over where any (milestone) state save creates a new directory.
  • add a test for the milestone_X_Y case

Author

As of commit e98f547 this has been implemented.

Comment on lines +403 to +404
if an int, the model is run from the first milestone at that step,
if a pair of ints, the model is run from that milestone at that step.
Member

It seems that the first case is a special instance of the second case, where the first item is 1, i.e. restart = (1, int). Can these two cases be merged to simplify the code?

Author

Based on our discussion, I will merge the int and tuple cases. The argument can then be a boolean (for running from a checkpoint) or a tuple (for running from a milestone).

@vanlankveldthijs (Author), Aug 8, 2024

Thinking about it again, it is slightly unclear what to do with the second value of the tuple. We could allow setting this to None, but then the desired milestone to load might either be

  • milestone 0 at the specified step count
  • the most recent milestone at the specified step count. I'm leaning towards this option.

Author

To add some context: the use case where a milestone with both step and increment would be important is when we want to compare different policy choices that run until the same step.
Imagine we have some starting state (say at step 500) and we want to compare policy choices A, B, and C after applying each for the same amount of time (say 500 steps). We would restart the model three times from the milestone at step 500 (each with different config parameters representing the different policies), run each until step 1000, and store the three final states. In this case, we must be able to store multiple milestones at the same time step (1000).
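
A hedged sketch of that workflow (the policy configuration step is hypothetical; the restart tuple follows the (step, instance) convention clarified further down):

for policy in ("policy_A", "policy_B", "policy_C"):
    model = PovertyTrapModel()  # configured with this policy's parameters
    model.step_target = 1000
    # resume from the shared milestone at step 500; each run then stores its
    # own milestone at step 1000, so those milestones must be distinguishable
    model.run(restart=(500, 1))  # instance numbering is illustrative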

Author

TODO: add logging message to state the milestone that will be used.
TODO: only keep the tuple case; remove the int case.

run the model for each step until the step_target is reached.

param: restart: boolean or int or a pair of ints, optional.
If True, the model is run from last checkpoint,
Member

We should be able to identify the "last checkpoint" by a "milestone" and "step" as (milestone, step), correct? This can be merged with the third case to simplify the code.

Author

Based on our discussion, there was some confusion here. The tuple case would be (step, instance), not (milestone, step). I'll try to clarify this in the doc string.
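
A possible docstring along those lines (the wording is illustrative):

def run(self, restart=False):
    """Run the model for each step until the step_target is reached.

    :param restart: False to start a fresh run, True to resume from the
        last checkpoint, or a (step, instance) tuple to resume from the
        milestone saved at `step` (`instance` distinguishes multiple
        milestones stored at the same step).
    """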

_save_model(f'./{self._model_identifier}', self.inputs)
if save_milestone:
    milestone_path = _make_path_unique(f'./{self._model_identifier}/milestone_{self.step_count}')
    _save_model(milestone_path, self.inputs)
Member

I couldn't find in the code where this directory milestone_path is actually created, and how the permission issue is handled.

Author

As stated above, I will add an mkdir command to _save_model so we have control over where any (milestone) state save creates a new directory.
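
A minimal sketch of that change (the body of the save step is elided; the mkdir is the point):

from pathlib import Path

def _save_model(path, inputs):
    # Create the target directory here, so every (milestone) state save
    # controls where its directory is created.
    Path(path).mkdir(parents=True, exist_ok=True)
    # ... write `inputs` (the model state) into files under `path` ...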

Comment on lines +450 to +456
def _make_path_unique(path):
    if Path(path).exists():
        incr = 1
        def add_incr(path, incr): return f'{path}_{incr}'
        while Path(add_incr(path, incr)).exists(): incr += 1
        path = add_incr(path, incr)
    return path
Member

Adding a number to the path is not informative. I suggest generating unique directories using a timestamp, as:

from datetime import datetime
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
unique_path = f"{path}_{timestamp}"

@vanlankveldthijs (Author), Aug 8, 2024

While this would make it more straightforward to make a unique path, I think this will also make it harder to identify the milestone when trying to load it. For example, if you're trying to continue from 'the second milestone' you would have to know (or look up) which timestamp this milestone has.

This might still be the best approach, or we could combine the concepts by storing milestones as {path}_{step}_{timestamp}.

Note that if we have any method for creating unique paths, we can always get the time at which that path was created from the OS.

Author

TODO: keep instance numbering instead of the timestamp
TODO: add docstring to _make_path_unique
TODO: rename incr to instance.

Author

TODO: add a (str) description field to the config file. Not to be used by the program, just for the researcher to describe the run or policy being tested. Add a comment describing this value.

Comment on lines +170 to +171
# Code version.
self.version = Path('version.md').read_text().splitlines()[0]
Member

I like the idea of saving the software version along with the model output, but how practical is it to use the git commit hash? It is not clear how this works in practice. For example, who runs the regen_version.sh file to store the commit hash, and how often is it run? Do the users use git to run experiments? Instead, we could use the software version specified in pyproject.toml, as version = "0.1.0".

Author

Based on our discussion, I will add some guidance for this in the CONTRIBUTING.md file.

In short, the version is stored in a tracked file. This makes it independent of the repository: the version file can be shipped with the built code (outside the repo) and the code would still work. However, tracking the file means it cannot use the 'current commit hash', because committing this version file (with the current hash) will create a newer hash.
Instead, we should run the regen_version.sh script before any release (or other 'important code version'), for example as part of the last commit before tagging for release. It may be possible to do this using GitHub Actions, or we may have to do it manually, in which case it should be very clear from the CONTRIBUTING.md.

I understand this may not be the ideal approach, but I could not find a better approach :(

Member

I also found that the relative path of Path('version.md') causes problems. Can you please make it absolute?

Author

I changed code version to process version to reflect that this captures other aspects of the process as well.

Tbh, I don't really know how best to get the path of the package root (as opposed to the model save directory). Will discuss.

Member

parent_path = Path(__file__).resolve().parents — parents is a sequence of the path's ancestor directories, see the pathlib doc.
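
For example, a hedged sketch of resolving version.md against the package root instead of the working directory (parents[1] is an assumption; pick the level that matches the package layout):

from pathlib import Path

PACKAGE_ROOT = Path(__file__).resolve().parents[1]
version = (PACKAGE_ROOT / 'version.md').read_text().splitlines()[0]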

model.step_target = 3 # only run the model till step 3
model.run()
expected_generator_state = set(model.inputs["generator_state"].tolist())

Member

Can we add some checks here before restarting the model?

Author

Which checks do you think are needed?

Author

TODO: before the second run, check that the correct config file is there and that the model has been saved.

@@ -0,0 +1,3 @@
#!/bin/bash
Member

Can we add some documentation in this file, or in an md file in a docs folder?

Author

The current md file in the docs folder is about running on Snellius. I think it's better to add some documentation to the CONTRIBUTING.md file.

Comment on lines +216 to +217
checkpoint_period: int = 10
milestones: Optional[List[PositiveInt]] = None
Member

Can we add documentation/definitions for these two parameters somewhere?
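
For instance, hedged definitions along these lines (the wording is illustrative; the field names and types come from the diff above, whose PositiveInt suggests a pydantic model):

from typing import List, Optional
from pydantic import PositiveInt  # assumption: PositiveInt comes from pydantic

checkpoint_period: int = 10
"""Save a checkpoint every `checkpoint_period` steps; non-positive values disable checkpointing."""
milestones: Optional[List[PositiveInt]] = None
"""Steps at which to store a restartable milestone, or None to store no milestones."""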

assert model.inputs["step_count"] == 1
assert model.step_count == 5
assert stored_generator_state == expected_generator_state

Member

Tests are missing for the case isinstance(restart, tuple) e.g. model.run(restart=(2, 3)).
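
A skeletal version of such a test (the fixture and the assertion are illustrative; the tuple semantics follow the (step, instance) convention above):

def test_restart_from_milestone_instance(model):
    # resume from the milestone saved at step 2, instance 3
    model.run(restart=(2, 3))
    # illustrative check that the restored state reflects the milestone step
    assert model.inputs["step_count"] == 2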

@SarahAlidoost (Member) left a comment

@vanlankveldthijs nice work on adding the functionality 👍 I added some comments on the changed files, mainly about tests. Perhaps the tests can be added in a follow-up pull request.

@vanlankveldthijs (Author)

All comments should now be processed. Pull request can be merged when reviewers agree.

@meiertgrootes (Member) left a comment

Nothing to add beyond our discussion
