53 save model state customization #100
Conversation
When initializing a model from a checkpoint or milestone, the file with step count will store updated config parameters. The file without step count will always store the latest config parameters.
The version ID is the git commit hash. This script can either be set to run after every checkout, or (more likely) on every merge to the development branch.
On loading a model, the stored code version is compared to the current code version; if they differ, a warning is logged.
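A minimal sketch of that load-time comparison (the function name and logger setup are illustrative, not the actual code in this PR):

```python
import logging

logger = logging.getLogger(__name__)

def versions_match(stored_version: str, current_version: str) -> bool:
    # Compare the code version recorded in the saved state with the
    # version of the running code; a mismatch is only warned about,
    # never fatal, so old saves remain loadable.
    if stored_version != current_version:
        logger.warning(
            f'Model was saved with code version {stored_version}, '
            f'but the current code version is {current_version}.')
    return stored_version == current_version
```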
Hey, @vanlankveldthijs! (1) Do I understand correctly that, with this update, to turn the model state saving feature off, you set the checkpoint period variable to 0? (2) Or will doing that cause problems with `self.step_count % self.checkpoint_period`? If the answer to 1 is no and/or 2 is yes:
Correct, if the `checkpoint_period` is set to any non-positive value (either 0 or a negative value), no checkpoints are saved. If you prefer, I could still make the parameter optional, so None would have the same effect as a non-positive value. However, I would not recommend running without ever saving a checkpoint for any extended period of time. If the simulation ever crashes for whatever reason, you wouldn't want to have to redo hours of processing if you could have saved every 100 or 1000 steps.
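A minimal sketch of such a guard (the function name is illustrative; the model presumably performs this check inline):

```python
def should_checkpoint(step_count: int, checkpoint_period: int) -> bool:
    # A non-positive period disables checkpointing entirely, and also
    # avoids a modulo-by-zero in step_count % checkpoint_period.
    if checkpoint_period <= 0:
        return False
    return step_count % checkpoint_period == 0
```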
Some of the 'minor implementations' also resolve #52
@vmgaribay why did you merge this into development? It has not been reviewed at all.
```python
# save updated config to yaml files
cfg_filename = parent_dir / f'{self._model_identifier}.yaml'
cfg.to_yaml(cfg_filename)
logger.warning(f'The model parameters are saved to {cfg_filename}.')
cfg_filename_step = parent_dir / f'{self._model_identifier}_{self.step_count}.yaml'
cfg.to_yaml(cfg_filename_step)
logger.warning(f'The model parameters are saved to {cfg_filename} and {cfg_filename_step}.')
```
It is not clear why the configurations (`cfg`) are stored twice, i.e. to `cfg_filename` and `cfg_filename_step`. The function `set_model_parameters` is usually run once at the beginning, when `step_count` is 0. Also, tests are not added in `test_model.py::test_set_model_parameters` to verify these changes.
Based on our discussion, I will

- remove the part where it stores in `cfg_filename`, only storing in `cfg_filename_step`.
- when continuing from a stored model, find the config file that has the highest step count that is not over the step where the process will continue.
- add tests to test all this functionality.
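A hedged sketch of that lookup (the helper name and the file layout are assumptions based on the `{model_identifier}_{step_count}.yaml` naming scheme above):

```python
import re
from pathlib import Path

def latest_config_before(parent_dir, model_identifier, restart_step):
    """Return the config file with the highest step count that does not
    exceed the step from which the run will continue, or None."""
    pattern = re.compile(rf'{re.escape(model_identifier)}_(\d+)\.yaml$')
    best, best_step = None, -1
    for cfg_file in Path(parent_dir).glob(f'{model_identifier}_*.yaml'):
        match = pattern.search(cfg_file.name)
        if match:
            step = int(match.group(1))
            if best_step < step <= restart_step:
                best, best_step = cfg_file, step
    return best
```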
I will also see if I can migrate the config parameters into a member of the PovertyTrapModel, as opposed to copying the values of the config.
Copying the values means we cannot just save the model parameters at will, because we might have lost some (non-copied) parameters. Migrating to a Config member of PovertyTrapModel would mean we can save the complete config at will (and I can move saving the config into a separate method).
As of commit bfbb990 these features have been implemented, together with some tests.
TODO: add 'make unique' functionality for when restarting multiple runs from the same step.
```python
elif isinstance(restart, int):
    self.inputs = _load_model(f'./{self._model_identifier}/milestone_{restart}')
elif isinstance(restart, tuple):
    self.inputs = _load_model(f'./{self._model_identifier}/milestone_{restart[0]}_{restart[1]}')
```
I couldn't find where in the code the model state is saved in the directory `f'./{self._model_identifier}/milestone_{restart[0]}_{restart[1]}'`. This is also not tested.
Based on our discussion, I will

- add an mkdir command to `_save_model` so we have control over where any (milestone) state save creates a new directory.
- add a test for the milestone_X_Y case.
As of commit e98f547 this has been implemented.
```
if an int, the model is run from the first milestone at that step,
if a pair of ints, the model is run from that milestone at that step.
```
It seems that the first case is a special instance of the second case, where the first item is 1, i.e. restart = (1, int). Can these two cases be merged to simplify the code?
Based on our discussion, I will merge the int and tuple cases. The argument can then be a boolean (for running from a checkpoint) or a tuple (for running from a milestone).
Thinking about it again, it is slightly unclear what to do with the second value of the tuple. We could allow setting this to None, but then the desired milestone to load might be either

- milestone 0 at the specified step count, or
- the most recent milestone at the specified step count. I'm leaning towards this option.
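A sketch of the 'most recent milestone' option, assuming milestone directories are named `milestone_{step}` with later instances suffixed `_{n}` (as `_make_path_unique` produces):

```python
from pathlib import Path

def most_recent_milestone(model_dir, step):
    # Collect milestone_{step} (treated as instance 0) and any
    # milestone_{step}_{n} directories, and return the one with the
    # highest instance number, or None when nothing matches.
    base = Path(model_dir) / f'milestone_{step}'
    candidates = [(0, base)] if base.exists() else []
    for path in Path(model_dir).glob(f'milestone_{step}_*'):
        suffix = path.name.rsplit('_', 1)[-1]
        if suffix.isdigit():
            candidates.append((int(suffix), path))
    return max(candidates)[1] if candidates else None
```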
To add some context: the use case where a milestone with both step and increment would be important is if we want to compare different policy choices that run until the same step.
Imagine we have some starting state (say at step 500) and we want to compare policy choices A, B, and C after applying each for the same amount of time (say 500 steps). We would restart the model three times from a milestone at step 500 (each with different config parameters representing the different policies), run until step 1000 each time, and store the final state. In this case, we must be able to store multiple milestones at the same time step (1000).
TODO: add logging message to state the milestone that will be used.
TODO: only keep the tuple case; remove the int case.
```
run the model for each step until the step_target is reached.

param: restart: boolean or int or a pair of ints, optional.
    If True, the model is run from last checkpoint,
```
We should be able to identify the "last checkpoint" by a "milestone" and "step" as (milestone, step), correct? This can be merged with the third case to simplify the code.
Based on our discussion, there was some confusion here. The tuple case would be (step, instance), not (milestone, step). I'll try to clarify this in the doc string.
```python
_save_model(f'./{self._model_identifier}', self.inputs)
if save_milestone:
    milestone_path = _make_path_unique(f'./{self._model_identifier}/milestone_{self.step_count}')
    _save_model(milestone_path, self.inputs)
```
I couldn't find in the code where this directory `milestone_path` is actually created, and how the permission issue is handled.
As stated above, I will add an mkdir command to _save_model so we have control over where any (milestone) state save creates a new directory.
```python
def _make_path_unique(path):
    if Path(path).exists():
        incr = 1
        def add_incr(path, incr): return f'{path}_{incr}'
        while Path(add_incr(path, incr)).exists(): incr += 1
        path = add_incr(path, incr)
    return path
```
Adding a number to the path is not informative. I suggest generating unique directories using a timestamp:

```python
from datetime import datetime
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
unique_path = f"{path}_{timestamp}"
```
While this would make it more straightforward to make a unique path, I think this will also make it harder to identify the milestone when trying to load it. For example, if you're trying to continue from 'the second milestone' you would have to know (or look up) which timestamp this milestone has.
This might still be the best approach, or we could combine the concepts by storing milestones as `{path}_{step}_{timestamp}`.
Note that if we have any method for creating unique paths, we can always get the time at which that path was created from the OS.
TODO: keep instance numbering instead of the timestamp
TODO: add docstring to _make_path_unique
TODO: rename incr to instance.
TODO: add (str) description to config file. Not to be used by the program, but just for the researcher to describe the run or policy to test. Add comment to describe this value.
```python
# Code version.
self.version = Path('version.md').read_text().splitlines()[0]
```
I like the idea of saving the software version along with the model output, but how practical is it to use the git commit hash? It is not clear how this works in practice. For example, who runs the `regen_version.sh` file to store the commit hash, and how often is it run? Do the users use git to run experiments? Instead, we could use the software version specified in `pyproject.toml`, e.g. `version = "0.1.0"`.
Based on our discussion, I will add some guidance for this in the CONTRIBUTING.md file.
In short, the version is stored in a tracked file. This makes it independent of the repository: the version file can be shipped with the built code (outside the repo) and the code would still work. However, tracking the file means it cannot use the 'current commit hash', because committing this version file (with the current hash) would create a newer hash.
Instead, we should run the `regen_version.sh` script before any release (or other 'important code version'), for example as part of the latest commit before tagging for release. It may be possible to do this using GitHub Actions, or we may have to do it manually, in which case it should be made very clear in the CONTRIBUTING.md.
I understand this may not be the ideal approach, but I could not find a better one :(
I also found that the relative path of `Path('version.md')` causes problems. Can you please make it absolute?
I changed code version to process version to reflect that this captures other aspects of the process as well.
Tbh, I don't really know how best to get the path of the package root (as opposed to the model save directory). Will discuss.
> Tbh, I don't really know how best to get the path of the package root (as opposed to the model save directory). Will discuss.

You can use `parent_path = Path(__file__).resolve().parents`; `parents` is a list, see the doc.
```python
model.step_target = 3  # only run the model till step 3
model.run()
expected_generator_state = set(model.inputs["generator_state"].tolist())
```
Can we add some checks here before restarting the model?
Which checks do you think are needed?
TODO: before the second run, check that the correct config file is there and whether the model is saved.
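A minimal sketch of such a check (the path layout is an assumption; adjust to wherever the config and model state are actually written):

```python
from pathlib import Path

def saved_state_present(model_dir, cfg_file) -> bool:
    # Before restarting, verify that the saved model directory and the
    # step-stamped config file both exist.
    return Path(model_dir).is_dir() and Path(cfg_file).is_file()
```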
```bash
#!/bin/bash
```
Can we add some documentation in this file, or in an md file in a docs folder?
The current md file in the docs folder is about running on Snellius. I think it's better to add some documentation to the CONTRIBUTING.md file.
```python
checkpoint_period: int = 10
milestones: Optional[List[PositiveInt]] = None
```
Can we add documentation/definitions for these two parameters somewhere?
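One way to document them, sketched here as a plain dataclass (the real config class uses pydantic types, so this is only illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SaveSettings:
    """Save-related model parameters.

    checkpoint_period: save a checkpoint every N steps; a non-positive
        value disables checkpoint saving.
    milestones: step counts at which to additionally save a milestone;
        None means no milestones are saved.
    """
    checkpoint_period: int = 10
    milestones: Optional[List[int]] = None
```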
```python
assert model.inputs["step_count"] == 1
assert model.step_count == 5
assert stored_generator_state == expected_generator_state
```
Tests are missing for the case `isinstance(restart, tuple)`, e.g. `model.run(restart=(2, 3))`.
@vanlankveldthijs nice work on adding the functionality 👍 I added some comments on the changed files, mainly about tests. Perhaps the tests can be added in a follow-up pull request.
All comments should now be processed. Pull request can be merged when reviewers agree.
Nothing to add beyond our discussion
Resolves #53, #66 and #68
Adds option to also save model state at pre-specified time steps (milestones) and to restart from a stored milestone.
Adds tests for saving milestone and restarting from milestone.
Also has some related minor implementations: