Add save and export to training servicer #228
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
- Coverage   63.63%   62.57%   -1.07%
==========================================
  Files          47       47
  Lines        2742     2840      +98
==========================================
+ Hits         1745     1777      +32
- Misses        997     1063      +66
- Whether the model was initially paused or running, a save retains that state after completion; training is paused only temporarily to perform the save.
- The export pauses the training if it was not already paused.
file_path = file_path.with_suffix(".pth")

state = {
    "num_epochs": self.num_epochs + 1,
Is there a reason for the +1? Maybe it deserves a comment. PyTorch itself seems to save just the epoch: https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training
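One possible reading of the `+ 1`, sketched under the assumption that the stored value is a zero-based index of the epoch that just finished: saving the index plus one records the next epoch to run, so a resumed run does not repeat an epoch. The function `train` and the `"epoch"` key are hypothetical and are not taken from tiktorch or PyTorch.

```python
def train(total_epochs, start_epoch=0):
    """Run epochs start_epoch..total_epochs-1.

    Returns the list of epoch indices that were run and the last checkpoint.
    Hypothetical helper for illustration only.
    """
    visited = []
    checkpoint = {"epoch": start_epoch}
    for epoch in range(start_epoch, total_epochs):
        visited.append(epoch)  # placeholder for one epoch of training
        # Assumption: store the index of the NEXT epoch to run,
        # which equals the number of completed epochs.
        checkpoint = {"epoch": epoch + 1}
    return visited, checkpoint


# First run covers epochs 0, 1, 2; the checkpoint then points at epoch 3.
first, ckpt = train(total_epochs=3)
# Resuming from the checkpoint continues at epoch 3 without repeating epoch 2.
resumed, _ = train(total_epochs=5, start_epoch=ckpt["epoch"])
```

If the plain index were stored instead, the resume call would need `start_epoch=ckpt["epoch"] + 1`; either convention works as long as a comment states which one is used.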
tiktorch/trainer.py
Outdated
"""
On demand save of the training progress including the optimizer state.

Note: pytorch-3dunet automatically saves the checkpoints in intervals defined by the `validation_after_iters`.
The wording here is a bit unclear to me. As far as I remember, pytorch-3dunet keeps the "latest" and "best" checkpoints. This sounds a bit as if a history were retained (or is that actually the case)?
It builds upon #227.
Adds options to save and export a model to the training service.
When we save, the training is paused, and then we resume. For the export, we just pause.
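The save and export behaviour described here could look roughly like the sketch below. It assumes the save restores whatever state the trainer was in beforehand (running or paused), while the export leaves it paused. `TrainerSketch`, `State`, and the method names are hypothetical stand-ins, not the actual tiktorch servicer or trainer API, and the file writes are replaced by placeholders.

```python
import enum
import threading


class State(enum.Enum):
    RUNNING = "running"
    PAUSED = "paused"


class TrainerSketch:
    """Hypothetical stand-in for the trainer; not the tiktorch API."""

    def __init__(self, state=State.RUNNING):
        self.state = state
        self.saved = []     # paths "written" by save()
        self.exported = []  # paths "written" by export()
        self._lock = threading.Lock()

    def pause(self):
        self.state = State.PAUSED

    def resume(self):
        self.state = State.RUNNING

    def save(self, path):
        # Pause for the duration of the save, then restore the prior state.
        with self._lock:
            previous = self.state
            self.pause()
            try:
                self.saved.append(path)  # placeholder for writing a checkpoint
            finally:
                if previous is State.RUNNING:
                    self.resume()

    def export(self, path):
        # Pause (a no-op if already paused) and stay paused after exporting.
        with self._lock:
            self.pause()
            self.exported.append(path)  # placeholder for writing the export


running = TrainerSketch(State.RUNNING)
running.save("checkpoint.pth")
state_after_save = running.state

paused = TrainerSketch(State.PAUSED)
paused.save("checkpoint.pth")
paused_after_save = paused.state

running.export("model.zip")
state_after_export = running.state
```

Restoring the previous state in a `finally` block means a failed save does not leave a previously running training paused.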
TODO: