Bug: Failure in Training due to Empty Validation Metrics Dictionary
Description
When the run_validation_freq parameter in training_params of Trainer is set to a value greater than 1 (e.g., run_validation_freq: 5), the training process fails. This appears to be due to an issue with the validation metrics dictionary (valid_metrics_dict).
Details
The valid_metrics_dict is initialized as an empty dictionary at the start of every epoch and is only filled when (epoch + 1) % run_validation_freq == 0. On every other epoch, starting with the first one (where (epoch + 1) < run_validation_freq), the dictionary remains empty and is passed to the _write_to_disk_operations function. This results in an error when _save_checkpoint accesses validation_results_dict[self.metric_to_watch], because validation_results_dict is empty and the key does not exist.
Here is the problematic code:
    # RUN TEST ON VALIDATION SET EVERY self.run_validation_freq EPOCHS
    valid_metrics_dict = {}
    if (epoch + 1) % self.run_validation_freq == 0:
        ...
        valid_metrics_dict = self._validate_epoch(context=context, silent_mode=silent_mode)
        ...
    ...
    self._write_to_disk_operations(
        train_metrics_dict=train_metrics_dict,
        validation_results_dict=valid_metrics_dict,
        ...
    )
    ...
    ...
    def _write_to_disk_operations(
        self,
        train_metrics_dict: dict,
        validation_results_dict: dict,
        test_metrics_dict: dict,
        lr_dict: dict,
        inf_time: float,
        epoch: int,
        context: PhaseContext,
    ):
        ...
        # SAVE THE CHECKPOINT
        if self.training_params.save_model:
            self._save_checkpoint(self.optimizer, epoch + 1, validation_results_dict, context)
        ...
    def _save_checkpoint(
        self,
        optimizer: torch.optim.Optimizer = None,
        epoch: int = None,
        validation_results_dict: Optional[Dict[str, float]] = None,
        context: PhaseContext = None,
    ) -> None:
        ...
        metric = validation_results_dict[self.metric_to_watch]
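The control flow above can be reproduced outside the trainer. The following is only a minimal sketch, not the library's code: `validate_epoch`, `save_checkpoint`, `run_epochs`, and the metric name `"mAP"` are hypothetical stand-ins chosen to mimic the snippet, on the assumption that the metrics dictionary is a plain `dict`.

```python
from typing import Dict, List

# Hypothetical metric name; stands in for self.metric_to_watch.
METRIC_TO_WATCH = "mAP"


def validate_epoch() -> Dict[str, float]:
    """Stand-in for _validate_epoch: returns a populated metrics dict."""
    return {METRIC_TO_WATCH: 0.5}


def save_checkpoint(validation_results_dict: Dict[str, float]) -> float:
    """Stand-in for _save_checkpoint: indexes the dict unconditionally."""
    return validation_results_dict[METRIC_TO_WATCH]


def run_epochs(max_epochs: int, run_validation_freq: int) -> List[str]:
    """Validate only every run_validation_freq epochs, but save on every epoch."""
    outcomes = []
    for epoch in range(max_epochs):
        valid_metrics_dict: Dict[str, float] = {}
        if (epoch + 1) % run_validation_freq == 0:
            valid_metrics_dict = validate_epoch()
        try:
            save_checkpoint(valid_metrics_dict)
            outcomes.append("saved")
        except KeyError:
            # An empty dict has no metric_to_watch key.
            outcomes.append("KeyError")
    return outcomes


if __name__ == "__main__":
    print(run_epochs(max_epochs=5, run_validation_freq=5))
```

In this sketch the lookup fails on every epoch that skips validation, which matches the report that training breaks as soon as the first checkpoint is saved.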
### Versions
Set run_validation_freq > 1 in the training parameters.
Run the training process.
Observe the error when the training tries to save a checkpoint.
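One possible way to avoid the crash is to skip the metric comparison on epochs where no validation ran. The sketch below is an assumption about how such a guard could look, not the project's actual fix: `CheckpointSaver` and its `saved` list are hypothetical, and it assumes a greater-is-better metric.

```python
from typing import Dict, List, Optional


class CheckpointSaver:
    """Hypothetical stand-in for the checkpoint logic; not the library's Trainer."""

    def __init__(self, metric_to_watch: str) -> None:
        self.metric_to_watch = metric_to_watch
        self.best_metric: Optional[float] = None
        self.saved: List[str] = []  # names of checkpoints "written", for illustration

    def save_checkpoint(
        self,
        epoch: int,
        validation_results_dict: Optional[Dict[str, float]] = None,
    ) -> None:
        # The latest checkpoint does not depend on validation metrics.
        self.saved.append(f"latest@{epoch}")

        # Guard: no validation ran this epoch, so skip best-checkpoint bookkeeping.
        if not validation_results_dict or self.metric_to_watch not in validation_results_dict:
            return

        metric = validation_results_dict[self.metric_to_watch]
        # Assumption: a larger metric value is better.
        if self.best_metric is None or metric > self.best_metric:
            self.best_metric = metric
            self.saved.append(f"best@{epoch}")
```

With this guard, an empty or missing dictionary still produces a "latest" checkpoint, while the "best" checkpoint is only updated on epochs that actually ran validation.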