Failure in Training due to Empty Validation Metrics Dictionary when run_validation_freq is greater than 1 #1324

AlfredQin · 2023-07-26T10:29:44Z

🐛 Describe the bug

Bug: Failure in Training due to Empty Validation Metrics Dictionary

Description

When the run_validation_freq parameter in training_params of Trainer is set to a value greater than 1 (e.g., run_validation_freq: 5), the training process fails. This appears to be due to an issue with the validation metrics dictionary (valid_metrics_dict).

Details

The valid_metrics_dict is initialized as an empty dictionary. Validation only runs when (epoch + 1) % run_validation_freq == 0, so on every other epoch the dictionary remains empty and is passed to the _write_to_disk_operations function. This results in an error when _save_checkpoint attempts to access validation_results_dict[self.metric_to_watch], because validation_results_dict is empty.

Here is the problematic code:

# RUN TEST ON VALIDATION SET EVERY self.run_validation_freq EPOCHS
valid_metrics_dict = {}
if (epoch + 1) % self.run_validation_freq == 0:
    ...
    valid_metrics_dict = self._validate_epoch(context=context, silent_mode=silent_mode)
    ...
...
self._write_to_disk_operations(
    train_metrics_dict=train_metrics_dict,
    validation_results_dict=valid_metrics_dict,
    ...
)
...
...
def _write_to_disk_operations(
        self,
        train_metrics_dict: dict,
        validation_results_dict: dict,
        test_metrics_dict: dict,
        lr_dict: dict,
        inf_time: float,
        epoch: int,
        context: PhaseContext,
    ):
        ...
        # SAVE THE CHECKPOINT
        if self.training_params.save_model:
            self._save_checkpoint(self.optimizer, epoch + 1, validation_results_dict, context)
...
def _save_checkpoint(
    self,
    optimizer: torch.optim.Optimizer = None,
    epoch: int = None,
    validation_results_dict: Optional[Dict[str, float]] = None,
    context: PhaseContext = None,
) -> None:
    ...
    metric = validation_results_dict[self.metric_to_watch]
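
The failing lookup can be reproduced outside the Trainer with a plain dictionary. The following is only a minimal sketch: `pick_metric` and the metric name "Accuracy" are hypothetical and not part of the library. The guard it shows is one possible way to tolerate a skipped validation and is not necessarily how the issue was fixed upstream.

```python
from typing import Dict, Optional


def pick_metric(
    validation_results_dict: Optional[Dict[str, float]],
    metric_to_watch: str,
) -> Optional[float]:
    """Hypothetical helper: return the watched metric, or None when validation was skipped."""
    if not validation_results_dict:
        # Empty dict or None: no validation ran this epoch, so there is nothing to compare.
        return None
    return validation_results_dict[metric_to_watch]


metric_to_watch = "Accuracy"  # illustrative metric name (assumption)
valid_metrics_dict: Dict[str, float] = {}  # what the loop passes when validation is skipped

# Unguarded lookup, mirroring the last line of _save_checkpoint above.
try:
    valid_metrics_dict[metric_to_watch]
    failed = False
except KeyError:
    failed = True

print(failed)                                              # True
print(pick_metric(valid_metrics_dict, metric_to_watch))    # None
print(pick_metric({"Accuracy": 0.91}, metric_to_watch))    # 0.91
```

With a guard like this, the caller would still need to decide what to do when the metric is None, for example saving only the latest checkpoint and skipping the best-metric comparison.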


### Versions

1. Set run_validation_freq > 1 in the training parameters.
2. Run the training process.
3. Observe the error when the training tries to save a checkpoint.
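
To see which epochs are affected, the modulo check from the training loop can be evaluated on its own. This is a sketch under stated assumptions: a 0-based epoch counter, and the hypothetical values max_epochs = 10 and run_validation_freq = 5. `validation_epochs` is an illustrative helper, not a library function.

```python
from typing import List


def validation_epochs(max_epochs: int, run_validation_freq: int) -> List[int]:
    """Return the 0-based epochs on which the modulo check would run validation."""
    return [
        epoch
        for epoch in range(max_epochs)
        if (epoch + 1) % run_validation_freq == 0
    ]


max_epochs = 10          # assumed value for illustration
run_validation_freq = 5  # example value from the description

validated = validation_epochs(max_epochs, run_validation_freq)
skipped = [epoch for epoch in range(max_epochs) if epoch not in validated]

print(validated)  # [4, 9]
print(skipped)    # [0, 1, 2, 3, 5, 6, 7, 8]
```

Under these assumptions the very first epoch is already a skipped one, so the empty dictionary would reach the checkpoint save right away when save_model is enabled.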


BloodAxe · 2023-08-10T08:03:18Z

Thanks for the bug report. We will investigate this behavior.

philmarchenko · 2023-11-01T15:31:32Z

@BloodAxe, please see this PR. I've attempted to fix this.

BloodAxe added the 🐛 Bug label Aug 10, 2023

shaydeci closed this as completed Nov 27, 2023
