Refactor code to identify correct steps per epoch #325

Open
hvgazula opened this issue Apr 17, 2024 · 1 comment
hvgazula commented Apr 17, 2024

The following is an example where the steps per epoch are incorrect even though the number of batches is correct. The discrepancy arises because get_steps_per_epoch relies on n_volumes, whereas counting the number of batches requires iterating through the entire dataset, which can be time-consuming. Currently, n_volumes is estimated by iterating through the first shard and multiplying its size by the total number of shards, but this is only correct when every shard holds the same number of volumes. One option is to drop_remainder at the time of writing the shards themselves.

loading data
n_volumes: 9
Function: load_custom_tfrec Total runtime: 0:00:01.757646 (HH:MM:SS)
n_volumes: 5
Function: load_custom_tfrec Total runtime: 0:00:01.943470 (HH:MM:SS)
Train Batches (@ 2 GPUS): 4
Eval Batches (@ 2 GPUS): 2
Train steps per epoch: 5
Eval steps per epoch: 3
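The mismatch above can be reproduced with a minimal sketch. The helper names, shard sizes ([5, 4]), and global batch size (2) below are hypothetical, chosen only to be consistent with the logged numbers (n_volumes: 9, 4 train batches, 5 train steps); they are not the project's actual API.

```python
import math

def estimated_steps_per_epoch(first_shard_size, n_shards, batch_size):
    # Mirrors the current behavior: n_volumes is extrapolated from the
    # first shard, which overcounts when later shards are smaller.
    n_volumes = first_shard_size * n_shards
    return math.ceil(n_volumes / batch_size)

def actual_batches(shard_sizes, batch_size, drop_remainder=True):
    # Ground truth: count batches over the concatenated dataset.
    n_volumes = sum(shard_sizes)
    if drop_remainder:
        return n_volumes // batch_size
    return math.ceil(n_volumes / batch_size)

# Hypothetical uneven shards: 5 volumes in the first, 4 in the second.
shards = [5, 4]
batch_size = 2  # global batch size across 2 GPUs (assumed)

print(estimated_steps_per_epoch(shards[0], len(shards), batch_size))  # 5
print(actual_batches(shards, batch_size))                             # 4
```

The estimate extrapolates 5 × 2 = 10 volumes and reports 5 steps, while the real dataset has only 9 volumes and yields 4 full batches.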
@hvgazula hvgazula added the bug label Apr 17, 2024
@hvgazula hvgazula self-assigned this Apr 17, 2024
hvgazula commented
Again, if from_files is used, this will not be an issue the very first time because n_volumes is specified ahead of time. However, it can be problematic when from_tfrecords is used. Of course, the solution is #321 :)
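The drop_remainder idea suggested in the issue body could be applied at shard-writing time, so that n_volumes = shard_size × n_shards holds exactly. This is a hypothetical sketch (the helper name and signature are not from the project):

```python
def split_into_equal_shards(volumes, n_shards):
    # Drop the remainder before writing so every shard holds the same
    # number of volumes, making the n_volumes extrapolation exact.
    shard_size = len(volumes) // n_shards
    volumes = volumes[: shard_size * n_shards]  # remainder dropped here
    return [
        volumes[i * shard_size : (i + 1) * shard_size]
        for i in range(n_shards)
    ]

# 9 volumes split into 2 shards: one volume is dropped, leaving 4 + 4.
shards = split_into_equal_shards(list(range(9)), 2)
print([len(s) for s in shards])  # [4, 4]
```

The trade-off is that up to n_shards − 1 volumes are discarded per dataset, in exchange for a steps-per-epoch count that no longer requires iterating the full dataset.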
