Refactor code to identify correct steps per epoch #325

Open
hvgazula opened this issue Apr 17, 2024 · 1 comment
hvgazula commented Apr 17, 2024

The following is an example where the steps per epoch are incorrect even though the number of batches is correct. The discrepancy arises because get_steps_per_epoch relies on n_volumes, whereas counting the number of batches requires iterating through the entire dataset, which can be time-consuming. Currently, n_volumes is estimated by iterating through the first shard and multiplying its size by the total number of shards, but this is only correct when every shard holds the same number of volumes. One option is to drop_remainder at the time of writing the shards themselves.

loading data
n_volumes: 9
Function: load_custom_tfrec Total runtime: 0:00:01.757646 (HH:MM:SS)
n_volumes: 5
Function: load_custom_tfrec Total runtime: 0:00:01.943470 (HH:MM:SS)
Train Batches (@ 2 GPUS): 4
Eval Batches (@ 2 GPUS): 2
Train steps per epoch: 5
Eval steps per epoch: 3
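The mismatch above can be reproduced with a minimal sketch. The helper names, shard sizes ([5, 4]), and global batch size (2) below are hypothetical, chosen only to be consistent with the logged numbers (n_volumes: 9, 4 train batches, 5 train steps); they are not the project's actual API.

```python
import math

def estimated_steps_per_epoch(first_shard_size, n_shards, batch_size):
    # Mirrors the current behavior: n_volumes is extrapolated from the
    # first shard, which overcounts when later shards are smaller.
    n_volumes = first_shard_size * n_shards
    return math.ceil(n_volumes / batch_size)

def actual_batches(shard_sizes, batch_size, drop_remainder=True):
    # Ground truth: count batches over the concatenated dataset.
    n_volumes = sum(shard_sizes)
    if drop_remainder:
        return n_volumes // batch_size
    return math.ceil(n_volumes / batch_size)

# Hypothetical uneven shards: 5 volumes in the first, 4 in the second.
shards = [5, 4]
batch_size = 2  # global batch size across 2 GPUs (assumed)

print(estimated_steps_per_epoch(shards[0], len(shards), batch_size))  # 5
print(actual_batches(shards, batch_size))                             # 4
```

The estimate extrapolates 5 × 2 = 10 volumes and reports 5 steps, while the real dataset has only 9 volumes and yields 4 full batches.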
@hvgazula hvgazula added the bug label Apr 17, 2024
@hvgazula hvgazula self-assigned this Apr 17, 2024
hvgazula commented
Again, if from_files is used, this will not be an issue the very first time because n_volumes is specified ahead of time. However, it can be problematic when from_tfrecords is used. Of course, the solution is #321 :)
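The drop_remainder idea suggested in the issue body could be applied at shard-writing time, so that n_volumes = shard_size × n_shards holds exactly. This is a hypothetical sketch (the helper name and signature are not from the project):

```python
def split_into_equal_shards(volumes, n_shards):
    # Drop the remainder before writing so every shard holds the same
    # number of volumes, making the n_volumes extrapolation exact.
    shard_size = len(volumes) // n_shards
    volumes = volumes[: shard_size * n_shards]  # remainder dropped here
    return [
        volumes[i * shard_size : (i + 1) * shard_size]
        for i in range(n_shards)
    ]

# 9 volumes split into 2 shards: one volume is dropped, leaving 4 + 4.
shards = split_into_equal_shards(list(range(9)), 2)
print([len(s) for s in shards])  # [4, 4]
```

The trade-off is that up to n_shards − 1 volumes are discarded per dataset, in exchange for a steps-per-epoch count that no longer requires iterating the full dataset.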
