
Did you experience increase in time while training? #29

Closed

BasselAli1 opened this issue Jan 7, 2019 · 1 comment

BasselAli1 commented Jan 7, 2019

In the code, the time is printed in the save_a_report and save_a_snapshot functions. I found that the time increases during training: it starts small and keeps growing with more iterations, and sometimes it jumps dramatically.
Example in save_a_snapshot:
from [iteration 6000]

i_epoch: 2 i_iter: 6000 val_loss:2.4745 val_acc:0.3974 runtime: 33.43 min

to [iteration 7000]

i_epoch: 3 i_iter: 7000 val_loss:2.4929 val_acc:0.3963 runtime: 260.19 min

Example in save_a_report:
from [iteration 9200]

iter: 9200 train_loss: 1.1883  train_score: 0.6186  avg_train_score: 0.6580 val_score: 0.3975 val_loss: 2.6476 time(s): 195.1 s

to [iterations 9300, 9400]

iter: 9300 train_loss: 1.1513  train_score: 0.6588  avg_train_score: 0.6578 val_score: 0.3871 val_loss: 2.5129 time(s): 1371.4 s
iter: 9400 train_loss: 1.1269  train_score: 0.6826  avg_train_score: 0.6573 val_score: 0.4008 val_loss: 2.6888 time(s): 205.5 s

So in general the time increases continuously (in small steps) while training, and sometimes it jumps dramatically (in big steps).
Another example:
from [the first thousand iterations]

BEGIN TRAINING...
iter: 100 train_loss: 3.6458  train_score: 0.3031  avg_train_score: 0.1377 val_score: 0.3258 val_loss: 3.4013 time(s): 199.5 s
iter: 200 train_loss: 3.1655  train_score: 0.3158  avg_train_score: 0.2387 val_score: 0.3410 val_loss: 2.9842 time(s): 228.1 s
iter: 300 train_loss: 2.6502  train_score: 0.3777  avg_train_score: 0.3034 val_score: 0.3326 val_loss: 2.8516 time(s): 192.2 s
iter: 400 train_loss: 2.3548  train_score: 0.4258  avg_train_score: 0.3544 val_score: 0.3467 val_loss: 2.5927 time(s): 193.4 s
iter: 500 train_loss: 2.1484  train_score: 0.4705  avg_train_score: 0.4003 val_score: 0.3934 val_loss: 2.5520 time(s): 215.1 s
iter: 600 train_loss: 2.1211  train_score: 0.4840  avg_train_score: 0.4367 val_score: 0.3977 val_loss: 2.4975 time(s): 183.0 s
iter: 700 train_loss: 2.0060  train_score: 0.4648  avg_train_score: 0.4661 val_score: 0.3475 val_loss: 2.6645 time(s): 182.8 s
iter: 800 train_loss: 1.8998  train_score: 0.5230  avg_train_score: 0.4891 val_score: 0.3543 val_loss: 2.5015 time(s): 187.2 s
iter: 900 train_loss: 1.8344  train_score: 0.5258  avg_train_score: 0.5037 val_score: 0.3783 val_loss: 2.4491 time(s): 185.4 s
iter: 1000 train_loss: 1.7774  train_score: 0.5184  avg_train_score: 0.5165 val_score: 0.3938 val_loss: 2.5243 time(s): 183.8 s
i_epoch: 1 i_iter: 1000 val_loss:2.4742 val_acc:0.3838 runtime: 34.87 min

to [around the thirteen-thousandth iteration]

i_epoch: 5 i_iter: 13000 val_loss:2.7267 val_acc:0.3917 runtime: 54.67 min
iter: 13100 train_loss: 1.0795  train_score: 0.6867  avg_train_score: 0.6843 val_score: 0.3723 val_loss: 2.8208 time(s): 1550.9 s
iter: 13200 train_loss: 1.1232  train_score: 0.6627  avg_train_score: 0.6836 val_score: 0.4021 val_loss: 2.8624 time(s): 196.6 s
iter: 13300 train_loss: 1.0556  train_score: 0.6756  avg_train_score: 0.6826 val_score: 0.4186 val_loss: 2.5904 time(s): 210.9 s
iter: 13400 train_loss: 1.0774  train_score: 0.6979  avg_train_score: 0.6825 val_score: 0.4125 val_loss: 2.5742 time(s): 207.8 s
iter: 13500 train_loss: 1.0958  train_score: 0.6840  avg_train_score: 0.6843 val_score: 0.4084 val_loss: 2.5981 time(s): 201.2 s
iter: 13600 train_loss: 1.0693  train_score: 0.6816  avg_train_score: 0.6870 val_score: 0.4365 val_loss: 2.5409 time(s): 202.8 s
iter: 13700 train_loss: 1.1302  train_score: 0.6598  avg_train_score: 0.6871 val_score: 0.3939 val_loss: 2.7158 time(s): 197.0 s
iter: 13800 train_loss: 1.0662  train_score: 0.6736  avg_train_score: 0.6859 val_score: 0.3746 val_loss: 2.7563 time(s): 197.9 s
iter: 13900 train_loss: 1.0325  train_score: 0.6984  avg_train_score: 0.6857 val_score: 0.3762 val_loss: 2.9416 time(s): 214.2 s
iter: 14000 train_loss: 0.9614  train_score: 0.7232  avg_train_score: 0.6857 val_score: 0.3832 val_loss: 2.6673 time(s): 270.2 s
i_epoch: 5 i_iter: 14000 val_loss:2.6989 val_acc:0.3935 runtime: 60.91 min

I tried PyTorch 0.4 and PyTorch 1.0.
PS: I am training with the datasets [imdb_train2014.npy, imdb_val2train2014.npy, imdb_genome.npy, imdb_vdtrain.npy], but I don't think this makes any difference.
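One way to confirm that later intervals are genuinely slower (rather than a cumulative runtime counter being misread) is to reset the timer at every report. This is a minimal sketch with a hypothetical `run_with_interval_timing` helper, not the repository's actual reporting code:

```python
import time

def run_with_interval_timing(n_iters, report_every, work):
    """Run `work(i)` for n_iters steps, timing each report interval separately."""
    reports = []
    start = time.perf_counter()
    for i in range(1, n_iters + 1):
        work(i)
        if i % report_every == 0:
            now = time.perf_counter()
            # Record seconds spent in THIS interval only...
            reports.append((i, now - start))
            # ...then reset, so intervals never accumulate into each other.
            start = now
    return reports

reports = run_with_interval_timing(1000, 100, lambda i: None)
```

If the per-interval durations printed this way still grow, the slowdown is real and not a logging artifact.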

apsdehal (Contributor) commented Jan 7, 2019

Check my comment on issue #8. This will also be fixed in the new release.
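For readers hitting the same symptom: one well-known PyTorch pattern that produces exactly this gradual slowdown is accumulating loss tensors without converting them to Python floats, which keeps every iteration's autograd graph alive. This is a hedged sketch of the pitfall, not necessarily the specific fix referenced in #8:

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)

    opt.zero_grad()
    loss.backward()
    opt.step()

    # BAD:  running_loss += loss
    #       `running_loss` would then be a tensor that references every
    #       step's computation graph, so memory and per-step time creep up.
    # GOOD: detach to a plain Python float before accumulating.
    running_loss += loss.item()
```

The same applies to any metric stored across iterations (e.g. appending `loss` instead of `loss.item()` to a list for averaging).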
