
Did you experience increase in time while training? #29

Closed

BasselAli1 opened this issue Jan 7, 2019 · 1 comment

BasselAli1 commented Jan 7, 2019

In the code, the time is printed in the save_a_report and save_a_snapshot functions. I found that the time increases during training: it starts small and keeps growing with more iterations, and sometimes it jumps dramatically.
Example in save_a_snapshot:
from [iteration 6000]

i_epoch: 2 i_iter: 6000 val_loss:2.4745 val_acc:0.3974 runtime: 33.43 min

to [iteration 7000]

i_epoch: 3 i_iter: 7000 val_loss:2.4929 val_acc:0.3963 runtime: 260.19 min

Example in save_a_report:
from [iteration 9200]

iter: 9200 train_loss: 1.1883  train_score: 0.6186  avg_train_score: 0.6580 val_score: 0.3975 val_loss: 2.6476 time(s): 195.1 s

to [iterations 9300, 9400]

iter: 9300 train_loss: 1.1513  train_score: 0.6588  avg_train_score: 0.6578 val_score: 0.3871 val_loss: 2.5129 time(s): 1371.4 s
iter: 9400 train_loss: 1.1269  train_score: 0.6826  avg_train_score: 0.6573 val_score: 0.4008 val_loss: 2.6888 time(s): 205.5 s

So in general the time increases continuously (in small steps) while training, and sometimes it jumps dramatically (in big steps).
Another example:
from [the first thousand iterations]

BEGIN TRAINING...
iter: 100 train_loss: 3.6458  train_score: 0.3031  avg_train_score: 0.1377 val_score: 0.3258 val_loss: 3.4013 time(s): 199.5 s
iter: 200 train_loss: 3.1655  train_score: 0.3158  avg_train_score: 0.2387 val_score: 0.3410 val_loss: 2.9842 time(s): 228.1 s
iter: 300 train_loss: 2.6502  train_score: 0.3777  avg_train_score: 0.3034 val_score: 0.3326 val_loss: 2.8516 time(s): 192.2 s
iter: 400 train_loss: 2.3548  train_score: 0.4258  avg_train_score: 0.3544 val_score: 0.3467 val_loss: 2.5927 time(s): 193.4 s
iter: 500 train_loss: 2.1484  train_score: 0.4705  avg_train_score: 0.4003 val_score: 0.3934 val_loss: 2.5520 time(s): 215.1 s
iter: 600 train_loss: 2.1211  train_score: 0.4840  avg_train_score: 0.4367 val_score: 0.3977 val_loss: 2.4975 time(s): 183.0 s
iter: 700 train_loss: 2.0060  train_score: 0.4648  avg_train_score: 0.4661 val_score: 0.3475 val_loss: 2.6645 time(s): 182.8 s
iter: 800 train_loss: 1.8998  train_score: 0.5230  avg_train_score: 0.4891 val_score: 0.3543 val_loss: 2.5015 time(s): 187.2 s
iter: 900 train_loss: 1.8344  train_score: 0.5258  avg_train_score: 0.5037 val_score: 0.3783 val_loss: 2.4491 time(s): 185.4 s
iter: 1000 train_loss: 1.7774  train_score: 0.5184  avg_train_score: 0.5165 val_score: 0.3938 val_loss: 2.5243 time(s): 183.8 s
i_epoch: 1 i_iter: 1000 val_loss:2.4742 val_acc:0.3838 runtime: 34.87 min

to [around the thirteen-thousandth iteration]

i_epoch: 5 i_iter: 13000 val_loss:2.7267 val_acc:0.3917 runtime: 54.67 min
iter: 13100 train_loss: 1.0795  train_score: 0.6867  avg_train_score: 0.6843 val_score: 0.3723 val_loss: 2.8208 time(s): 1550.9 s
iter: 13200 train_loss: 1.1232  train_score: 0.6627  avg_train_score: 0.6836 val_score: 0.4021 val_loss: 2.8624 time(s): 196.6 s
iter: 13300 train_loss: 1.0556  train_score: 0.6756  avg_train_score: 0.6826 val_score: 0.4186 val_loss: 2.5904 time(s): 210.9 s
iter: 13400 train_loss: 1.0774  train_score: 0.6979  avg_train_score: 0.6825 val_score: 0.4125 val_loss: 2.5742 time(s): 207.8 s
iter: 13500 train_loss: 1.0958  train_score: 0.6840  avg_train_score: 0.6843 val_score: 0.4084 val_loss: 2.5981 time(s): 201.2 s
iter: 13600 train_loss: 1.0693  train_score: 0.6816  avg_train_score: 0.6870 val_score: 0.4365 val_loss: 2.5409 time(s): 202.8 s
iter: 13700 train_loss: 1.1302  train_score: 0.6598  avg_train_score: 0.6871 val_score: 0.3939 val_loss: 2.7158 time(s): 197.0 s
iter: 13800 train_loss: 1.0662  train_score: 0.6736  avg_train_score: 0.6859 val_score: 0.3746 val_loss: 2.7563 time(s): 197.9 s
iter: 13900 train_loss: 1.0325  train_score: 0.6984  avg_train_score: 0.6857 val_score: 0.3762 val_loss: 2.9416 time(s): 214.2 s
iter: 14000 train_loss: 0.9614  train_score: 0.7232  avg_train_score: 0.6857 val_score: 0.3832 val_loss: 2.6673 time(s): 270.2 s
i_epoch: 5 i_iter: 14000 val_loss:2.6989 val_acc:0.3935 runtime: 60.91 min

I tried PyTorch 0.4 and PyTorch 1.0.
PS: I am training with the datasets [imdb_train2014.npy, imdb_val2train2014.npy, imdb_genome.npy, imdb_vdtrain.npy], but I don't think this makes any difference.
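One way to confirm that later intervals are genuinely slower (rather than a cumulative runtime counter being misread) is to reset the timer at every report. This is a minimal sketch with a hypothetical `run_with_interval_timing` helper, not the repository's actual reporting code:

```python
import time

def run_with_interval_timing(n_iters, report_every, work):
    """Run `work(i)` for n_iters steps, timing each report interval separately."""
    reports = []
    start = time.perf_counter()
    for i in range(1, n_iters + 1):
        work(i)
        if i % report_every == 0:
            now = time.perf_counter()
            # Record seconds spent in THIS interval only...
            reports.append((i, now - start))
            # ...then reset, so intervals never accumulate into each other.
            start = now
    return reports

reports = run_with_interval_timing(1000, 100, lambda i: None)
```

If the per-interval durations printed this way still grow, the slowdown is real and not a logging artifact.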

apsdehal (Contributor) commented Jan 7, 2019

Check my comment on issue #8. This will also be fixed in the new release.
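For readers hitting the same symptom: one well-known PyTorch pattern that produces exactly this gradual slowdown is accumulating loss tensors without converting them to Python floats, which keeps every iteration's autograd graph alive. This is a hedged sketch of the pitfall, not necessarily the specific fix referenced in #8:

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)

    opt.zero_grad()
    loss.backward()
    opt.step()

    # BAD:  running_loss += loss
    #       `running_loss` would then be a tensor that references every
    #       step's computation graph, so memory and per-step time creep up.
    # GOOD: detach to a plain Python float before accumulating.
    running_loss += loss.item()
```

The same applies to any metric stored across iterations (e.g. appending `loss` instead of `loss.item()` to a list for averaging).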
