GradientAccumulator wrapper not working as expected #2
Comments
Instructions on how to reproduce the issue have now been added here.
Silly me. It is obviously wrong to run the comparison the way I first did. When I run the same number of epochs using the train_step overload approach, the numbers look as expected. Output from some benchmarks is added below (some prints removed for readability):
Note that using bs=1 & acs=32 actually improves results vs bs=acs=1. Hence, performing gradient accumulation actually improves performance. Doing bs=1 & acs=32 vs bs=32 & acs=1 produces almost identical results. Theoretically, they should be identical (see the note below). As I ran this on the GPU and handled all the random-seed issues, there is no real explanation for why they are not identical. I therefore believe there is a minor bug somewhere, likely the last update of an epoch not being performed, or something like that.

Lastly, note that training with acs=32 & bs=1 is a lot slower than acs=1 & bs=32. Luckily, this is expected, and it is the drawback of using accumulated gradients. In return, it makes it possible to use a much larger effective batch size than we normally could! If possible, one could try to increase the batch size and reduce the accumulation steps, e.g., bs=8 & acs=4, which should yield identical results to the ones above (except bs=1 & acs=1, obviously).
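For reference, here is why the two settings should match in theory. A sketch, assuming equally sized batches, a mean-reduced loss, and no batch-size-dependent layers (e.g., BatchNorm): by linearity of the gradient, averaging the gradients of k small batches equals the gradient over the combined large batch, so the optimizer receives the same gradient at every update in both settings.

```latex
% k batches B_1, ..., B_k of equal size b, mean-reduced loss L(.):
\frac{1}{k}\sum_{i=1}^{k} \nabla_\theta L(B_i)
    = \nabla_\theta\!\left(\frac{1}{k}\sum_{i=1}^{k} L(B_i)\right)
    = \nabla_\theta L(B_1 \cup \dots \cup B_k)
% hence one update with bs = b, acs = k should equal one update with bs = k*b, acs = 1
```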
However, interestingly enough, I don't get identical results using the wrapper approach (GradientAccumulator):
Naturally, bs=32 & acs=1 is identical between the two approaches. Performance is also a lot worse using the GA wrapper.
Just ran a new experiment to further test the train_step overload approach:
We get close to identical results using bs=256 & acs=1 vs bs=32 & acs=8, which is the expected behaviour! However, there is a minor difference. Here it is definitely due to the number of updates not being the same for each experiment, as the total number of samples (N=60000) is not a multiple of 256 (60000 % 256 != 0); see the quick calculation below. We can therefore conclude that the train_step overload approach is a viable option. Sadly, the wrapper approach has not had the same success (yet).
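To make that concrete, a rough back-of-the-envelope count of optimizer updates per epoch for the two runs (assuming the last partial batch is kept, as model.fit does by default, and that leftover accumulated batches do not trigger an extra update):

```python
import math

N = 60000  # MNIST training samples

# bs=256, acs=1: one optimizer update per batch
updates_a = math.ceil(N / 256)   # 235 (the last batch only has 60000 % 256 = 96 samples)

# bs=32, acs=8: one optimizer update per 8 accumulated batches
batches = N // 32                # 1875 batches per epoch
updates_b = batches // 8         # 234 updates, with 3 accumulated batches left over

print(updates_a, updates_b)      # 235 234
```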
Started discussion #3 if anyone is interested in discussing this further. Will keep this issue open for now for new users.
Fixed in 5f1a703
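For readers landing here, below is a minimal sketch of the train_step-overload pattern discussed above; it is not the actual implementation in this repository. It assumes TensorFlow 2.x (the compiled_loss / compiled_metrics API), builds the model functionally so the weights exist up front, and uses run_eagerly=True so a plain Python branch controls when the optimizer is called; a graph-mode version would need tf.cond and extra care with optimizer slot creation. The accum_steps argument and all names are placeholders.

```python
import tensorflow as tf

class GAModel(tf.keras.Model):
    """Sketch: accumulate gradients and apply them every `accum_steps` batches."""

    def __init__(self, *args, accum_steps=8, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = accum_steps
        self._step = tf.Variable(0, dtype=tf.int64, trainable=False)
        # one non-trainable zero buffer per weight (the weights already exist
        # because the model is constructed functionally below)
        self._accum = [tf.Variable(tf.zeros_like(w), trainable=False)
                       for w in self.trainable_variables]

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)

        # accumulate scaled gradients instead of applying them right away
        for buf, g in zip(self._accum, grads):
            buf.assign_add(g / self.accum_steps)
        self._step.assign_add(1)

        # plain Python branch (run_eagerly=True): the optimizer only ever sees
        # the averaged gradient, once per `accum_steps` batches
        if int(self._step) % self.accum_steps == 0:
            self.optimizer.apply_gradients(zip(self._accum, self.trainable_variables))
            for buf in self._accum:
                buf.assign(tf.zeros_like(buf))

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

# hypothetical usage on a small classifier
inputs = tf.keras.Input(shape=(784,))
outputs = tf.keras.layers.Dense(10, activation="softmax")(inputs)
model = GAModel(inputs, outputs, accum_steps=8)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"],
              run_eagerly=True)
# model.fit(x_train, y_train, batch_size=32, epochs=3)
```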
In gradient accumulation, we try to update the weights only after a given number of iterations (k batches), in an ensemble-like manner: for instance, by averaging the gradients calculated over k batches and only then updating the weights, simulating regular large-batch training.
After running the benchmark described here, we do not get the same results. It seems like the weights are updated for every batch even though we use accum_steps > 4.
Both the original wrapper implementation GradientAccumulator and the Adam-based wrapper AdamAccumulate suffer from this.
Are we actually able to control when the weights are updated from the optimizer, or can we only calculate and get the gradients and enforce an update ourselves?
Obviously we can write our own training loop (a rough sketch is added below), but the whole point is to have a simple wrapper class that handles all this for us.
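For completeness, controlling the update timing is straightforward in a hand-written loop; the open question is whether a wrapper can hide this. A rough sketch, assuming TF 2.x, with a placeholder model and synthetic data just to make it runnable:

```python
import numpy as np
import tensorflow as tf

# placeholder model and data
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(256, 20).astype("float32"),
     np.random.randint(0, 10, 256))).batch(1)

accum_steps = 32
accum = None

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    if accum is None:
        accum = [tf.zeros_like(g) for g in grads]
    # accumulate instead of applying immediately
    accum = [a + g / accum_steps for a, g in zip(accum, grads)]
    # here we decide exactly when the optimizer touches the weights
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        accum = [tf.zeros_like(g) for g in grads]
```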