GradientAccumulator wrapper not working as expected #2
Comments
Instructions on how to reproduce the issue have now been added here.
Silly me. It is obviously wrong to run the comparison the way I first did. When I run the same number of epochs using the train_step overload approach, the numbers look as expected. Output from some benchmarks is added below (some prints removed for readability):
Note that using bs=1 & acs=32 actually improves results vs bs=acs=1. Hence, performing gradient accumulation actually improves performance. Doing bs=1 & acs=32 vs bs=32 & acs=1 produces almost identical results. Theoretically, they should be identical (see the note below). As I ran this on the GPU and handled all the random-seed issues, there is no real explanation for why they are not identical. I therefore believe there is a minor bug somewhere, likely the last update of an epoch not being performed, or something like that.

Lastly, note that training with acs=32 & bs=1 is a lot slower than acs=1 & bs=32. Luckily, this is expected, and it is the drawback of using accumulated gradients. In return, it makes it possible to use a much larger effective batch size than we normally could! If possible, one could try to increase the batch size and reduce the accumulation steps, e.g., bs=8 & acs=4, which should yield identical results to the ones above (except bs=1 & acs=1, obviously).
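For reference, here is why the two settings should match in theory. A sketch, assuming equally sized batches, a mean-reduced loss, and no batch-size-dependent layers (e.g., BatchNorm): by linearity of the gradient, averaging the gradients of k small batches equals the gradient over the combined large batch, so the optimizer receives the same gradient at every update in both settings.

```latex
% k batches B_1, ..., B_k of equal size b, mean-reduced loss L(.):
\frac{1}{k}\sum_{i=1}^{k} \nabla_\theta L(B_i)
    = \nabla_\theta\!\left(\frac{1}{k}\sum_{i=1}^{k} L(B_i)\right)
    = \nabla_\theta L(B_1 \cup \dots \cup B_k)
% hence one update with bs = b, acs = k should equal one update with bs = k*b, acs = 1
```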
However, interestingly enough, I don't get identical results using the wrapper approach (GradientAccumulator):
Naturally, bs=32 & acs=1 is identical between the two approaches. Performance is also a lot worse using the GA wrapper.
Just ran a new experiment to further test the train_step overload approach:
We get close to identical results using bs=256 & acs=1 vs bs=32 & acs=8, which is the expected behaviour! However, there is a minor difference. Here it is definitely due to the number of updates not being the same for each experiment, as the total number of samples (N=60000) is not a multiple of 256 (60000 % 256 != 0); see the quick calculation below. We can therefore conclude that the train_step overload approach is a viable option. Sadly, the wrapper approach has not had the same success (yet).
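To make that concrete, a rough back-of-the-envelope count of optimizer updates per epoch for the two runs (assuming the last partial batch is kept, as model.fit does by default, and that leftover accumulated batches do not trigger an extra update):

```python
import math

N = 60000  # MNIST training samples

# bs=256, acs=1: one optimizer update per batch
updates_a = math.ceil(N / 256)   # 235 (the last batch only has 60000 % 256 = 96 samples)

# bs=32, acs=8: one optimizer update per 8 accumulated batches
batches = N // 32                # 1875 batches per epoch
updates_b = batches // 8         # 234 updates, with 3 accumulated batches left over

print(updates_a, updates_b)      # 235 234
```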
Started discussion #3 if anyone is interested in discussing this further. Will keep this issue open for now for new users.
Fixed in 5f1a703
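For readers landing here, below is a minimal sketch of the train_step-overload pattern discussed above; it is not the actual implementation in this repository. It assumes TensorFlow 2.x (the compiled_loss / compiled_metrics API), builds the model functionally so the weights exist up front, and uses run_eagerly=True so a plain Python branch controls when the optimizer is called; a graph-mode version would need tf.cond and extra care with optimizer slot creation. The accum_steps argument and all names are placeholders.

```python
import tensorflow as tf

class GAModel(tf.keras.Model):
    """Sketch: accumulate gradients and apply them every `accum_steps` batches."""

    def __init__(self, *args, accum_steps=8, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = accum_steps
        self._step = tf.Variable(0, dtype=tf.int64, trainable=False)
        # one non-trainable zero buffer per weight (the weights already exist
        # because the model is constructed functionally below)
        self._accum = [tf.Variable(tf.zeros_like(w), trainable=False)
                       for w in self.trainable_variables]

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)

        # accumulate scaled gradients instead of applying them right away
        for buf, g in zip(self._accum, grads):
            buf.assign_add(g / self.accum_steps)
        self._step.assign_add(1)

        # plain Python branch (run_eagerly=True): the optimizer only ever sees
        # the averaged gradient, once per `accum_steps` batches
        if int(self._step) % self.accum_steps == 0:
            self.optimizer.apply_gradients(zip(self._accum, self.trainable_variables))
            for buf in self._accum:
                buf.assign(tf.zeros_like(buf))

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

# hypothetical usage on a small classifier
inputs = tf.keras.Input(shape=(784,))
outputs = tf.keras.layers.Dense(10, activation="softmax")(inputs)
model = GAModel(inputs, outputs, accum_steps=8)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"],
              run_eagerly=True)
# model.fit(x_train, y_train, batch_size=32, epochs=3)
```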
In gradient accumulation, we try to update the weights only after a given number of iterations (k batches), in an ensemble-like manner: for instance, by averaging the gradients calculated over k batches and only then updating the weights, simulating regular large-batch training.
After running the benchmark described here, we do not get the same results. It seems like the weights are updated for every batch even though we use accum_steps > 4.
Both the original wrapper implementation GradientAccumulator and the Adam-based wrapper AdamAccumulate suffer from this.
Are we actually able to control when the weights are updated from the optimizer, or can we only calculate and get the gradients and enforce an update ourselves?
Obviously we can write our own training loop (a rough sketch is added below), but the whole point is to have a simple wrapper class that handles all this for us.
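For completeness, controlling the update timing is straightforward in a hand-written loop; the open question is whether a wrapper can hide this. A rough sketch, assuming TF 2.x, with a placeholder model and synthetic data just to make it runnable:

```python
import numpy as np
import tensorflow as tf

# placeholder model and data
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(256, 20).astype("float32"),
     np.random.randint(0, 10, 256))).batch(1)

accum_steps = 32
accum = None

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    if accum is None:
        accum = [tf.zeros_like(g) for g in grads]
    # accumulate instead of applying immediately
    accum = [a + g / accum_steps for a, g in zip(accum, grads)]
    # here we decide exactly when the optimizer touches the weights
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        accum = [tf.zeros_like(g) for g in grads]
```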