I am trying to write minimal code to track the total number of training batches seen so far, so that the count is available in the logs during validation.
For non-distributed training, I simply add a training_batches_so_far variable in my LightningModule's __init__, increment it in training_step(), and add it to the progress_bar and log fields of the output dict.
However, I want to make sure I am doing this properly for distributed training. What is the simplest way to do this? Ideally, I would like to be able to control how various metrics are aggregated (sum, avg, max); in this case, the aggregation would sum the training batches seen by each worker into a global total. I found the related issues #702 and #1165, but it is unclear to me what the simplest approach / best practice is.
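For reference, here is a minimal sketch of the single-process version described above, using the dict-return style of training_step. The class name and loss computation are placeholders:

```python
import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):  # hypothetical module name
    def __init__(self):
        super().__init__()
        # Per-process count of training batches seen so far.
        self.training_batches_so_far = 0

    def training_step(self, batch, batch_idx):
        self.training_batches_so_far += 1
        loss = self.compute_loss(batch)  # placeholder for the real loss
        count = torch.tensor(float(self.training_batches_so_far))
        # Expose the (local) count in the progress bar and the logs.
        return {
            "loss": loss,
            "progress_bar": {"train_batches_so_far": count},
            "log": {"train_batches_so_far": count},
        }
```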
I thought I had this figured out by accumulating batch counts in training_epoch_end(), but that hook runs after validation, so my validation epoch did not have access to the total number of training batches. Any help would be appreciated.
My goal here is simply to write code that correctly accumulates batch counts regardless of which type of distributed training I am using. I'm sure PyTorch Lightning makes this simple, but I am having a hard time figuring out exactly where to do the increments and aggregations.
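In case it helps frame the question, here is the kind of thing I am imagining for the distributed case: a plain torch.distributed.all_reduce in the on_validation_epoch_start hook, so the global total exists before validation begins. Whether this is the intended Lightning way is exactly what I am unsure about. Swapping ReduceOp.SUM for ReduceOp.MAX, or dividing the sum by the world size, would cover the other aggregations (max, avg):

```python
import torch
import torch.distributed as dist
import pytorch_lightning as pl


class MyModule(pl.LightningModule):  # continuing the sketch above
    def on_validation_epoch_start(self):
        count = torch.tensor(
            float(self.training_batches_so_far), device=self.device
        )
        # Under DDP, sum the per-worker counts into a global total;
        # in single-process training this branch is skipped and the
        # local count is used as-is.
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(count, op=dist.ReduceOp.SUM)
        # Global total of training batches, visible during validation.
        self.total_train_batches_seen = int(count.item())
```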
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.