Allow variable number of repetitions for RA #5084
Conversation
💊 CI failures summary and remediations. As of commit 9a89f9d (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
Thanks @tbennun. Overall it looks like a reasonable addition. I've added a question below; let me know your thoughts.
parser.add_argument("--ra-sampler", action="store_true", help="whether to use ra_sampler in training") | ||
parser.add_argument("--ra-sampler", action="store_true", help="whether to use Repeated Augmentation in training") | ||
parser.add_argument( | ||
"--ra-reps", default=3, type=int, help="number of repetitions for Repeated Augmentation (default: 3)" |
Thoughts on whether we should have both a boolean and an integer param vs combining the two into a single one? Other parameters (such as `--clip-grad-norm`, `--auto-augment`) take the latter approach, defining `default=None` or `default=0` to turn the feature off by default. Having an extra boolean option is useful when a feature has many config parameters, but it also creates noise when reading the training log to determine which inputs were actually active for the training.
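For illustration, a minimal sketch of this single-parameter alternative (the flag name and exact semantics here are hypothetical, not what the PR adds):

import argparse

parser = argparse.ArgumentParser()
# Hypothetical single-flag variant: 0 (the default) leaves Repeated Augmentation off,
# while any positive value both enables it and sets the number of repetitions.
parser.add_argument(
    "--repeated-augmentations",
    default=0,
    type=int,
    help="number of repetitions for Repeated Augmentation; 0 disables it (default: 0)",
)

args = parser.parse_args()
use_ra = args.repeated_augmentations > 0  # the feature is active only for positive values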
I completely agree, though I followed the style of the EMA flag(s). Should I switch to a more general `--repeated-augmentations` with default 0?
Thanks for the input. :) When we introduced EMA, we offered 2 parameters, `--model-ema-steps` and `--model-ema-decay`; that's why we didn't follow this approach.
@sallysyw Any thoughts on which approach we should choose here?
Makes sense. Two things to consider here:
- Users may not be aware of what a "good value" for the number of repetitions is. This is solvable, though: it could be given as an example in the `help` text of the argument.
- In my fork, I'm adding another boolean flag to keep the same epoch length (rather than cutting it down by the number of repetitions) and replicate the samples within the same minibatch (and GPU), following the original Batch Augmentation paper (a rough sketch of the idea follows below). If it improves generalization more than the DeiT approach, would you be interested in this as part of another PR?
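A rough sketch of that within-minibatch replication idea (illustrative only; the helper name and the assumption that the dataset yields raw, untransformed images are not from the fork):

import torch

def make_batch_augment_collate(transform, repetitions=4):
    """Build a collate_fn that repeats every sample `repetitions` times inside the
    same minibatch, applying the random transform independently to each copy."""
    def collate(batch):
        images, targets = [], []
        for img, target in batch:  # the dataset is assumed to yield raw, untransformed images
            for _ in range(repetitions):
                images.append(transform(img))   # each copy gets its own random augmentation
                targets.append(target)
        return torch.stack(images), torch.tensor(targets)
    return collate

# Usage sketch: shrink the loader batch size so the effective batch size stays the same.
# loader = torch.utils.data.DataLoader(dataset, batch_size=64 // 4,
#                                      collate_fn=make_batch_augment_collate(train_transform, 4))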
Concerning point 1, I agree with your arguments and I don't have strong opinions. I think your PR is mergeable as-is, but I'll let @sallysyw, who contributed the class, decide whether we should go with 2 parameters or 1.
Concerning point 2, unless the gap in accuracy is massive, I would keep things simple and leave it as is.
Sorry I missed the discussion before. I think for this specific use case, having 2 parameters for RA is better than 1 because, as @tbennun mentioned, users might not have a good sense of which `ra_reps` to use. So it's important to have a non-zero default (also for backward compatibility when replicating the ViT training).
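For context, a simplified sketch of how the two flags end up being consumed when building the training sampler (illustrative, not the exact reference-script code; `build_train_sampler` is a made-up wrapper):

import torch
from sampler import RASampler  # references/classification/sampler.py

def build_train_sampler(dataset, args):
    """Illustrative wrapper: --ra-sampler switches the feature on, while --ra-reps
    keeps a non-zero default of 3 and only controls the repetition count."""
    if args.distributed and args.ra_sampler:
        return RASampler(dataset, shuffle=True, repetitions=args.ra_reps)
    if args.distributed:
        return torch.utils.data.distributed.DistributedSampler(dataset)
    return torch.utils.data.RandomSampler(dataset)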
@tbennun Sounds good, let's merge it. Do you have the log file and checkpoint for the ResNet50 run?
Hey @datumbox! You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py
Checkpoints:
The script didn't store a log file, so I only have the last few printouts (would be nice if it did by default :) ). However, it did store all the models from every epoch. I attached above the relevant two (last, best), and here is the log of running with

The command line I used for training is as follows (as opposed to the blog, I trained on four GPUs rather than eight, but with the same overall batch size):

I am now testing the mode where a full epoch is used on the same system (training will see the same number of samples as the original, but with 4x augmentations).
@tbennun Thanks for sharing the checkpoints and the info. This is great! It's a bummer you didn't keep the log. We aim for full reproducibility, and we normally store the logs and all checkpoints for every model we release. I'm going to fire a new set of training runs with
Summary:
Co-authored-by: Vasilis Vryniotis <[email protected]>
Reviewed By: prabhat00155
Differential Revision: D33253472
fbshipit-source-id: d193892d8cfcd8b7dbc6cc04603eb19f65c9c3e3
@tbennun Happy new year! I was able to verify your findings and get a run which achieves:
Would you be interested in sending a PR so that you can be credited for the contribution? Given that you don't have access to deploy the weights, I can do that for you. I propose the following:
@datumbox Happy new year! That's great! I'll take care of the comments and the PR. I'm currently running the other version I mentioned before (saving all logs this time :) ). Is it OK if we wait to merge the new PR until training completes? That way we can upload a modified version if the accuracy improvement is substantial.
@tbennun Yes, no problem at all. If the accuracy diff is substantial, we can deploy yours. Though in that case, we would need to find a way for you to send us all the checkpoints and log files associated with the model. This is several hundred GB, so it might be a bit tricky, but I'm sure we will work it out.
@tbennun Just checking to see if you have any update concerning the training. Thanks!
@datumbox yes. It's almost complete (epoch 552 right now) and the max top-1 accuracy is 81.05% (at epoch 513, top-5 95.52%), but it's not getting any higher at this point. This is the exact same recipe but with the full BA. I kept the log and all checkpoints as well, but you can judge whether this is something worth uploading. What do you think?
@tbennun Sounds good, let's review the full details once the training is completed. Please upload the best-epoch checkpoint and the log somewhere once training finishes so that we can check. If we do end up using your run, it's going to be a challenge fetching all the logs and checkpoints. Bear with us as we try to work these details out; you are the first person who wants to contribute a half-TB run, so that comes with its own challenges. 😄
@datumbox Training complete.
Some notes:
diff --git a/references/classification/sampler.py b/references/classification/sampler.py
index 3c5e8b01..9ae16500 100644
--- a/references/classification/sampler.py
+++ b/references/classification/sampler.py
@@ -15,7 +15,7 @@ class RASampler(torch.utils.data.Sampler):
https://github.com/facebookresearch/deit/blob/main/samplers.py
"""
- def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0, repetitions=3):
+ def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0, repetitions=3, whole_dataset=True):
if num_replicas is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available!")
@@ -34,6 +34,7 @@ class RASampler(torch.utils.data.Sampler):
self.shuffle = shuffle
self.seed = seed
self.repetitions = repetitions
+ self.alldata = whole_dataset
def __iter__(self):
# Deterministically shuffle based on epoch
@@ -44,6 +45,9 @@ class RASampler(torch.utils.data.Sampler):
else:
indices = list(range(len(self.dataset)))
+ if self.alldata and self.num_replicas == self.repetitions:
+ return iter(indices)
+
# Add extra samples to make it evenly divisible
indices = [ele for ele in indices for i in range(self.repetitions)]
indices += indices[: (self.total_size - len(indices))]
@@ -56,7 +60,7 @@ class RASampler(torch.utils.data.Sampler):
return iter(indices[: self.num_selected_samples])
def __len__(self):
- return self.num_selected_samples
+ return len(self.dataset) if self.alldata else self.num_selected_samples
def set_epoch(self, epoch):
self.epoch = epoch
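For illustration, a minimal usage sketch of the patched sampler on a 4-GPU run (the variable names and values are hypothetical; passing `num_replicas` and `rank` explicitly keeps the sketch from touching the process group):

from sampler import RASampler  # patched as in the diff above

# Hypothetical 4-GPU setup: `train_dataset` and `local_rank` are assumed to exist.
# With whole_dataset=True and num_replicas == repetitions, each process iterates over
# the full dataset, so an epoch keeps the original number of unique samples.
train_sampler = RASampler(
    train_dataset, num_replicas=4, rank=local_rank, repetitions=4, whole_dataset=True
)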
@tbennun Thanks for the update. I assume that by nodes you meant the number of available GPUs. Judging from the patch you provided, I understand that each GPU actually runs the full size of the dataset, meaning that for

Unfortunately I don't think we would like to make such a modification to the RASampler, because we try to keep our implementations canonical. Moreover, I suspect that one could achieve a similar result by running the training for 4x more epochs. On the other hand, your original proposal of using the

If you send a PR with the changes described here, I can help you deploy the pre-trained model on PyTorch's S3 bucket and release it. Let me know what you think.
@datumbox agreed on all counts. If that's alright with you, I will create the PR over this weekend. Thanks again for your patience!
Of course, sounds great. Looking forward to the contribution.
@datumbox Done: |
The number of repetitions in the RA sampler was fixed at 3. With this PR it can be changed through CLI flags (a quick illustration of what the repetition count controls follows below).
Related to #5051.
cc @datumbox
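As a quick illustration of what the repetition count controls inside the sampler (simplified from `RASampler.__iter__`, not the new CLI plumbing itself):

# Simplified from RASampler.__iter__: after shuffling, every index is repeated
# `repetitions` times before the list is padded and split across processes.
indices = list(range(5))      # stand-in for the shuffled dataset indices
repetitions = 3
repeated = [idx for idx in indices for _ in range(repetitions)]
print(repeated)               # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]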