[Fix] Fix iter bug when resuming checkpoint in distributed train #866
Signed-off-by: FreyWang <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #866 +/- ##
=======================================
Coverage 89.02% 89.02%
=======================================
Files 111 111
Lines 6043 6043
Branches 969 969
=======================================
Hits 5380 5380
Misses 467 467
Partials 196 196
Hi @FreyWang
I met the problem when using
Do you have any idea to adjust the number of iterations for
Should we add notes somewhere about resuming with a different number of GPUs? Btw, I don't think resuming with a different batch size is a reasonable operation.
Actually, I agree that a resumed experiment should have the same batch size as the checkpoint.
I agree too; maybe the batch size should be required to stay the same.
…n-mmlab#866) * [Fix] Fix iter bug when resuming checkpoint in distributed train * fix lint error Signed-off-by: FreyWang <[email protected]>
…lab#866) * Add require_serial to prevent multi-threading that causes file I/O conflict * Modify update_model_index.py to prevent redundant file I/O * Set sort_keys=True and update model .yml files
Motivation
As shown in https://github.com/open-mmlab/mmcv/blob/b4bfeb53c57f2c843cb5015f9c0a2d1689dba9c4/mmcv/runner/base_runner.py#L379 , cfg.gpu_ids is used to calculate the current iteration when resuming a checkpoint in epoch_based_runner and based_runner, but in distributed training it defaults to range(1), which leads to a wrong iteration count after resuming. This PR sets it to range(world_size) instead.
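
To make the failure mode concrete, here is a minimal sketch of the arithmetic involved. It assumes the resume logic rescales the saved iteration by the ratio of the GPU count recorded in the checkpoint config to the current world size, which is my reading of the linked line rather than something shown in this PR. The helper name rescale_iter and the numbers are hypothetical and are not part of mmcv.

```python
def rescale_iter(saved_iter, previous_gpu_ids, world_size):
    """Hypothetical helper (not an mmcv API).

    Assumed behaviour: when the GPU count recorded in the checkpoint
    differs from the current world size, the saved iteration is scaled
    by len(previous_gpu_ids) / world_size.
    """
    if previous_gpu_ids and len(previous_gpu_ids) != world_size:
        return int(saved_iter * len(previous_gpu_ids) / world_size)
    return saved_iter


# Illustrative values: a checkpoint saved at iteration 8000 on 8 GPUs,
# resumed on the same 8 GPUs.
world_size = 8
saved_iter = 8000

# Before the fix: distributed training records gpu_ids = range(1),
# so the resume step believes the GPU count changed from 1 to 8.
iter_before_fix = rescale_iter(saved_iter, range(1), world_size)

# After the fix: gpu_ids = range(world_size), so the counts match
# and the iteration is left untouched.
iter_after_fix = rescale_iter(saved_iter, range(world_size), world_size)

print(iter_before_fix, iter_after_fix)
```

Under these assumptions the pre-fix run resumes at iteration 1000 instead of 8000, even though the hardware setup is unchanged; recording range(world_size) removes the spurious rescaling.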