
[Fix] Fix iter bug when resuming checkpoint in distributed train #866

Merged 2 commits into open-mmlab:master on Sep 11, 2021

Conversation

@FreyWang (Contributor) commented Sep 10, 2021

Motivation

As shown in
https://github.com/open-mmlab/mmcv/blob/b4bfeb53c57f2c843cb5015f9c0a2d1689dba9c4/mmcv/runner/base_runner.py#L379,
cfg.gpu_ids is used to calculate the current iteration when resuming a checkpoint in epoch_based_runner and base_runner, but in distributed training it defaults to range(1), which leads to a wrong iteration count. This PR fixes it to range(world_size).
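
For context, the fix amounts to setting cfg.gpu_ids from the real world size when distributed training is initialized. The sketch below assumes the general structure of mmsegmentation's tools/train.py; the function name setup_gpu_ids, its arguments, and cfg.dist_params are illustrative, not the actual diff:

```python
from mmcv import Config
from mmcv.runner import get_dist_info, init_dist


def setup_gpu_ids(cfg: Config, launcher: str) -> bool:
    """Illustrative sketch: make cfg.gpu_ids reflect the real process count.

    cfg.gpu_ids ends up in the checkpoint meta and is later used by the
    runner to recalculate the iteration on resume, so under distributed
    training it must not stay at the single-GPU default range(1).
    """
    if launcher == 'none':
        # Non-distributed run: the default already matches reality.
        cfg.gpu_ids = getattr(cfg, 'gpu_ids', range(1))
        return False
    # Distributed run: initialize the process group, then record the
    # actual number of processes instead of the default range(1).
    init_dist(launcher, **cfg.dist_params)
    _, world_size = get_dist_info()
    cfg.gpu_ids = range(world_size)
    return True
```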

@codecov (bot) commented Sep 10, 2021

Codecov Report

Merging #866 (02ad416) into master (872e544) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #866   +/-   ##
=======================================
  Coverage   89.02%   89.02%           
=======================================
  Files         111      111           
  Lines        6043     6043           
  Branches      969      969           
=======================================
  Hits         5380     5380           
  Misses        467      467           
  Partials      196      196           
Flag        Coverage Δ
unittests   89.02% <ø> (ø)



Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Junjun2016 (Collaborator) commented

Hi @FreyWang
Thanks for your contribution again.
It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.
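
For context, the recalculation that base_runner's resume performs boils down to something like the sketch below. This is a paraphrase of the logic around the line linked in the motivation, not the verbatim mmcv code; previous_gpu_ids stands for the gpu_ids read from the config stored in the checkpoint meta:

```python
def rescale_resumed_iter(saved_iter, previous_gpu_ids, world_size):
    """Paraphrased sketch of the iteration rescaling done on resume."""
    if previous_gpu_ids and len(previous_gpu_ids) != world_size:
        # With the old default, a distributed run saved gpu_ids == range(1),
        # so resuming on world_size GPUs wrongly divided the iteration by
        # world_size. Saving range(world_size) makes this a no-op when the
        # GPU count is unchanged, and a correct rescale when it changes.
        return int(saved_iter * len(previous_gpu_ids) / world_size)
    return saved_iter


# A checkpoint written at iteration 10000 on 8 GPUs:
print(rescale_resumed_iter(10000, range(1), 8))  # 1250  -> the bug
print(rescale_resumed_iter(10000, range(8), 8))  # 10000 -> after the fix
```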

@FreyWang (Contributor, Author) commented

> Hi @FreyWang
> Thanks for your contribution again.
> It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
> But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.

I met the problem when using epoch_based_runner. It seems iter_based_runner overrides the resume function and has no problem, but epoch_based_runner does not override it.
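
For comparison, iter_based_runner's own resume restores the counters straight from the checkpoint meta and does no gpu_ids-based rescaling, which is why it is unaffected. A simplified, hypothetical sketch of that behaviour (not the verbatim mmcv code):

```python
def resume_counters_iter_based(checkpoint_meta):
    """Simplified sketch: IterBasedRunner trusts the saved counters as-is."""
    # No len(gpu_ids) arithmetic here, so the default range(1) cannot
    # corrupt the resumed iteration.
    return {
        'epoch': checkpoint_meta['epoch'],
        'iter': checkpoint_meta['iter'],
        'inner_iter': checkpoint_meta['iter'],
    }
```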

@Junjun2016 (Collaborator) commented Sep 10, 2021

> Hi @FreyWang
> Thanks for your contribution again.
> It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
> But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.

Do you have any idea how to adjust the number of iterations for iter_based_runner?
We will discuss it next week.

@Junjun2016 Junjun2016 requested a review from xvjiarui September 10, 2021 05:19
@xvjiarui (Collaborator) commented

Should we add notes somewhere about resuming with a different number of GPUs?

Btw, I don't think resuming with a different batch size is a reasonable operation.

@FreyWang (Contributor, Author) commented

> Hi @FreyWang
> Thanks for your contribution again.
> It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
> But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.
>
> Do you have any idea how to adjust the number of iterations for iter_based_runner?
> We will discuss it next week.

Actually, I agree that a resumed experiment should have the same batch size as the checkpoint.

@FreyWang (Contributor, Author) commented

> Should we add notes somewhere about resuming with a different number of GPUs?
>
> Btw, I don't think resuming with a different batch size is a reasonable operation.

I agree too; maybe the batch size should be required to stay the same.

@Junjun2016 Junjun2016 merged commit c4c2fdc into open-mmlab:master Sep 11, 2021
bowenroom pushed a commit to bowenroom/mmsegmentation that referenced this pull request Feb 25, 2022
…n-mmlab#866)

* [Fix] Fix iter bug when resuming checkpoint in distributed train

* fix lint error

Signed-off-by: FreyWang <[email protected]>
@FreyWang FreyWang deleted the pr1 branch April 9, 2022 05:38
wjkim81 pushed a commit to wjkim81/mmsegmentation that referenced this pull request Dec 3, 2023
…lab#866)

* Add require_serial to prevent multi-threading that causes file I/O conflict
* Modify update_model_index.py to prevent redundant file I/O
* Set sort_keys=True and update model .yml files