
[Fix] Fix iter bug when resuming checkpoint in distributed train #866

Merged 2 commits into open-mmlab:master on Sep 11, 2021

Conversation

@FreyWang (Contributor) commented Sep 10, 2021

Motivation

As shown in
https://github.com/open-mmlab/mmcv/blob/b4bfeb53c57f2c843cb5015f9c0a2d1689dba9c4/mmcv/runner/base_runner.py#L379,
cfg.gpu_ids is used to calculate the current iteration when resuming a checkpoint in epoch_based_runner and base_runner, but in distributed training it defaults to range(1), which leads to a wrong iteration count. This PR fixes it to range(world_size).
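
For context, the fix amounts to setting cfg.gpu_ids from the real world size when distributed training is initialized. The sketch below assumes the general structure of mmsegmentation's tools/train.py; the function name setup_gpu_ids, its arguments, and cfg.dist_params are illustrative, not the actual diff:

```python
from mmcv import Config
from mmcv.runner import get_dist_info, init_dist


def setup_gpu_ids(cfg: Config, launcher: str) -> bool:
    """Illustrative sketch: make cfg.gpu_ids reflect the real process count.

    cfg.gpu_ids ends up in the checkpoint meta and is later used by the
    runner to recalculate the iteration on resume, so under distributed
    training it must not stay at the single-GPU default range(1).
    """
    if launcher == 'none':
        # Non-distributed run: the default already matches reality.
        cfg.gpu_ids = getattr(cfg, 'gpu_ids', range(1))
        return False
    # Distributed run: initialize the process group, then record the
    # actual number of processes instead of the default range(1).
    init_dist(launcher, **cfg.dist_params)
    _, world_size = get_dist_info()
    cfg.gpu_ids = range(world_size)
    return True
```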

@codecov (bot) commented Sep 10, 2021

Codecov Report

Merging #866 (02ad416) into master (872e544) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #866   +/-   ##
=======================================
  Coverage   89.02%   89.02%           
=======================================
  Files         111      111           
  Lines        6043     6043           
  Branches      969      969           
=======================================
  Hits         5380     5380           
  Misses        467      467           
  Partials      196      196           
Flag        Coverage Δ
unittests   89.02% <ø> (ø)



Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Junjun2016 (Collaborator) commented

Hi @FreyWang
Thanks for your contribution again.
It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.
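
For context, the recalculation that base_runner's resume performs boils down to something like the sketch below. This is a paraphrase of the logic around the line linked in the motivation, not the verbatim mmcv code; previous_gpu_ids stands for the gpu_ids read from the config stored in the checkpoint meta:

```python
def rescale_resumed_iter(saved_iter, previous_gpu_ids, world_size):
    """Paraphrased sketch of the iteration rescaling done on resume."""
    if previous_gpu_ids and len(previous_gpu_ids) != world_size:
        # With the old default, a distributed run saved gpu_ids == range(1),
        # so resuming on world_size GPUs wrongly divided the iteration by
        # world_size. Saving range(world_size) makes this a no-op when the
        # GPU count is unchanged, and a correct rescale when it changes.
        return int(saved_iter * len(previous_gpu_ids) / world_size)
    return saved_iter


# A checkpoint written at iteration 10000 on 8 GPUs:
print(rescale_resumed_iter(10000, range(1), 8))  # 1250  -> the bug
print(rescale_resumed_iter(10000, range(8), 8))  # 10000 -> after the fix
```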

@FreyWang (Contributor, Author) commented

> Hi @FreyWang
> Thanks for your contribution again.
> It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
> But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.

I met the problem when using epoch_based_runner. It seems iter_based_runner overrides the resume function and has no problem, but epoch_based_runner does not override it.
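
For comparison, iter_based_runner's own resume restores the counters straight from the checkpoint meta and does no gpu_ids-based rescaling, which is why it is unaffected. A simplified, hypothetical sketch of that behaviour (not the verbatim mmcv code):

```python
def resume_counters_iter_based(checkpoint_meta):
    """Simplified sketch: IterBasedRunner trusts the saved counters as-is."""
    # No len(gpu_ids) arithmetic here, so the default range(1) cannot
    # corrupt the resumed iteration.
    return {
        'epoch': checkpoint_meta['epoch'],
        'iter': checkpoint_meta['iter'],
        'inner_iter': checkpoint_meta['iter'],
    }
```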

@Junjun2016 (Collaborator) commented Sep 10, 2021

> Hi @FreyWang
> Thanks for your contribution again.
> It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
> But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.

Do you have any idea how to adjust the number of iterations for iter_based_runner?
We will discuss it next week.

@Junjun2016 Junjun2016 requested a review from xvjiarui September 10, 2021 05:19
@xvjiarui (Collaborator) commented

Should we add notes somewhere about resuming with a different number of GPUs?

Btw, I don't think resuming with a different batch size is a reasonable operation.

@FreyWang (Contributor, Author) commented

> Hi @FreyWang
> Thanks for your contribution again.
> It can resume the right number of iterations for epoch_based_runner, which inherits resume from base_runner.
> But for iter_based_runner, which overrides the resume function, maybe we need to adjust the max number of iterations.
>
> Do you have any idea how to adjust the number of iterations for iter_based_runner?
> We will discuss it next week.

Actually, I agree that a resumed experiment should have the same batch size as the checkpoint.

@FreyWang (Contributor, Author) commented

> Should we add notes somewhere about resuming with a different number of GPUs?
>
> Btw, I don't think resuming with a different batch size is a reasonable operation.

I agree too; maybe the batch size should be required to stay the same.

@Junjun2016 Junjun2016 merged commit c4c2fdc into open-mmlab:master Sep 11, 2021
bowenroom pushed a commit to bowenroom/mmsegmentation that referenced this pull request Feb 25, 2022
…n-mmlab#866)

* [Fix] Fix iter bug when resuming checkpoint in distributed train

* fix lint error

Signed-off-by: FreyWang <[email protected]>
@FreyWang FreyWang deleted the pr1 branch April 9, 2022 05:38
wjkim81 pushed a commit to wjkim81/mmsegmentation that referenced this pull request Dec 3, 2023
…lab#866)

* Add require_serial to prevent multi-threading that causes file I/O conflict
* Modify update_model_index.py to prevent redundant file I/O
* Set sort_keys=True and update model .yml files