only main process should call _save on deepspeed zero3 #25959
Conversation
src/transformers/trainer.py (outdated)

```python
if self.args.should_save:
    self._save(output_dir, state_dict=state_dict)
# remove the dummy state_dict
remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
self.model_wrapped.save_checkpoint(output_dir)
```
This does not feel like the right solution, as `should_save` checks whether we are the main process, and we should only ever be saving once. You can have `self.model_wrapped.save_checkpoint` under that `self.args.should_save`, I believe, but I'm not sure this is the right "fix". Definitely cc @pacman100 here.
`self.model_wrapped.save_checkpoint` needs to be called on each process, because each process only holds part of the model weights under DeepSpeed ZeRO-3.
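As a minimal sketch of that split (not the exact Trainer code; `state_dict` is assumed to be the gathered weights available to the main process):

```python
# Sketch only: gate the consolidated save to the main process, but let every
# rank join the DeepSpeed checkpoint, since under ZeRO-3 each rank holds only
# its own shard of the partitioned weights.
if self.args.should_save:
    self._save(output_dir, state_dict=state_dict)   # main process writes once
self.model_wrapped.save_checkpoint(output_dir)      # every rank saves its shard
```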
Got it, thanks for the clarification!
Hello, thank you for the fix! However, it needs a correction, as explained in the comment below.
src/transformers/trainer.py (outdated)

```python
if self.args.should_save:
    self._save(output_dir, state_dict=state_dict)
# remove the dummy state_dict
remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
```
This would remove the legitimate model checkpoint when `stage3_gather_16bit_weights_on_model_save=True`. `remove_dummy_checkpoint` should only be called in the exception block.
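A hedged sketch of the control flow being requested (the gather step and exception type are assumptions, not the exact Trainer code): the success path writes the legitimate checkpoint and must not touch it; only the fallback path writes a dummy `state_dict` that `remove_dummy_checkpoint` cleans up.

```python
try:
    # assumed gather step: consolidates the full 16-bit weights when
    # stage3_gather_16bit_weights_on_model_save=True
    state_dict = self.accelerator.get_state_dict(self.model_wrapped)
    if self.args.should_save:
        self._save(output_dir, state_dict=state_dict)  # legitimate checkpoint
except ValueError:
    if self.args.should_save:
        # dummy save keeps the non-weight files (config, tokenizer, ...)
        self._save(output_dir, state_dict={})
    remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
    self.model_wrapped.save_checkpoint(output_dir)  # each rank saves its shard
```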
Thanks for your explanation; it has been fixed.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@pacman100 anything I need to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for iterating, LGTM!
@zjjMaiMai One of the hub tests is failing, complaining that the `base_model` is empty when pushing to the hub. Could you try running this test locally to see whether it's a result of the changes in this PR?
@zjjMaiMai Could you try and rebase on main? This should resolve the failing tests.
All green! @amyeroberts |
only main process should call _save on deepspeed zero3 (#25959)

Background

`trainer._save` is called on all processes after #25817, which raises a `FileExistsError` when saving the model.

What does this PR do?

This PR fixes it: `trainer._save` is now called on the main process only.
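For illustration only (this is not the actual `_save` internals): when every rank runs the same save logic, any step that creates a file or directory exclusively raises `FileExistsError` on all but the first rank to reach it. A minimal sketch:

```python
import os

def save_like_every_rank(output_dir: str) -> None:
    # Hypothetical stand-in for an exclusive-create step inside a save path.
    # If N processes race through this, N - 1 of them hit FileExistsError;
    # gating the call on args.should_save (main process only) avoids the race.
    os.makedirs(output_dir)  # exist_ok defaults to False
```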