[DeepSpeed] ZeRO stage 3 integration: getting started and issues #11044

Closed
stas00 opened this issue Apr 2, 2021 · 8 comments

@stas00 (Contributor) commented Apr 2, 2021

Why would you want ZeRO-3

In a few words: ZeRO-2 was very limited scalability-wise. If model.half() couldn't fit onto a single GPU, adding more GPUs didn't help, so with a 24GB GPU you couldn't train a model larger than about 5B parameters.

With ZeRO-3 the model weights are partitioned across multiple GPUs and can additionally be offloaded to CPU, so the upper limit on model size has increased by about two orders of magnitude. That is, ZeRO-3 allows you to scale to huge models with trillions of parameters, assuming you have enough GPUs and general RAM to support it. ZeRO-3 can benefit a lot from general RAM if you have it; if not, that's OK too. ZeRO-3 combines the memory of all your GPUs and your general RAM into one vast pool of memory.

Even if you have just a single GPU but a lot of general RAM, ZeRO-3 will allow you to fit larger models.

Of course, if you run in an environment like the free Google Colab, while you can run DeepSpeed there, you get so little general RAM that it's very hard to make something out of nothing. In some sessions you get only 12GB of RAM, which is impossible to work with - you want at least 24GB instances. Setting it up can be tricky too; please see this notebook for an example:
https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb

Getting started

Install the latest deepspeed version:

pip install deepspeed
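
If you want to double-check that the install went well, a quick sanity check (assuming a standard pip environment) is to print the version and run DeepSpeed's ds_report utility, which reports which ops are compatible with / pre-built for your setup:

python -c "import deepspeed; print(deepspeed.__version__)"
ds_report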

You will want to be on the transformers master branch if you want to run a quick test:

git clone https://github.com/huggingface/transformers
cd transformers
BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/seq2seq/run_translation.py \
--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_val_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 1 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 0 --save_steps 0 \
--eval_steps 1 --group_by_length   --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3.json
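
The deepspeed launcher uses all visible GPUs by default. If you want to run the quick test on a specific number of GPUs, you can pass the launcher's --num_gpus argument while keeping the rest of the command line the same, e.g.:

deepspeed --num_gpus=2 examples/seq2seq/run_translation.py \
    ...  # same arguments as above, including --deepspeed tests/deepspeed/ds_config_zero3.json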

You will find very detailed configuration documentation here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed

Your new config file will look like this:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 3,
        "cpu_offload": true,
        "cpu_offload_params": true,
        "cpu_offload_use_pin_memory" : true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 0.94e6,
        "stage3_param_persistence_threshold": 1e4,
        "reduce_bucket_size": 1e6,
        "prefetch_bucket_size": 3e6,
        "sub_group_size": 1e14,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

So if you were already using ZeRO-2, it's only the zero_optimization section that has changed.
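
For comparison, this is roughly what the zero_optimization section of a ZeRO-2 config looked like (a sketch based on the ZeRO-2 example in the docs linked above; your exact values may differ):

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    }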

One of the biggest nuances of ZeRO-3 is that the model weights aren't inside model.state_dict(), as they are spread out across multiple GPUs. The Trainer has been modified to support this, but you will notice that saving the model is slow, as it has to consolidate the weights from all the GPUs. I'm planning to do more performance improvements in future PRs, but for now let's focus on making things work.

Issues / Questions

If you have any general questions, or something is unclear or missing in the docs, please don't hesitate to ask in this thread. But for any bugs or problems, please open a new Issue and tag me there. You don't need to tag anybody else. Thank you!

@stas00 (Contributor, Author) commented Apr 27, 2021

Superseded by #11464

@sajastu commented Jun 29, 2021

Hi @stas00,

Thank you for working on this amazing library. I looked into the DeepSpeed documentation for optimizers at https://deepspeed.readthedocs.io/en/latest/optimizers.html and there are a bunch of optimizers, but Adafactor is not among them. As transformers has the --adafactor flag to decide whether the default Adam(W) optimizer should be replaced, I'm wondering if setting adafactor=True in transformers results in a conflict when DeepSpeed is included.

So what is the workaround for this? It sounds like we are not able to use the Adafactor optimizer with DeepSpeed and can only use those listed in the DeepSpeed docs, right?

Thanks!
Sajad

@stas00 (Contributor, Author) commented Jun 29, 2021

--adafactor should work just fine. This was just a buglet where that argument was ignored, which has been fixed in #11749.

To use --adafactor or any other optimizer that is not native to DeepSpeed, you simply need to not configure the optimizer section in the ds_config.json file.
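
For illustration, a minimal sketch of such a ds_config.json with no optimizer block - the Trainer will then create the optimizer selected by its own command-line arguments (e.g. --adafactor). Offloading is left out of this sketch; see the offload caveat discussed in the comments below:

{
    "fp16": {
        "enabled": true
    },

    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    }
}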

I guess I could expand on this here:
https://huggingface.co/transformers/master/main_classes/deepspeed.html#optimizer

@sajastu commented Jun 29, 2021

Thanks for your response, @stas00. I tried the way you mentioned (i.e., dropping the "optimizer" part from the config file), but it seems that ZeRO Offload is only able to work with DeepSpeed optimizers. The exact traceback is given below:

Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 617, in <module>
    main()
  File "examples/pytorch/summarization/run_summarization.py", line 541, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/trainer.py", line 1118, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/deepspeed.py", line 329, in deepspeed_init
    raise ValueError("ZeRO Offload can only work with DeepSpeed optimizers")
ValueError: ZeRO Offload can only work with DeepSpeed optimizers

Update:
I commented out the error-raising lines (328-329) and it works fine now. I guess that might be useful when updating the docs? Since it doesn't work with just "not configuring the optimizer part", changes to other keys of the config file (such as zero_optimization.offload_optimizer) might be needed as well. Just a suggestion :)

@stas00 (Contributor, Author) commented Jun 29, 2021

Regarding removing the verification: are you sure it's actually doing the right thing? Not failing doesn't necessarily mean it's working correctly.

@sajastu commented Jun 29, 2021

@stas00 My intuition is that the error means: if you want to use an optimizer other than the DeepSpeed-native ones, zero_optimization.offload_optimizer should be left out, since it only works with native DeepSpeed optimizers. Would commenting out this assertion cause any issues? It seems to work just fine (i.e., the training loss is decreasing):

{'loss': 3.2968, 'learning_rate': 2.024227503252014e-05, 'epoch': 0.21}                                                                                                                                                       
{'loss': 3.0326, 'learning_rate': 2.2499999999999998e-05, 'epoch': 0.42}
...

@stas00 (Contributor, Author) commented Jun 29, 2021

Let's ask the DeepSpeed devs: deepspeedai/DeepSpeed#1194

Meanwhile if it works for you, that's great! Thank you for doing the experiment.

@stas00 (Contributor, Author) commented Jul 13, 2021

@sajastu, should be fixed in #12690
