[DeepSpeed] ZeRO stage 3 integration: getting started and issues #11044

Closed
stas00 opened this issue Apr 2, 2021 · 8 comments

@stas00 (Contributor) commented Apr 2, 2021

Why would you want ZeRO-3

In a few words: ZeRO-2 was very limited scalability-wise. If model.half() couldn't fit onto a single GPU, adding more GPUs didn't help, so with a 24GB GPU you couldn't train a model larger than about 5B parameters.

With ZeRO-3 the model weights are partitioned across multiple GPUs and can additionally be offloaded to CPU, so the upper limit on model size has increased by about two orders of magnitude. That is, ZeRO-3 allows you to scale to huge models with trillions of parameters, assuming you have enough GPUs and general RAM to support it. ZeRO-3 can benefit a lot from general RAM if you have it; if not, that's OK too. ZeRO-3 combines the memory of all your GPUs and your general RAM into one vast pool of memory.

Even if you have just a single GPU but a lot of general RAM, ZeRO-3 will allow you to fit larger models.

Of course, if you run in an environment like the free Google Colab, while you can run DeepSpeed there, you get so little general RAM that it's very hard to make something out of nothing. In some sessions you get only 12GB of RAM, which is impossible to work with - you want at least 24GB instances. Setting it up can be tricky too; please see this notebook for an example:
https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb

Getting started

Install the latest deepspeed version:

pip install deepspeed
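
If you want to double-check that the install went well, a quick sanity check (assuming a standard pip environment) is to print the version and run DeepSpeed's ds_report utility, which reports which ops are compatible with / pre-built for your setup:

python -c "import deepspeed; print(deepspeed.__version__)"
ds_report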

You will want to be on the transformers master branch if you want to run a quick test:

git clone https://github.com/huggingface/transformers
cd transformers
BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/seq2seq/run_translation.py \
--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_val_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 1 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 0 --save_steps 0 \
--eval_steps 1 --group_by_length   --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3.json
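
The deepspeed launcher uses all visible GPUs by default. If you want to run the quick test on a specific number of GPUs, you can pass the launcher's --num_gpus argument while keeping the rest of the command line the same, e.g.:

deepspeed --num_gpus=2 examples/seq2seq/run_translation.py \
    ...  # same arguments as above, including --deepspeed tests/deepspeed/ds_config_zero3.json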

You will find very detailed configuration documentation here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed

Your new config file will look like this:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 3,
        "cpu_offload": true,
        "cpu_offload_params": true,
        "cpu_offload_use_pin_memory" : true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 0.94e6,
        "stage3_param_persistence_threshold": 1e4,
        "reduce_bucket_size": 1e6,
        "prefetch_bucket_size": 3e6,
        "sub_group_size": 1e14,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

So if you were already using ZeRO-2, it's only the zero_optimization section that has changed.
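
For comparison, this is roughly what the zero_optimization section of a ZeRO-2 config looked like (a sketch based on the ZeRO-2 example in the docs linked above; your exact values may differ):

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    }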

One of the biggest nuances of ZeRO-3 is that the model weights aren't inside model.state_dict(), as they are spread out across multiple GPUs. The Trainer has been modified to support this, but you will notice that saving the model is slow, as it has to consolidate the weights from all the GPUs. I'm planning to do more performance improvements in future PRs, but for now let's focus on making things work.

Issues / Questions

If you have any general questions, or something is unclear or missing in the docs, please don't hesitate to ask in this thread. But for any bugs or problems, please open a new Issue and tag me there. You don't need to tag anybody else. Thank you!

@stas00 (Contributor, Author) commented Apr 27, 2021

Superseded by #11464

@sajastu commented Jun 29, 2021

Hi @stas00,

Thank you for working on this amazing library. I looked into the DeepSpeed documentation for optimizers at https://deepspeed.readthedocs.io/en/latest/optimizers.html and there are a bunch of optimizers, but Adafactor is not among them. As transformers has the --adafactor flag to decide whether the default Adam(W) optimizer should be replaced, I'm wondering if setting adafactor=True in transformers results in a conflict when DeepSpeed is included.

So what is the workaround for this? It sounds like we are not able to use the Adafactor optimizer with DeepSpeed and can only use those listed in the DeepSpeed docs, right?

Thanks!
Sajad

@stas00 (Contributor, Author) commented Jun 29, 2021

--adafactor should work just fine. This was just a buglet where that argument was ignored, which has been fixed in #11749.

To use --adafactor or any other optimizer that is not native to DeepSpeed, you simply need to not configure the optimizer section in the ds_config.json file.
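
For illustration, a minimal sketch of such a ds_config.json with no optimizer block - the Trainer will then create the optimizer selected by its own command-line arguments (e.g. --adafactor). Offloading is left out of this sketch; see the offload caveat discussed in the comments below:

{
    "fp16": {
        "enabled": true
    },

    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    }
}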

I guess I could expand on this here:
https://huggingface.co/transformers/master/main_classes/deepspeed.html#optimizer

@sajastu commented Jun 29, 2021

Thanks for your response, @stas00. I tried the way you mentioned (i.e., dropping the "optimizer" part from the config file), but it seems that ZeRO Offload is only able to work with DeepSpeed optimizers. The exact traceback is given below:

Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 617, in <module>
    main()
  File "examples/pytorch/summarization/run_summarization.py", line 541, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/trainer.py", line 1118, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/deepspeed.py", line 329, in deepspeed_init
    raise ValueError("ZeRO Offload can only work with DeepSpeed optimizers")
ValueError: ZeRO Offload can only work with DeepSpeed optimizers

Update:
I commented out the error-raising lines (328-329) and it works fine now. I guess that might be useful when updating the docs? Since it doesn't work with just "not configuring the optimizer part", changes to other keys of the config file (such as zero_optimization.offload_optimizer) might be needed as well. Just a suggestion :)

@stas00 (Contributor, Author) commented Jun 29, 2021

Regarding removing the verification: are you sure it's actually doing the right thing? Not failing doesn't necessarily mean it's working correctly.

@sajastu commented Jun 29, 2021

@stas00 My intuition is that the error means: if you want to use an optimizer other than the DeepSpeed-native ones, zero_optimization.offload_optimizer should be left out, since it only works with native DeepSpeed optimizers. Would commenting out this assertion cause any issues? It seems to work just fine (i.e., the training loss is decreasing):

{'loss': 3.2968, 'learning_rate': 2.024227503252014e-05, 'epoch': 0.21}                                                                                                                                                       
{'loss': 3.0326, 'learning_rate': 2.2499999999999998e-05, 'epoch': 0.42}
...

@stas00 (Contributor, Author) commented Jun 29, 2021

Let's ask the DeepSpeed devs: deepspeedai/DeepSpeed#1194

Meanwhile if it works for you, that's great! Thank you for doing the experiment.

@stas00 (Contributor, Author) commented Jul 13, 2021

@sajastu, should be fixed in #12690
