Question: how to run inference on Int8 models (GPT) supported through ZeroQuant technology? #2301
Comments
Is there any guide to running inference on compressed models (especially ZeroQuant)?
@xk503775229 bigscience-workshop/Megatron-DeepSpeed#339
@mayank31398 When I use the BLOOM way to load my checkpoint, I get "GPT2 checkpoint type is not supported". I was trying out the compression library for ZeroQuant quantization (for a GPT-2 model). While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference.
Can you share a code snippet you used for loading GPT?
In general, the code is only supposed to work with Megatron checkpoints.
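For context, the "BLOOM way" referred to above passes a checkpoint json to deepspeed.init_inference. The sketch below is an assumption-laden illustration of that path (the model, file names, and json contents are placeholders, not taken from this thread); the json "type" field is where an unrecognized value such as GPT2 would be rejected.

```python
import json
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical checkpoint json describing pre-sharded weights.
# DeepSpeed only recognizes certain "type" values (e.g. Megatron-style
# checkpoints), which is why a GPT2 type fails above.
checkpoints_json = "checkpoints.json"
with open(checkpoints_json, "w") as f:
    json.dump(
        {
            "type": "Megatron",
            "checkpoints": ["shard_00.pt", "shard_01.pt"],  # placeholder shard files
            "version": 1.0,
        },
        f,
    )

# Build the model skeleton; real weights come from the checkpoint json below.
config = AutoConfig.from_pretrained("gpt2")          # placeholder architecture
model = AutoModelForCausalLM.from_config(config).half()

engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.int8,                 # request int8 inference
    checkpoint=checkpoints_json,      # load weights from the sharded checkpoint
    replace_with_kernel_inject=True,  # inject DeepSpeed's fused kernels
)
```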
Hi @xk503775229, thanks for your interest in trying Int8 for other models. In general you should be able to do so; however, one issue here is that you want to use this while loading a checkpoint, which is not currently supported for all models. Regarding Int8 inference, have you tried using init_inference and simply passing int8 without a checkpoint json (i.e., letting the model be loaded as it originally was, in fp16)?
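A minimal sketch of that suggestion, assuming a Hugging Face GPT-2 model on a single GPU (the model name and generation settings are placeholders; argument names follow the DeepSpeed API from around the time of this thread and may differ in newer releases):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model normally in fp16; no checkpoint json is involved.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)

# Ask DeepSpeed to inject its inference kernels and run in int8.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.int8,                 # int8 inference
    replace_with_kernel_inject=True,  # use DeepSpeed's fused kernels
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```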
Hi, is there a timeline for the release of the int8 CUDA kernels?
Hi, this will be released later as part of MII (MII-Azure): https://github.com/microsoft/DeepSpeed-MII
Closed for now. Please re-open it if you need further assistance.
I just used DeepSpeed ZeroQuant to compress my model, but I don't know how to use DeepSpeed to run inference on it. Is there any documentation describing this?
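For readers landing here, a rough sketch of the compression step being described, based on DeepSpeed's compression library; the config file name and its contents are assumptions rather than something posted in this thread.

```python
from transformers import AutoModelForCausalLM
from deepspeed.compression.compress import init_compression, redundancy_clean

# Hypothetical DeepSpeed config containing a "compression_training" section
# that enables weight (and optionally activation) quantization.
ds_config = "ds_config_zeroquant.json"

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model

# Wrap the relevant layers with quantization-aware modules per the config.
model = init_compression(model, ds_config)

# ... an optional lightweight calibration / distillation pass would go here ...

# Fold the quantization wrappers back into plain modules before saving.
model = redundancy_clean(model, ds_config)
model.save_pretrained("gpt2-zeroquant")                # hypothetical output path
```

Note that this step only produces quantized weights; actual throughput/latency gains depend on the int8 inference kernels discussed above (init_inference with int8, or MII), which is why compression alone showed no speedup in this thread.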