Question: how to run inference on Int8 models (GPT) supported through ZeroQuant technology? #2301
Comments
Is there any guide to running inference on compressed models (especially ZeroQuant)?
@xk503775229 bigscience-workshop/Megatron-DeepSpeed#339
@mayank31398 When I use the BLOOM way to load my checkpoint, I get "GPT2 checkpoint type is not supported". I was trying out the compression library for ZeroQuant quantization (for a GPT-2 model). While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference.
Can you share a code snippet you used for loading GPT?
In general, the code is only supposed to work with Megatron checkpoints.
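For context, the "BLOOM way" referred to above passes a checkpoint json to deepspeed.init_inference. The sketch below is an assumption-laden illustration of that path (the model, file names, and json contents are placeholders, not taken from this thread); the json "type" field is where an unrecognized value such as GPT2 would be rejected.

```python
import json
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical checkpoint json describing pre-sharded weights.
# DeepSpeed only recognizes certain "type" values (e.g. Megatron-style
# checkpoints), which is why a GPT2 type fails above.
checkpoints_json = "checkpoints.json"
with open(checkpoints_json, "w") as f:
    json.dump(
        {
            "type": "Megatron",
            "checkpoints": ["shard_00.pt", "shard_01.pt"],  # placeholder shard files
            "version": 1.0,
        },
        f,
    )

# Build the model skeleton; real weights come from the checkpoint json below.
config = AutoConfig.from_pretrained("gpt2")          # placeholder architecture
model = AutoModelForCausalLM.from_config(config).half()

engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.int8,                 # request int8 inference
    checkpoint=checkpoints_json,      # load weights from the sharded checkpoint
    replace_with_kernel_inject=True,  # inject DeepSpeed's fused kernels
)
```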
Hi @xk503775229, thanks for your interest in trying Int8 for other models. In general you should be able to do so; however, one issue here is that you want to use this while loading a checkpoint, which is not currently supported for all models. Regarding Int8 inference, have you tried using init_inference and simply passing int8 without a checkpoint json (i.e., letting the model be loaded as it originally was, in fp16)?
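A minimal sketch of that suggestion, assuming a Hugging Face GPT-2 model on a single GPU (the model name and generation settings are placeholders; argument names follow the DeepSpeed API from around the time of this thread and may differ in newer releases):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model normally in fp16; no checkpoint json is involved.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)

# Ask DeepSpeed to inject its inference kernels and run in int8.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.int8,                 # int8 inference
    replace_with_kernel_inject=True,  # use DeepSpeed's fused kernels
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```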
Hi, is there a timeline for the release of the int8 CUDA kernels?
Hi, this will be released later as part of MII (MII-Azure): https://github.com/microsoft/DeepSpeed-MII
Closed for now. Please re-open it if you need further assistance.
I just used DeepSpeed ZeroQuant to compress my model, but I don't know how to use DeepSpeed to run inference on it. Is there any documentation describing this?
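For readers landing here, a rough sketch of the compression step being described, based on DeepSpeed's compression library; the config file name and its contents are assumptions rather than something posted in this thread.

```python
from transformers import AutoModelForCausalLM
from deepspeed.compression.compress import init_compression, redundancy_clean

# Hypothetical DeepSpeed config containing a "compression_training" section
# that enables weight (and optionally activation) quantization.
ds_config = "ds_config_zeroquant.json"

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model

# Wrap the relevant layers with quantization-aware modules per the config.
model = init_compression(model, ds_config)

# ... an optional lightweight calibration / distillation pass would go here ...

# Fold the quantization wrappers back into plain modules before saving.
model = redundancy_clean(model, ds_config)
model.save_pretrained("gpt2-zeroquant")                # hypothetical output path
```

Note that this step only produces quantized weights; actual throughput/latency gains depend on the int8 inference kernels discussed above (init_inference with int8, or MII), which is why compression alone showed no speedup in this thread.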