[ds-inference bloom] tweaks #340
Conversation
@stas00 I used to get ~66 msec/token (batch size = 1) for DS-inference with fp16. Can you confirm whether you are also observing performance drops?
I'm not sure we are using the same hardware, but I'm getting pretty similar performance, please see: and the diff from before: 40 msec => 44 msec. int8 itself is of course slower than fp16.
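For context, per-token latency numbers like these are usually obtained by generating a fixed number of tokens at batch size 1 and dividing wall-clock time by the token count. Below is a minimal sketch of that kind of measurement; the model name is a small placeholder and this is not the actual benchmark script used in this thread:

```python
# Hypothetical sketch of measuring msec/token at batch size 1.
# Assumptions: a CUDA GPU, transformers installed, and a placeholder
# checkpoint (the thread benchmarks full BLOOM via DS-inference).
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # placeholder, not the model from the thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
new_tokens = 100

# Warmup run to exclude one-time CUDA/kernel initialization costs.
model.generate(**inputs, max_new_tokens=5)
torch.cuda.synchronize()

start = time.perf_counter()
# Note: generation can stop early on EOS, which would skew the average;
# a real benchmark should verify the produced token count.
out = model.generate(**inputs, max_new_tokens=new_tokens)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed / generated * 1000:.1f} msec/token over {generated} tokens")
```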
So you are saying your performance dropped by only 4 msec? The int8 numbers match exactly for me. Also, if possible, please let me know your PyTorch and CUDA versions.
No idea; you're not using the same machine, so it's normal to get different results. Even the GPUs could be slightly different I think, or perhaps the PCIe type/channels.

specs:
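The environment details asked about above can be reported with standard PyTorch attributes; `python -m torch.utils.collect_env` prints a fuller report. A quick sketch:

```python
# Report PyTorch/CUDA versions and the GPU model, as requested above.
import torch

print("torch:", torch.__version__)
print("CUDA (built with):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
```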
This PR is adding: