perf: conv1d quantization #1601
Conversation
Hello @ebraraktas, can you confirm whether this issue, SYSTRAN/faster-whisper#716, is related to the recent PR #1597? According to my understanding, after this PR it does work with int8 quantization, even though it is not as fast as expected?

I don't think #1597 causes the issue you mentioned, because it must output the same tensor as before, but faster. #1597 is implemented for …

I see. I think you can set the conv1d quantization to be ignored by default. It prevents other users from being confused by a feature they don't use.
@minhthuc2502 With my last commit I implemented the following:
LGTM! Thank you for your update.
@ebraraktas This is amazing work! I wonder if you have plans to add groups to Conv1D, which is missing compared to PyTorch's Conv1d. I am browsing other ASR models to be supported in CTranslate2, and the missing groups argument in Conv1D is a bummer.
@homink that seems doable; I think I have found a way to do it. However, my implementation is CPU only, and I have to do some research for CUDA. BTW, which model will you try when this is implemented?
@ebraraktas I believe the most popular ASR models would be Whisper and Wav2Vec2 (including Wav2Vec2Bert, used in Meta's Seamless M4T). I made a PR for the Wav2Vec2 model here, but it only partially computes the transformer blocks due to the missing groups. I tried a work-around, implementing the depthwise convolution processing in src/layers/wav2vec2.cc (see the sketch below), but it performs slowly; I think it doesn't accommodate CPU threading, and that's why. The best way would be to let Conv1D work with groups. For GPU, I found some clues in the cuDNN documentation and made the attached diff, but I am not sure if it would work. You may want to have a look at it. I am also working on the Wav2Vec2Bert model to make an additional PR.
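For illustration only, the depthwise/grouped work-around described above amounts to running an independent small convolution per channel group. A rough numpy sketch follows (shapes and names are assumptions, not the actual src/layers/wav2vec2.cc code); the sequential loop over groups is also why such an approach doesn't parallelize well on CPU:

```python
import numpy as np

def grouped_conv1d(x, weight, bias, groups):
    # x: (in_channels, time), weight: (out_channels, in_channels // groups, kernel)
    out_channels, group_in, kernel = weight.shape
    in_channels, time = x.shape
    out_time = time - kernel + 1          # stride 1, no padding
    out_per_group = out_channels // groups
    out = np.empty((out_channels, out_time), dtype=x.dtype)
    for g in range(groups):               # sequential per-group loop
        xg = x[g * group_in:(g + 1) * group_in]
        wg = weight[g * out_per_group:(g + 1) * out_per_group]
        for t in range(out_time):
            window = xg[:, t:t + kernel]  # (group_in, kernel)
            out[g * out_per_group:(g + 1) * out_per_group, t] = np.tensordot(
                wg, window, axes=([1, 2], [0, 1]))
    return out + bias[:, None]
```

Depthwise convolution is the special case where `groups` equals the number of input channels, so each group convolves a single channel.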
@ebraraktas, I tried the GPU clue above, but it is not working. There must be something more that I am not aware of.
This PR adds quantized `Conv1D` inference on top of #1597. With the previous `int8` quantization implementation, this quantized inference couldn't bring any speed-up because quantization was the bottleneck. To alleviate that, I implemented vectorized versions of `int8` quantization.

Improvement in `benchmark_ops.cc` was measured on a `1500x1152` input, as in `whisper-tiny/encoder/conv2`. For AVX and AVX2 I couldn't implement it as efficiently as for AVX512, so the speed-up is smaller there; the AVX tests were run on a Xeon W-2195 and the Neon tests on a Samsung S21.
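For context, the `int8` quantization being vectorized is (assuming the usual per-row absolute-max scheme used for int8 GEMM inputs) just a scale-and-round over each row, which is what made it a bottleneck when done scalar. A rough numpy sketch of the math, not the actual kernel code:

```python
import numpy as np

def quantize_int8(x):
    # Per-row absolute-max scaling to int8 (assumed scheme); the PR vectorizes
    # this computation with AVX512/AVX/NEON intrinsics in the C++ kernels.
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = 127.0 / np.maximum(amax, np.finfo(x.dtype).tiny)
    return np.rint(x * scale).astype(np.int8), scale
```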
In actual model inference I observed an even better improvement in total quantization duration in the Whisper encoder (measured on a Samsung S21).
In conclusion, this improvement speeds up convolution inference by 2x compared to the f32 GEMM implementation in #1597, so the combined speed-up compared to the implementation on `master` is about 30x for single-threaded inference on a Samsung S21.

Questions to the Reviewer
- `is_quantizable` implementations: quantized `Conv1D` weights are not converted to the requested data type for now. I can add this easily with a small diff.
- `is_linear_weight`: to my understanding, quantized `Conv1D` weights can be considered `linear`, but I don't fully know the impacts.
- Quantized `Conv1D` is not implemented for some backends, and users may convert models with `INT8` quantization and then try to run them on those backends. Aside from implementing it for those backends, one option may be letting them convert `INT8` models without quantized Conv1D weights. I could easily add that functionality with an environment variable check in `Conv1DSpec.__init__`, but I am not sure if it is the best solution. See the snippet below.
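The referenced snippet is not included in this extract; the following is only a sketch of the kind of check meant here. The environment variable name (`CT2_KEEP_CONV1D_FLOAT`) and the `weight_scale` attribute are illustrative assumptions, not the actual code from this PR:

```python
import os

from ctranslate2.specs import model_spec

# Hypothetical opt-out: when this variable is set, the converter would keep
# Conv1D weights in float even for INT8 models, so backends without quantized
# Conv1D support can still run them.
_KEEP_CONV1D_FLOAT = os.environ.get("CT2_KEEP_CONV1D_FLOAT", "0") == "1"


class Conv1DSpec(model_spec.LayerSpec):
    def __init__(self):
        self.weight = None
        self.bias = model_spec.OPTIONAL
        if not _KEEP_CONV1D_FLOAT:
            # Scale variable declared only when Conv1D quantization is wanted
            # (illustrative attribute name).
            self.weight_scale = model_spec.OPTIONAL
```

A user hitting the unsupported-backend case would set the variable at conversion time and still get `int8` weights everywhere else.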