
perf: conv1d quantization #1601

Merged
merged 5 commits into OpenNMT:master from perf/conv1d-quantization
Mar 25, 2024

Conversation

ebraraktas
Contributor

@ebraraktas ebraraktas commented Jan 15, 2024

This PR adds quantized Conv1D inference on top of #1597.

With the previous int8 quantization implementation, this quantized inference couldn't bring any speed-up because quantization was the bottleneck. To alleviate that, I implemented vectorized versions of int8 quantization:

Improvement in benchmark_ops.cc (on 1500x1152 input, as in whisper-tiny/encoder/conv2):

ISA    | Avg. (master) | Avg. (vectorized) | Speed-up
AVX    | 0.22 ms       | 0.07 ms           | 3.1x
AVX2   | 0.22 ms       | 0.07 ms           | 3.1x
AVX512 | 0.22 ms       | 0.034 ms          | 6.5x
NEON   | 6.74 ms       | 1.13 ms           | 6.0x

For AVX and AVX2, I couldn't implement it as efficiently as for AVX512, hence the speed-up is smaller. The AVX tests were run on a Xeon W-2195 and the NEON tests on a Samsung S21.
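
Roughly, the vectorized per-row quantization kernel looks like the sketch below (AVX-512 variant); the function name, signature, and scaling convention are illustrative assumptions, not the actual CTranslate2 code:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Quantize one float row to int8 with a symmetric per-row scale (amax -> 127),
// processing 16 values per 512-bit register. Requires AVX-512F.
void quantize_row_s8_avx512(const float* x, std::size_t n,
                            std::int8_t* q, float* scale_out) {
  // Pass 1: absolute maximum of the row.
  __m512 vmax = _mm512_setzero_ps();
  std::size_t i = 0;
  for (; i + 16 <= n; i += 16)
    vmax = _mm512_max_ps(vmax, _mm512_abs_ps(_mm512_loadu_ps(x + i)));
  float amax = _mm512_reduce_max_ps(vmax);
  for (; i < n; ++i)
    amax = std::max(amax, std::fabs(x[i]));

  // Pass 2: scale so that amax maps to 127, round, saturate, and narrow to int8.
  const float scale = amax != 0.f ? 127.f / amax : 1.f;
  const __m512 vscale = _mm512_set1_ps(scale);
  for (i = 0; i + 16 <= n; i += 16) {
    const __m512i vi = _mm512_cvtps_epi32(_mm512_mul_ps(_mm512_loadu_ps(x + i), vscale));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(q + i), _mm512_cvtsepi32_epi8(vi));
  }
  for (; i < n; ++i)
    q[i] = static_cast<std::int8_t>(std::lrintf(x[i] * scale));
  *scale_out = scale;  // caller dequantizes with x ≈ q / scale
}

A full kernel would also handle batches of rows and the weight-side scales, but these two passes (absolute maximum, then scale-and-narrow) are the vectorizable core.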

In actual model inference, I observed an even bigger improvement in the total quantization duration of the Whisper encoder:

  • Previous implementation: Convolutions take ~53ms, of which ~35ms are quantizations.
  • Vectorized implementation: Convolutions take ~20ms, of which ~2ms are quantizations.

(These are measured on Samsung S21)

In conclusion, this improvement speeds up convolution inference by 2x compared to the f32 GEMM implementation in #1597. Hence the combined speed-up compared to the implementation in master is about 30x for single-threaded inference on a Samsung S21.

Questions to the Reviewer

  • I didn't modify the is_quantizable implementations, so quantized Conv1D weights are not converted to the requested data type for now. I can add this easily with the diff below:
diff --git a/src/models/model.cc b/src/models/model.cc
--- a/src/models/model.cc	(revision 428750d0678ba92aad3969881360a56b821f071d)
+++ b/src/models/model.cc	(date 1705352005242)
@@ -176,8 +176,17 @@
 
         // Convert "weight" variables to the expected compute type.
         // Other float variables (e.g. biases) may be converted to another float type.
-        if (is_quantizable(name))
+        if (is_quantizable(name)) {
+          auto old_last_dim = -1;
+          if (variable.rank() == 3) {
+            old_last_dim = variable.dim(2);
+            variable.reshape({variable.dim(0), variable.dim(1) * variable.dim(2)});
+          }
           ensure_dtype(name, variable, weight_dtype);
+          if (old_last_dim != -1) {
+            variable.reshape({variable.dim(0), variable.dim(1) / old_last_dim, old_last_dim});
+          }
+        }
         else if (is_convertible(variable, name)
                  && is_float_type(variable.dtype())
                  && variable.dtype() != float_dtype)
  • Similarly, I didn't modify is_linear_weight. To my understanding, quantized Conv1D weights can be considered linear, but I don't fully know the impacts.
  • For DNNL and CUDA, I didn't implement quantized inference, so it throws a runtime exception. However, this may cause issues for users who converted their models with INT8 quantization and try to run them on these backends. Aside from implementing it for those backends, one option may be to let them convert INT8 models without quantized Conv1D weights. I could easily add that functionality with an environment variable check in Conv1DSpec.__init__, but I am not sure if it is the best solution. See the snippet below:
import os

class Conv1DSpec(model_spec.LayerSpec):
    def __init__(self):
        self.weight = None
        # Only declare a quantization scale when conv1d quantization is not disabled.
        if os.getenv("CTRANSLATE2_CONV1D_NO_QUANTIZE", "0") != "1":
            self.weight_scale = model_spec.OPTIONAL
        self.bias = model_spec.OPTIONAL

@ebraraktas ebraraktas force-pushed the perf/conv1d-quantization branch 2 times, most recently from fb8b004 to 067f413 on January 16, 2024 at 13:29
@ebraraktas ebraraktas force-pushed the perf/conv1d-quantization branch from 067f413 to 41737db on February 20, 2024 at 14:42
@minhthuc2502
Collaborator

Hello @ebraraktas, can you confirm whether the issue SYSTRAN/faster-whisper#716 is related to the recent PR #1597? According to my understanding, after this PR it does work with int8 quantization, even if it is not as fast as expected?

@minhthuc2502
Collaborator

minhthuc2502 commented Feb 26, 2024

  • For the first and second questions, I think you can filter the conv1d layers based on their names in order to do some preprocessing before converting; we need to keep the current implementation for other weights to avoid misunderstandings later.
  • For the 3rd question, why is it related to the DNNL and CUDA backends?

@ebraraktas
Contributor Author

> Hello @ebraraktas, can you confirm whether the issue SYSTRAN/faster-whisper#716 is related to the recent PR #1597? According to my understanding, after this PR it does work with int8 quantization, even if it is not as fast as expected?

I don't think #1597 causes the issue you mentioned, because it must output the same tensor as before, only faster. #1597 is implemented for float weights, and AFAIK Conv1D weights were float even if a model was running with `compute_type="INT8"`.

@ebraraktas
Contributor Author

> • For the first and second questions, I think you can filter the conv1d layers based on their names in order to do some preprocessing before converting; we need to keep the current implementation for other weights to avoid misunderstandings later.
> • For the 3rd question, why is it related to the DNNL and CUDA backends?

  • For 1-2: if it is OK for you, I will add the implementation you mentioned.
  • For the 3rd: I didn't implement quantized Conv1D inference for DNNL and CUDA. However, if one loads a model with the int8 compute type, the Conv1D weights are loaded as int8, too, and at inference time we cannot pass float Conv1D weights to the CUDA or DNNL implementations.

@minhthuc2502
Collaborator

I see. I think you can make conv1d quantization disabled by default; that prevents other users from being confused by a feature they don't use.
The solution of setting the environment variable is fine with me.

@ebraraktas ebraraktas force-pushed the perf/conv1d-quantization branch 2 times, most recently from 93fea0f to bfb6261 on March 24, 2024 at 15:13
@ebraraktas
Contributor Author

@minhthuc2502 With my last commit I implemented the following:

  • Conv weights are considered quantizable (and linear), too.
  • If the model is loaded on CUDA or WITH_DNNL is true, conv weights are converted to float_dtype instead of weight_dtype, as these backends don't support quantized inference. This load-time conversion allows one to use a quantized model with all backends, so we don't need the environment variable trick mentioned above (a minimal sketch of this decision follows below).
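
A minimal sketch of that load-time decision, with hypothetical names for the device enum, the DNNL flag, and the resolver function (this is not the actual model.cc code; in the real change the chosen dtype is applied at model load time):

// Illustrative only: isolates the dtype choice described above into a function.
enum class Device { CPU, CUDA };
enum class DataType { FLOAT32, FLOAT16, INT8 };

DataType resolve_conv_weight_dtype(Device device,
                                   bool with_dnnl,
                                   DataType weight_dtype,
                                   DataType float_dtype) {
  // CUDA and DNNL have no quantized Conv1D kernel, so conv weights fall back
  // to the float compute type on those backends.
  const bool supports_int8_conv = (device != Device::CUDA) && !with_dnnl;
  return supports_int8_conv ? weight_dtype : float_dtype;
}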

@minhthuc2502
Collaborator

LGTM! Thank you for your update.

@minhthuc2502 minhthuc2502 merged commit 8994330 into OpenNMT:master Mar 25, 2024
17 checks passed
@ebraraktas ebraraktas deleted the perf/conv1d-quantization branch March 26, 2024 10:03
@homink
Contributor

homink commented Jul 23, 2024

@ebraraktas This is amazing work! I wonder if you have plans to add groups to Conv1D, which is missing compared to PyTorch's Conv1d. I am browsing other ASR models that could be supported in CTranslate2, and the missing groups argument in Conv1D is a bummer.
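
For context, PyTorch's groups argument partitions the channels into independent groups, so each output channel only convolves over in_channels / groups input channels (depthwise convolution being the groups == in_channels case). A naive reference sketch of those semantics, illustrative only and not CTranslate2 code (no padding, stride 1):

#include <cstddef>
#include <vector>

// input:  [in_channels][time]
// weight: [out_channels][in_channels / groups][kernel]
// output: [out_channels][time - kernel + 1]
std::vector<std::vector<float>> conv1d_grouped(
    const std::vector<std::vector<float>>& input,
    const std::vector<std::vector<std::vector<float>>>& weight,
    std::size_t groups) {
  const std::size_t in_channels = input.size();
  const std::size_t out_channels = weight.size();
  const std::size_t kernel = weight[0][0].size();
  const std::size_t out_time = input[0].size() - kernel + 1;
  const std::size_t in_per_group = in_channels / groups;
  const std::size_t out_per_group = out_channels / groups;

  std::vector<std::vector<float>> output(out_channels,
                                         std::vector<float>(out_time, 0.f));
  for (std::size_t g = 0; g < groups; ++g)
    for (std::size_t oc = 0; oc < out_per_group; ++oc)     // output channel within group
      for (std::size_t t = 0; t < out_time; ++t)           // output time step
        for (std::size_t ic = 0; ic < in_per_group; ++ic)  // input channel within group
          for (std::size_t k = 0; k < kernel; ++k)
            output[g * out_per_group + oc][t] +=
                weight[g * out_per_group + oc][ic][k] * input[g * in_per_group + ic][t + k];
  return output;
}

An optimized implementation would run one GEMM per group (or a batched GEMM) instead of these loops, but this is enough to pin down the expected output.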

@ebraraktas
Contributor Author

@homink that seems doable; I think I have found a way to do it. However, my implementation is CPU-only, and I have to do some research for CUDA.

BTW, which model will you try when this is implemented?

@homink
Contributor

homink commented Jul 23, 2024

@ebraraktas I believe the most popular ASR models would be Whisper and Wav2Vec2 (including Wav2Vec2Bert, used in Meta's Seamless M4T). I made a PR for the Wav2Vec2 model here, but it only partially computes the transformer blocks due to the missing groups. I tried a work-around, implementing the depthwise convolution processing in src/layers/wav2vec2.cc as shown below, but it performs slowly; I think that is because it doesn't take advantage of CPU threading. The best way would be to let Conv1D work with groups. For GPU, I found some clues in the cuDNN documentation and made the attached diff, but I am not sure if it would work. You may want to have a look at it. I am also working on the Wav2Vec2Bert model to make an additional PR.

[Screenshot attachment: the proposed diff]
    Wav2Vec2Encoder::Wav2Vec2Encoder(const models::Model& model, const std::string& scope)
      : _feat_layer0(model, scope + "/feat_layer0")
      , _feat_layers(build_layers_list<const Wav2Vec2LayerNormConvLayer>(model,
                                                                        scope + "/feat_layer"))
      , _fp_norm(model, scope + "/fp_layer_norm")
      , _fp_ff(model, scope + "/fp_projection", nullptr, true)
      , _conv_layers(build_layers_list<const Wav2Vec2PosConvLayer>(model,
                                                                   scope + "/conv_layer"))
      , _num_heads(model.get_attribute_with_default<int32_t>(scope + "/num_heads", 8))
      , _transpose({0, 2, 1})
      , _layers(build_layers_list<const TransformerEncoderLayer>(model,
                                                                 scope + "/layer",
                                                                 _num_heads,
                                                                 /*pre_norm=*/true,
                                                                 ops::ActivationType::GELU))
      , _output_norm(model, scope + "/layer_norm")
      , _lm_head(model, scope + "/lm_head", nullptr, true)
    {
    }

    void Wav2Vec2Encoder::operator()(const StorageView& features, StorageView& output) {
      PROFILE("Wav2Vec2Encoder");

      if (features.rank() != 2)
        throw std::invalid_argument("Expected input features to have 2 dimensions, but got "
                                    + std::to_string(features.rank())
                                    + " dimension(s) instead");

      dim_t i, l;  // loop indices (group index and layer index)
      // Wav2Vec2FeatureExtractor-------------------------------------------------------------------------------------------------
      StorageView feat_buffer(features.dtype(), features.device());
      StorageView feat_buffer2(features.dtype(), features.device());
      feat_buffer = std::move(features);
      _feat_layer0(feat_buffer, output);
      feat_buffer = std::move(output);
      for (l = 0; l < _feat_layers.size(); l++) {
        (*_feat_layers[l])(feat_buffer, output);
        if (l < _feat_layers.size() - 1 ) {
          feat_buffer = std::move(output);
        }
      }
      _transpose(output, feat_buffer);
      //---------------------------------------------------------------------------------------------------------------------------

      // Wav2Vec2FeatureProjection-------------------------------------------------------------------------------------------------
      _fp_norm(feat_buffer, output); //{ 1, 66,  512}
      _fp_ff(output, feat_buffer);   //{ 1, 66, 1024}
      feat_buffer.expand_dims(0);
      //---------------------------------------------------------------------------------------------------------------------------

      // Wav2Vec2PositionalConvEmbedding-------------------------------------------------------------------------------------------
      _transpose(feat_buffer, feat_buffer2);       //{ 1, 1024, 66}
      std::vector<StorageView> splits;
      std::vector<StorageView> splits_buffer;
      const dim_t conv_groups = 16;

      // Create the output StorageView objects for each split.
      for (i = 0; i < conv_groups; ++i) {
        splits.emplace_back(StorageView(features.dtype(), features.device()));
        splits_buffer.emplace_back(StorageView(features.dtype(), features.device()));
      }

      // Create a vector of pointers to the splits for the Split operation.
      std::vector<StorageView*> split_pointers;
      std::vector<StorageView*> split_pointers_buffer;
      for (auto& split : splits) {
        split_pointers.push_back(&split);
      }
      for (auto& split_buffer : splits_buffer) {
        split_pointers_buffer.push_back(&split_buffer);
      }

      // Perform the split operation.
      ops::Split(1, std::vector<dim_t>(conv_groups, feat_buffer2.dim(1)/conv_groups))(feat_buffer2, split_pointers);

      // depthwise convolution
      for (l = 0; l < _conv_layers.size(); l++) {
        (*_conv_layers[l])(*split_pointers[l], *split_pointers_buffer[l]);
      }

      // concatenation
      std::vector<const StorageView*> const_split_pointers(split_pointers_buffer.begin(), split_pointers_buffer.end());
      ops::Concat(1)(const_split_pointers, feat_buffer2);

      _gelu(feat_buffer2, feat_buffer2);
      _transpose(feat_buffer2, output); // {1, 1024, 66} to {1, 66, 1024}
      ops::Add()(feat_buffer, output, feat_buffer2);
      //---------------------------------------------------------------------------------------------------------------------------

      // Wav2Vec2EncoderLayerStableLayerNorm---------------------------------------------------------------------------------------
      for (const auto& layer : _layers) {
        (*layer)(feat_buffer2, nullptr, feat_buffer);
        feat_buffer2 = std::move(feat_buffer);
      }
      _output_norm(feat_buffer2, feat_buffer);
      //---------------------------------------------------------------------------------------------------------------------------
      _lm_head(feat_buffer, output);
    }

  }

@homink
Contributor

homink commented Jul 23, 2024

@ebraraktas, I tried the GPU clue above, but it is not working. There must be something more I am not aware of.

@ebraraktas ebraraktas mentioned this pull request Jul 29, 2024