
perf: conv1d quantization #1601

Merged
merged 5 commits into OpenNMT:master from perf/conv1d-quantization
Mar 25, 2024

Conversation

ebraraktas
Contributor

@ebraraktas ebraraktas commented Jan 15, 2024

This PR adds quantized Conv1D inference on top of #1597.

With the previous int8 quantization implementation, this quantized inference couldn't bring any speed-up because quantization was the bottleneck. To alleviate that, I implemented vectorized versions of int8 quantization:

Improvement in benchmark_ops.cc (on 1500x1152 input, as in whisper-tiny/encoder/conv2):

ISA    | Avg. (master) | Avg. (vectorized) | Speed-up
AVX    | 0.22 ms       | 0.07 ms           | 3.1x
AVX2   | 0.22 ms       | 0.07 ms           | 3.1x
AVX512 | 0.22 ms       | 0.034 ms          | 6.5x
NEON   | 6.74 ms       | 1.13 ms           | 6.0x

For AVX and AVX2, I couldn't implement it as efficiently as for AVX512, hence the speed-up is smaller. The AVX tests were run on a Xeon W-2195 and the NEON tests on a Samsung S21.
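
Roughly, the vectorized per-row quantization kernel looks like the sketch below (AVX-512 variant); the function name, signature, and scaling convention are illustrative assumptions, not the actual CTranslate2 code:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Quantize one float row to int8 with a symmetric per-row scale (amax -> 127),
// processing 16 values per 512-bit register. Requires AVX-512F.
void quantize_row_s8_avx512(const float* x, std::size_t n,
                            std::int8_t* q, float* scale_out) {
  // Pass 1: absolute maximum of the row.
  __m512 vmax = _mm512_setzero_ps();
  std::size_t i = 0;
  for (; i + 16 <= n; i += 16)
    vmax = _mm512_max_ps(vmax, _mm512_abs_ps(_mm512_loadu_ps(x + i)));
  float amax = _mm512_reduce_max_ps(vmax);
  for (; i < n; ++i)
    amax = std::max(amax, std::fabs(x[i]));

  // Pass 2: scale so that amax maps to 127, round, saturate, and narrow to int8.
  const float scale = amax != 0.f ? 127.f / amax : 1.f;
  const __m512 vscale = _mm512_set1_ps(scale);
  for (i = 0; i + 16 <= n; i += 16) {
    const __m512i vi = _mm512_cvtps_epi32(_mm512_mul_ps(_mm512_loadu_ps(x + i), vscale));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(q + i), _mm512_cvtsepi32_epi8(vi));
  }
  for (; i < n; ++i)
    q[i] = static_cast<std::int8_t>(std::lrintf(x[i] * scale));
  *scale_out = scale;  // caller dequantizes with x ≈ q / scale
}

A full kernel would also handle batches of rows and the weight-side scales, but these two passes (absolute maximum, then scale-and-narrow) are the vectorizable core.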

In actual model inference, I observed an even bigger improvement in the total quantization duration of the Whisper encoder:

  • Previous implementation: Convolutions take ~53ms, of which ~35ms are quantizations.
  • Vectorized implementation: Convolutions take ~20ms, of which ~2ms are quantizations.

(These are measured on Samsung S21)

In conclusion, this improvement speeds up convolution inference by 2x compared to the f32 GEMM implementation in #1597. Hence the combined speed-up compared to the implementation in master is about 30x for single-threaded inference on a Samsung S21.

Questions to the Reviewer

  • I didn't modify the is_quantizable implementations, so quantized Conv1D weights are not converted to the requested data type for now. I can add this easily with the diff below:
diff --git a/src/models/model.cc b/src/models/model.cc
--- a/src/models/model.cc	(revision 428750d0678ba92aad3969881360a56b821f071d)
+++ b/src/models/model.cc	(date 1705352005242)
@@ -176,8 +176,17 @@
 
         // Convert "weight" variables to the expected compute type.
         // Other float variables (e.g. biases) may be converted to another float type.
-        if (is_quantizable(name))
+        if (is_quantizable(name)) {
+          auto old_last_dim = -1;
+          if (variable.rank() == 3) {
+            old_last_dim = variable.dim(2);
+            variable.reshape({variable.dim(0), variable.dim(1) * variable.dim(2)});
+          }
           ensure_dtype(name, variable, weight_dtype);
+          if (old_last_dim != -1) {
+            variable.reshape({variable.dim(0), variable.dim(1) / old_last_dim, old_last_dim});
+          }
+        }
         else if (is_convertible(variable, name)
                  && is_float_type(variable.dtype())
                  && variable.dtype() != float_dtype)
  • Similarly, I didn't modify is_linear_weight. To my understanding, quantized Conv1D weights can be considered linear, but I don't fully know the impacts.
  • For DNNL and CUDA, I didn't implement quantized inference, so it throws a runtime exception. However, this may cause issues for users who converted their models with INT8 quantization and try to run them on these backends. Aside from implementing it for those backends, one option may be to let them convert INT8 models without quantized Conv1D weights. I could easily add that functionality with an environment variable check in Conv1DSpec.__init__, but I am not sure if it is the best solution. See the snippet below:
import os

class Conv1DSpec(model_spec.LayerSpec):
    def __init__(self):
        self.weight = None
        # Only declare a quantization scale when conv1d quantization is not disabled.
        if os.getenv("CTRANSLATE2_CONV1D_NO_QUANTIZE", "0") != "1":
            self.weight_scale = model_spec.OPTIONAL
        self.bias = model_spec.OPTIONAL

@ebraraktas ebraraktas force-pushed the perf/conv1d-quantization branch 2 times, most recently from fb8b004 to 067f413 on January 16, 2024 at 13:29
@ebraraktas ebraraktas force-pushed the perf/conv1d-quantization branch from 067f413 to 41737db on February 20, 2024 at 14:42
@minhthuc2502
Collaborator

Hello @ebraraktas, can you confirm whether the issue SYSTRAN/faster-whisper#716 is related to the recent PR #1597? According to my understanding, after this PR it does work with int8 quantization, even if it is not as fast as expected?

@minhthuc2502
Collaborator

minhthuc2502 commented Feb 26, 2024

  • For the first and second questions, I think you can filter the conv1d layers based on their names in order to do some preprocessing before converting; we need to keep the current implementation for other weights to avoid misunderstandings later.
  • For the 3rd question, why is it related to the DNNL and CUDA backends?

@ebraraktas
Contributor Author

> Hello @ebraraktas, can you confirm whether the issue SYSTRAN/faster-whisper#716 is related to the recent PR #1597? According to my understanding, after this PR it does work with int8 quantization, even if it is not as fast as expected?

I don't think #1597 causes the issue you mentioned, because it must output the same tensor as before, only faster. #1597 is implemented for float weights, and AFAIK Conv1D weights were float even if a model was running with `compute_type="INT8"`.

@ebraraktas
Contributor Author

> • For the first and second questions, I think you can filter the conv1d layers based on their names in order to do some preprocessing before converting; we need to keep the current implementation for other weights to avoid misunderstandings later.
> • For the 3rd question, why is it related to the DNNL and CUDA backends?

  • For 1-2: if it is OK for you, I will add the implementation you mentioned.
  • For the 3rd: I didn't implement quantized Conv1D inference for DNNL and CUDA. However, if one loads a model with the int8 compute type, the Conv1D weights are loaded as int8, too, and at inference time we cannot pass float Conv1D weights to the CUDA or DNNL implementations.

@minhthuc2502
Collaborator

I see. I think you can make conv1d quantization disabled by default; that prevents other users from being confused by a feature they don't use.
The solution of setting the environment variable is fine with me.

@ebraraktas ebraraktas force-pushed the perf/conv1d-quantization branch 2 times, most recently from 93fea0f to bfb6261 on March 24, 2024 at 15:13
@ebraraktas
Contributor Author

@minhthuc2502 With my last commit I implemented the following:

  • Conv weights are considered quantizable (and linear), too.
  • If the model is loaded on CUDA or WITH_DNNL is true, conv weights are converted to float_dtype instead of weight_dtype, as these backends don't support quantized inference. This load-time conversion allows one to use a quantized model with all backends, so we don't need the environment variable trick mentioned above (a minimal sketch of this decision follows below).
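
A minimal sketch of that load-time decision, with hypothetical names for the device enum, the DNNL flag, and the resolver function (this is not the actual model.cc code; in the real change the chosen dtype is applied at model load time):

// Illustrative only: isolates the dtype choice described above into a function.
enum class Device { CPU, CUDA };
enum class DataType { FLOAT32, FLOAT16, INT8 };

DataType resolve_conv_weight_dtype(Device device,
                                   bool with_dnnl,
                                   DataType weight_dtype,
                                   DataType float_dtype) {
  // CUDA and DNNL have no quantized Conv1D kernel, so conv weights fall back
  // to the float compute type on those backends.
  const bool supports_int8_conv = (device != Device::CUDA) && !with_dnnl;
  return supports_int8_conv ? weight_dtype : float_dtype;
}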

@minhthuc2502
Collaborator

LGTM! Thank you for your update.

@minhthuc2502 minhthuc2502 merged commit 8994330 into OpenNMT:master Mar 25, 2024
17 checks passed
@ebraraktas ebraraktas deleted the perf/conv1d-quantization branch March 26, 2024 10:03
@homink
Contributor

homink commented Jul 23, 2024

@ebraraktas This is amazing work! I wonder if you have plans to add groups to Conv1D, which is missing compared to PyTorch's Conv1d. I am browsing other ASR models that could be supported in CTranslate2, and the missing groups argument in Conv1D is a bummer.
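
For context, PyTorch's groups argument partitions the channels into independent groups, so each output channel only convolves over in_channels / groups input channels (depthwise convolution being the groups == in_channels case). A naive reference sketch of those semantics, illustrative only and not CTranslate2 code (no padding, stride 1):

#include <cstddef>
#include <vector>

// input:  [in_channels][time]
// weight: [out_channels][in_channels / groups][kernel]
// output: [out_channels][time - kernel + 1]
std::vector<std::vector<float>> conv1d_grouped(
    const std::vector<std::vector<float>>& input,
    const std::vector<std::vector<std::vector<float>>>& weight,
    std::size_t groups) {
  const std::size_t in_channels = input.size();
  const std::size_t out_channels = weight.size();
  const std::size_t kernel = weight[0][0].size();
  const std::size_t out_time = input[0].size() - kernel + 1;
  const std::size_t in_per_group = in_channels / groups;
  const std::size_t out_per_group = out_channels / groups;

  std::vector<std::vector<float>> output(out_channels,
                                         std::vector<float>(out_time, 0.f));
  for (std::size_t g = 0; g < groups; ++g)
    for (std::size_t oc = 0; oc < out_per_group; ++oc)     // output channel within group
      for (std::size_t t = 0; t < out_time; ++t)           // output time step
        for (std::size_t ic = 0; ic < in_per_group; ++ic)  // input channel within group
          for (std::size_t k = 0; k < kernel; ++k)
            output[g * out_per_group + oc][t] +=
                weight[g * out_per_group + oc][ic][k] * input[g * in_per_group + ic][t + k];
  return output;
}

An optimized implementation would run one GEMM per group (or a batched GEMM) instead of these loops, but this is enough to pin down the expected output.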

@ebraraktas
Contributor Author

@homink that seems doable; I think I have found a way to do it. However, my implementation is CPU-only, and I have to do some research for CUDA.

BTW, which model will you try when this is implemented?

@homink
Contributor

homink commented Jul 23, 2024

@ebraraktas I believe the most popular ASR models would be Whisper and Wav2Vec2 (including Wav2Vec2Bert, used in Meta's Seamless M4T). I made a PR for the Wav2Vec2 model here, but it only partially computes the transformer blocks due to the missing groups. I tried a work-around, implementing the depthwise convolution processing in src/layers/wav2vec2.cc as shown below, but it performs slowly; I think that is because it doesn't take advantage of CPU threading. The best way would be to let Conv1D work with groups. For GPU, I found some clues in the cuDNN documentation and made the attached diff, but I am not sure if it would work. You may want to have a look at it. I am also working on the Wav2Vec2Bert model to make an additional PR.

[Screenshot attachment: the proposed diff]
    Wav2Vec2Encoder::Wav2Vec2Encoder(const models::Model& model, const std::string& scope)
      : _feat_layer0(model, scope + "/feat_layer0")
      , _feat_layers(build_layers_list<const Wav2Vec2LayerNormConvLayer>(model,
                                                                        scope + "/feat_layer"))
      , _fp_norm(model, scope + "/fp_layer_norm")
      , _fp_ff(model, scope + "/fp_projection", nullptr, true)
      , _conv_layers(build_layers_list<const Wav2Vec2PosConvLayer>(model,
                                                                   scope + "/conv_layer"))
      , _num_heads(model.get_attribute_with_default<int32_t>(scope + "/num_heads", 8))
      , _transpose({0, 2, 1})
      , _layers(build_layers_list<const TransformerEncoderLayer>(model,
                                                                 scope + "/layer",
                                                                 _num_heads,
                                                                 /*pre_norm=*/true,
                                                                 ops::ActivationType::GELU))
      , _output_norm(model, scope + "/layer_norm")
      , _lm_head(model, scope + "/lm_head", nullptr, true)
    {
    }

    void Wav2Vec2Encoder::operator()(const StorageView& features, StorageView& output) {
      PROFILE("Wav2Vec2Encoder");

      if (features.rank() != 2)
        throw std::invalid_argument("Expected input features to have 2 dimensions, but got "
                                    + std::to_string(features.rank())
                                    + " dimension(s) instead");

      dim_t i, l;  // loop indices (group index and layer index)
      // Wav2Vec2FeatureExtractor-------------------------------------------------------------------------------------------------
      StorageView feat_buffer(features.dtype(), features.device());
      StorageView feat_buffer2(features.dtype(), features.device());
      feat_buffer = std::move(features);
      _feat_layer0(feat_buffer, output);
      feat_buffer = std::move(output);
      for (l = 0; l < _feat_layers.size(); l++) {
        (*_feat_layers[l])(feat_buffer, output);
        if (l < _feat_layers.size() - 1 ) {
          feat_buffer = std::move(output);
        }
      }
      _transpose(output, feat_buffer);
      //---------------------------------------------------------------------------------------------------------------------------

      // Wav2Vec2FeatureProjection-------------------------------------------------------------------------------------------------
      _fp_norm(feat_buffer, output); //{ 1, 66,  512}
      _fp_ff(output, feat_buffer);   //{ 1, 66, 1024}
      feat_buffer.expand_dims(0);
      //---------------------------------------------------------------------------------------------------------------------------

      // Wav2Vec2PositionalConvEmbedding-------------------------------------------------------------------------------------------
      _transpose(feat_buffer, feat_buffer2);       //{ 1, 1024, 66}
      std::vector<StorageView> splits;
      std::vector<StorageView> splits_buffer;
      const dim_t conv_groups = 16;

      // Create the output StorageView objects for each split.
      for (i = 0; i < conv_groups; ++i) {
        splits.emplace_back(StorageView(features.dtype(), features.device()));
        splits_buffer.emplace_back(StorageView(features.dtype(), features.device()));
      }

      // Create a vector of pointers to the splits for the Split operation.
      std::vector<StorageView*> split_pointers;
      std::vector<StorageView*> split_pointers_buffer;
      for (auto& split : splits) {
        split_pointers.push_back(&split);
      }
      for (auto& split_buffer : splits_buffer) {
        split_pointers_buffer.push_back(&split_buffer);
      }

      // Perform the split operation.
      ops::Split(1, std::vector<dim_t>(conv_groups, feat_buffer2.dim(1)/conv_groups))(feat_buffer2, split_pointers);

      // depthwise convolution
      for (l = 0; l < _conv_layers.size(); l++) {
        (*_conv_layers[l])(*split_pointers[l], *split_pointers_buffer[l]);
      }

      // concatenation
      std::vector<const StorageView*> const_split_pointers(split_pointers_buffer.begin(), split_pointers_buffer.end());
      ops::Concat(1)(const_split_pointers, feat_buffer2);

      _gelu(feat_buffer2, feat_buffer2);
      _transpose(feat_buffer2, output); // {1, 1024, 66} to {1, 66, 1024}
      ops::Add()(feat_buffer, output, feat_buffer2);
      //---------------------------------------------------------------------------------------------------------------------------

      // Wav2Vec2EncoderLayerStableLayerNorm---------------------------------------------------------------------------------------
      for (const auto& layer : _layers) {
        (*layer)(feat_buffer2, nullptr, feat_buffer);
        feat_buffer2 = std::move(feat_buffer);
      }
      _output_norm(feat_buffer2, feat_buffer);
      //---------------------------------------------------------------------------------------------------------------------------
      _lm_head(feat_buffer, output);
    }

  }

@homink
Contributor

homink commented Jul 23, 2024

@ebraraktas, I tried the GPU clue above, but it is not working. There must be something more I am not aware of.

@ebraraktas ebraraktas mentioned this pull request Jul 29, 2024