
Merge pull request #11 from mobiusml/fw_compliance
Adding further changes before PR
Jiltseb authored May 24, 2024
2 parents 3f27636 + 2dde3c9 commit 8fd2ec0
Showing 16 changed files with 2,079 additions and 50 deletions.
23 changes: 15 additions & 8 deletions README.md
@@ -75,28 +75,35 @@ Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. T

GPU execution requires the following NVIDIA libraries to be installed:

- * [cuBLAS for CUDA 11](https://developer.nvidia.com/cublas)
- * [cuDNN 8 for CUDA 11](https://developer.nvidia.com/cudnn)
+ * [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
+ * [cuDNN 8 for CUDA 12](https://developer.nvidia.com/cudnn)

- There are multiple ways to install these libraries. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.
+ **Note**: The latest versions of `ctranslate2` support CUDA 12 only. For CUDA 11, the current workaround is downgrading to version `3.24.0` of `ctranslate2` (this can be done with `pip install --force-reinstall ctranslate2==3.24.0` or by pinning the version in a `requirements.txt`).

+ There are multiple ways to install the NVIDIA libraries mentioned above. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.

<details>
<summary>Other installation methods (click to expand)</summary>


+ **Note:** For all the methods below, keep in mind the note above regarding CUDA versions. Depending on your setup, you may need to install the _CUDA 11_ versions of the libraries that correspond to the CUDA 12 libraries listed in the instructions below.

#### Use Docker

- The libraries are installed in this official NVIDIA Docker image: `nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04`.
+ The libraries (cuBLAS, cuDNN) are installed in these official NVIDIA CUDA Docker images: `nvidia/cuda:12.0.0-runtime-ubuntu20.04` or `nvidia/cuda:12.0.0-runtime-ubuntu22.04`.

#### Install with `pip` (Linux only)

On Linux these libraries can be installed with `pip`. Note that `LD_LIBRARY_PATH` must be set before launching Python.

```bash
- pip install nvidia-cublas-cu11 nvidia-cudnn-cu11
+ pip install nvidia-cublas-cu12 nvidia-cudnn-cu12

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```

+ **Note**: Version 9+ of `nvidia-cudnn-cu12` appears to cause issues due to its reliance on cuDNN 9 (Faster-Whisper does not currently support cuDNN 9). Ensure that the installed version of the Python package provides cuDNN 8.
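
After installing the packages and exporting `LD_LIBRARY_PATH`, one quick sanity check (not part of the original README) is to ask CTranslate2 whether it can find the GPU libraries; the snippet below assumes `ctranslate2` is already installed.

```python
# Minimal sanity check: confirms that the installed ctranslate2 build can load
# the CUDA libraries and detect at least one CUDA device.
from importlib.metadata import version

import ctranslate2

print("ctranslate2 version:", version("ctranslate2"))
print("CUDA devices visible:", ctranslate2.get_cuda_device_count())
```

A device count of 0 usually means the libraries are not on `LD_LIBRARY_PATH` or that the installed `ctranslate2` build does not match your CUDA version.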

#### Download the libraries from Purfview's repository (Windows & Linux)

Purfview's [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides the required NVIDIA libraries for Windows & Linux in a [single archive](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Decompress the archive and place the libraries in a directory included in the `PATH`.
@@ -162,7 +169,7 @@ segments = list(segments) # The transcription will actually run here.

### multi-segment language detection

- To directly use the model for improved language detection, following code snippet can be used:
+ To directly use the model for improved language detection, the following code snippet can be used:

```python
from faster_whisper import WhisperModel
@@ -172,10 +179,10 @@ language_info = model.detect_language_multi_segment("audio.mp3")
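# Not part of the original diff: a fuller sketch of the same example, with an
# assumed model size and audio path, purely for illustration.
model = WhisperModel("medium", device="cuda", compute_type="float16")

# detect_language_multi_segment() samples several segments spread across the
# audio instead of relying only on the beginning, which makes detection more robust.
language_info = model.detect_language_multi_segment("audio.mp3")
print(language_info)  # inspect the detected-language information
```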

### Batched faster-whisper

- The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX) licensed under the BSD-4 Clause license. This product includes software developed by Max Bain. We modify this implementation and also added kaldi-based feature extraction. It improves the speed upto 10-12x compared to openAI implementation. It works by transcribing semantically meaningful audio chunks as batches leading to faster inference.

+ The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX), licensed under the BSD-4 Clause license. This product includes software developed by Max Bain. We modified this implementation and added kaldi-based feature extraction. It improves speed by up to 10-12x compared to the OpenAI implementation and 3-4x compared to the sequential faster-whisper version. It works by transcribing semantically meaningful audio chunks as batches, leading to faster inference.

- The following code snippet illustrates how to run inference with batched version on a specified audio file. Please also refer to the test scripts of batched faster whisper.
+ The following code snippet illustrates how to run inference with the batched version on an example audio file. Please also refer to the test scripts of batched faster-whisper.

```python
from faster_whisper import BatchedInferencePipeline
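
# The diff view is truncated at this point, so the rest of the original example
# is not shown. What follows is a hedged sketch of typical batched usage; the
# model size, audio path, batch size, and exact call signature are assumptions,
# not taken verbatim from this commit.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```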
Binary file added benchmark/benchmark.m4a
94 changes: 94 additions & 0 deletions benchmark/memory_benchmark.py
@@ -0,0 +1,94 @@
import argparse
import time

from typing import Callable

import py3nvml.py3nvml as nvml

from memory_profiler import memory_usage
from utils import MyThread, get_logger, inference

logger = get_logger("faster-whisper")
parser = argparse.ArgumentParser(description="Memory benchmark")
parser.add_argument(
    "--gpu_memory", action="store_true", help="Measure GPU memory usage"
)
parser.add_argument("--device-index", type=int, default=0, help="GPU device index")
parser.add_argument(
    "--interval",
    type=float,
    default=0.5,
    help="Interval at which measurements are collected",
)
args = parser.parse_args()
device_idx = args.device_index
interval = args.interval


def measure_memory(func: Callable[[], None]):
    if args.gpu_memory:
        logger.info(
            "Measuring maximum GPU memory usage on GPU device."
            " Make sure to not have additional processes running on the same GPU."
        )
        # init nvml
        nvml.nvmlInit()
        handle = nvml.nvmlDeviceGetHandleByIndex(device_idx)
        gpu_name = nvml.nvmlDeviceGetName(handle)
        gpu_memory_limit = nvml.nvmlDeviceGetMemoryInfo(handle).total >> 20
        gpu_power_limit = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
        info = {"gpu_memory_usage": [], "gpu_power_usage": []}

        def _get_gpu_info():
            while True:
                info["gpu_memory_usage"].append(
                    nvml.nvmlDeviceGetMemoryInfo(handle).used >> 20
                )
                info["gpu_power_usage"].append(
                    nvml.nvmlDeviceGetPowerUsage(handle) / 1000
                )
                time.sleep(interval)

                if stop:
                    break

            return info

        stop = False
        thread = MyThread(_get_gpu_info, params=())
        thread.start()
        func()
        stop = True
        thread.join()
        result = thread.get_result()

        # shutdown nvml
        nvml.nvmlShutdown()
        max_memory_usage = max(result["gpu_memory_usage"])
        max_power_usage = max(result["gpu_power_usage"])
        print("GPU name: %s" % gpu_name)
        print("GPU device index: %s" % device_idx)
        print(
            "Maximum GPU memory usage: %dMiB / %dMiB (%.2f%%)"
            % (
                max_memory_usage,
                gpu_memory_limit,
                (max_memory_usage / gpu_memory_limit) * 100,
            )
        )
        print(
            "Maximum GPU power usage: %dW / %dW (%.2f%%)"
            % (
                max_power_usage,
                gpu_power_limit,
                (max_power_usage / gpu_power_limit) * 100,
            )
        )
    else:
        logger.info("Measuring maximum increase of memory usage.")
        max_usage = memory_usage(func, max_usage=True, interval=interval)
        print("Maximum increase of RAM memory usage: %d MiB" % max_usage)


if __name__ == "__main__":
    measure_memory(inference)
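
The script above relies on `MyThread`, `get_logger`, and `inference` imported from a `utils` module that is part of the same PR but not shown in this excerpt. For context, a minimal sketch of what the `MyThread` helper presumably looks like (a `threading.Thread` subclass that stores its target's return value so that `thread.get_result()` works) is given below; the exact implementation in `utils.py` may differ.

```python
import threading


class MyThread(threading.Thread):
    """Assumed helper: a thread that remembers the return value of its target."""

    def __init__(self, func, params):
        super().__init__()
        self.func = func
        self.params = params
        self.result = None

    def run(self):
        # Store whatever the target returns so the caller can retrieve it later.
        self.result = self.func(*self.params)

    def get_result(self):
        return self.result
```

With such a helper, `thread.get_result()` returns the samples collected by `_get_gpu_info` once the polling loop exits. Note that the `>> 20` shifts convert NVML's byte counts to MiB, and NVML reports power in milliwatts, hence the divisions by 1000. The script can be run as, e.g., `python benchmark/memory_benchmark.py --gpu_memory` to measure GPU usage, or without the flag to measure RAM usage via `memory_profiler`.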
