
Merge pull request #11 from mobiusml/fw_compliance
Adding further changes before PR
Jiltseb authored May 24, 2024
2 parents 3f27636 + 2dde3c9 commit 8fd2ec0
Showing 16 changed files with 2,079 additions and 50 deletions.
23 changes: 15 additions & 8 deletions README.md
@@ -75,28 +75,35 @@ Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. T

GPU execution requires the following NVIDIA libraries to be installed:

- * [cuBLAS for CUDA 11](https://developer.nvidia.com/cublas)
- * [cuDNN 8 for CUDA 11](https://developer.nvidia.com/cudnn)
+ * [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
+ * [cuDNN 8 for CUDA 12](https://developer.nvidia.com/cudnn)

- There are multiple ways to install these libraries. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.
+ **Note**: The latest versions of `ctranslate2` support CUDA 12 only. For CUDA 11, the current workaround is downgrading to version `3.24.0` of `ctranslate2` (this can be done with `pip install --force-reinstall ctranslate2==3.24.0` or by pinning the version in a `requirements.txt`).

+ There are multiple ways to install the NVIDIA libraries mentioned above. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.

<details>
<summary>Other installation methods (click to expand)</summary>


+ **Note:** For all the methods below, keep in mind the note above regarding CUDA versions. Depending on your setup, you may need to install the _CUDA 11_ versions of the libraries that correspond to the CUDA 12 libraries listed in the instructions below.

#### Use Docker

- The libraries are installed in this official NVIDIA Docker image: `nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04`.
+ The libraries (cuBLAS, cuDNN) are installed in these official NVIDIA CUDA Docker images: `nvidia/cuda:12.0.0-runtime-ubuntu20.04` or `nvidia/cuda:12.0.0-runtime-ubuntu22.04`.

#### Install with `pip` (Linux only)

On Linux these libraries can be installed with `pip`. Note that `LD_LIBRARY_PATH` must be set before launching Python.

```bash
- pip install nvidia-cublas-cu11 nvidia-cudnn-cu11
+ pip install nvidia-cublas-cu12 nvidia-cudnn-cu12

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```

+ **Note**: Version 9+ of `nvidia-cudnn-cu12` appears to cause issues due to its reliance on cuDNN 9 (Faster-Whisper does not currently support cuDNN 9). Ensure that the installed version of the Python package provides cuDNN 8.
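
After installing the packages and exporting `LD_LIBRARY_PATH`, one quick sanity check (not part of the original README) is to ask CTranslate2 whether it can find the GPU libraries; the snippet below assumes `ctranslate2` is already installed.

```python
# Minimal sanity check: confirms that the installed ctranslate2 build can load
# the CUDA libraries and detect at least one CUDA device.
from importlib.metadata import version

import ctranslate2

print("ctranslate2 version:", version("ctranslate2"))
print("CUDA devices visible:", ctranslate2.get_cuda_device_count())
```

A device count of 0 usually means the libraries are not on `LD_LIBRARY_PATH` or that the installed `ctranslate2` build does not match your CUDA version.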

#### Download the libraries from Purfview's repository (Windows & Linux)

Purfview's [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides the required NVIDIA libraries for Windows & Linux in a [single archive](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Decompress the archive and place the libraries in a directory included in the `PATH`.
@@ -162,7 +169,7 @@ segments = list(segments) # The transcription will actually run here.

### multi-segment language detection

- To directly use the model for improved language detection, following code snippet can be used:
+ To directly use the model for improved language detection, the following code snippet can be used:

```python
from faster_whisper import WhisperModel
@@ -172,10 +179,10 @@ language_info = model.detect_language_multi_segment("audio.mp3")
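# Not part of the original diff: a fuller sketch of the same example, with an
# assumed model size and audio path, purely for illustration.
model = WhisperModel("medium", device="cuda", compute_type="float16")

# detect_language_multi_segment() samples several segments spread across the
# audio instead of relying only on the beginning, which makes detection more robust.
language_info = model.detect_language_multi_segment("audio.mp3")
print(language_info)  # inspect the detected-language information
```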

### Batched faster-whisper

- The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX) licensed under the BSD-4 Clause license. This product includes software developed by Max Bain. We modify this implementation and also added kaldi-based feature extraction. It improves the speed upto 10-12x compared to openAI implementation. It works by transcribing semantically meaningful audio chunks as batches leading to faster inference.

+ The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX), licensed under the BSD-4 Clause license. This product includes software developed by Max Bain. We modified this implementation and added kaldi-based feature extraction. It improves speed by up to 10-12x compared to the OpenAI implementation and 3-4x compared to the sequential faster-whisper version. It works by transcribing semantically meaningful audio chunks as batches, leading to faster inference.

- The following code snippet illustrates how to run inference with batched version on a specified audio file. Please also refer to the test scripts of batched faster whisper.
+ The following code snippet illustrates how to run inference with the batched version on an example audio file. Please also refer to the test scripts of batched faster-whisper.

```python
from faster_whisper import BatchedInferencePipeline
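
# The diff view is truncated at this point, so the rest of the original example
# is not shown. What follows is a hedged sketch of typical batched usage; the
# model size, audio path, batch size, and exact call signature are assumptions,
# not taken verbatim from this commit.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```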
Binary file added benchmark/benchmark.m4a
94 changes: 94 additions & 0 deletions benchmark/memory_benchmark.py
@@ -0,0 +1,94 @@
import argparse
import time

from typing import Callable

import py3nvml.py3nvml as nvml

from memory_profiler import memory_usage
from utils import MyThread, get_logger, inference

logger = get_logger("faster-whisper")
parser = argparse.ArgumentParser(description="Memory benchmark")
parser.add_argument(
    "--gpu_memory", action="store_true", help="Measure GPU memory usage"
)
parser.add_argument("--device-index", type=int, default=0, help="GPU device index")
parser.add_argument(
    "--interval",
    type=float,
    default=0.5,
    help="Interval at which measurements are collected",
)
args = parser.parse_args()
device_idx = args.device_index
interval = args.interval


def measure_memory(func: Callable[[], None]):
    if args.gpu_memory:
        logger.info(
            "Measuring maximum GPU memory usage on GPU device."
            " Make sure to not have additional processes running on the same GPU."
        )
        # init nvml
        nvml.nvmlInit()
        handle = nvml.nvmlDeviceGetHandleByIndex(device_idx)
        gpu_name = nvml.nvmlDeviceGetName(handle)
        gpu_memory_limit = nvml.nvmlDeviceGetMemoryInfo(handle).total >> 20
        gpu_power_limit = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
        info = {"gpu_memory_usage": [], "gpu_power_usage": []}

        def _get_gpu_info():
            while True:
                info["gpu_memory_usage"].append(
                    nvml.nvmlDeviceGetMemoryInfo(handle).used >> 20
                )
                info["gpu_power_usage"].append(
                    nvml.nvmlDeviceGetPowerUsage(handle) / 1000
                )
                time.sleep(interval)

                if stop:
                    break

            return info

        stop = False
        thread = MyThread(_get_gpu_info, params=())
        thread.start()
        func()
        stop = True
        thread.join()
        result = thread.get_result()

        # shutdown nvml
        nvml.nvmlShutdown()
        max_memory_usage = max(result["gpu_memory_usage"])
        max_power_usage = max(result["gpu_power_usage"])
        print("GPU name: %s" % gpu_name)
        print("GPU device index: %s" % device_idx)
        print(
            "Maximum GPU memory usage: %dMiB / %dMiB (%.2f%%)"
            % (
                max_memory_usage,
                gpu_memory_limit,
                (max_memory_usage / gpu_memory_limit) * 100,
            )
        )
        print(
            "Maximum GPU power usage: %dW / %dW (%.2f%%)"
            % (
                max_power_usage,
                gpu_power_limit,
                (max_power_usage / gpu_power_limit) * 100,
            )
        )
    else:
        logger.info("Measuring maximum increase of memory usage.")
        max_usage = memory_usage(func, max_usage=True, interval=interval)
        print("Maximum increase of RAM memory usage: %d MiB" % max_usage)


if __name__ == "__main__":
    measure_memory(inference)
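
The script above relies on `MyThread`, `get_logger`, and `inference` imported from a `utils` module that is part of the same PR but not shown in this excerpt. For context, a minimal sketch of what the `MyThread` helper presumably looks like (a `threading.Thread` subclass that stores its target's return value so that `thread.get_result()` works) is given below; the exact implementation in `utils.py` may differ.

```python
import threading


class MyThread(threading.Thread):
    """Assumed helper: a thread that remembers the return value of its target."""

    def __init__(self, func, params):
        super().__init__()
        self.func = func
        self.params = params
        self.result = None

    def run(self):
        # Store whatever the target returns so the caller can retrieve it later.
        self.result = self.func(*self.params)

    def get_result(self):
        return self.result
```

With such a helper, `thread.get_result()` returns the samples collected by `_get_gpu_info` once the polling loop exits. Note that the `>> 20` shifts convert NVML's byte counts to MiB, and NVML reports power in milliwatts, hence the divisions by 1000. The script can be run as, e.g., `python benchmark/memory_benchmark.py --gpu_memory` to measure GPU usage, or without the flag to measure RAM usage via `memory_profiler`.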
