# Add Whisper scripts (microsoft#17043)
### Description
This PR adds benchmark scripts for Whisper. It is a follow-up to [the PR](microsoft#17020) that added the LLaMA scripts.

### Motivation and Context
This PR enables benchmarking Whisper across various configurations.
kunal-vaishnavi authored Aug 23, 2023
1 parent 5842144 commit 4b3477f
Showing 5 changed files with 1,138 additions and 14 deletions.
**onnxruntime/python/tools/transformers/models/whisper/README.md** (149 additions, 10 deletions)

## Exporting Whisper with Beam Search

There are several ways to export Whisper with beam search (using Whisper tiny as an example).

### Option 1: from convert_to_onnx

```
# From source
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format
# From wheel
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format
```

### Option 2: end-to-end model from [Olive](https://github.com/microsoft/Olive/tree/main/examples/whisper)

Please follow the [README instructions](https://github.com/microsoft/Olive/tree/main/examples/whisper#prerequisites) in Olive.
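
For reference, here is a rough sketch of that flow. The script and config names below come from the Olive example at the time of writing and may have changed, so treat Olive's README as authoritative:

```
$ git clone https://github.com/microsoft/Olive
$ cd Olive/examples/whisper
$ python3 -m pip install olive-ai
$ python3 -m pip install -r requirements.txt
$ python3 prepare_whisper_configs.py --model_name openai/whisper-tiny
$ python3 -m olive.workflows.run --config whisper_cpu_fp32.json
```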

### Option 3: from [Hugging Face Optimum](https://github.com/huggingface/optimum)

Run the following Python code to export:

```
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_name = "openai/whisper-large-v2"
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    export=True,
)
model.save_pretrained(model_name.split("/")[-1] + "-onnx")
```
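
To sanity-check an exported model, something like the following should work (a minimal sketch: it assumes `librosa` and `transformers` are installed, and reuses the sample audio file from the benchmark section below):

```
import librosa
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = ORTModelForSpeechSeq2Seq.from_pretrained("whisper-large-v2-onnx")

# Whisper expects 16 kHz mono audio
audio, _ = librosa.load("1272-141231-0002.mp3", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```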

## Exporting + Optimizing + Quantizing Whisper with Beam Search

Here are some additional examples for exporting Whisper with beam search.
Export with Forced Decoder Input Ids
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --use_forced_decoder_ids
# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --use_forced_decoder_ids
```

Export + Optimize for FP32
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp32
# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp32
```

Export + Optimize for FP16 and GPU
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp16 --use_gpu --provider cuda
# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp16 --use_gpu --provider cuda
```

Export + Quantize for INT8
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --precision int8 --quantize_embedding_layer
# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --precision int8 --quantize_embedding_layer
```
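
After export, the beam-search model can be run directly with ONNX Runtime. The sketch below is illustrative only: the graph input and output names (`input_features`, `max_length`, ..., `sequences`) and the output file name are assumptions about this exporter, so inspect the exported model (e.g., with Netron) before relying on them:

```
import librosa
import numpy as np
import onnxruntime as ort
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
audio, _ = librosa.load("1272-141231-0002.mp3", sr=16000)
# Log-Mel spectrogram of shape (1, 80, 3000)
input_features = processor(audio, sampling_rate=16000, return_tensors="np").input_features

# Output file name is an assumption; check your --output directory
sess = ort.InferenceSession("whispertiny/whisper-tiny_beamsearch.onnx")
sequences = sess.run(
    ["sequences"],
    {
        "input_features": input_features.astype(np.float32),
        "max_length": np.array([448], dtype=np.int32),
        "min_length": np.array([0], dtype=np.int32),
        "num_beams": np.array([5], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32),
    },
)[0]
print(processor.batch_decode(sequences[0], skip_special_tokens=True))
```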

## Benchmark Whisper

Here are some examples of how you can benchmark Whisper across various end-to-end (E2E) implementations.

Note: in the examples below, `PyTorch` refers to running in PyTorch without `torch.compile`, and `PyTorch 2.0` refers to running in PyTorch with `torch.compile`.

### Variants

1. PyTorch (without `torch.compile`), FP32
```
python3 -m models.whisper.benchmark \
    --benchmark-type hf-pt \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --precision fp32 \
    --device cpu
```

2. PyTorch 2.0 (with `torch.compile`), FP16
```
python3 -m models.whisper.benchmark \
    --benchmark-type hf-pt2 \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --precision fp16 \
    --device cuda
```

3. Optimum + ONNX Runtime, FP32, export via Optimum
```
python3 -m models.whisper.benchmark \
    --benchmark-type hf-ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --hf-ort-model-path ./whisper-large-v2-onnx/ \
    --precision fp32 \
    --device cpu
```

4. ONNX Runtime, FP32, export via Olive or convert_to_onnx
```
python3 -m models.whisper.benchmark \
    --benchmark-type ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --ort-model-path ./wlarge-fp32/whisper-large-v2_beamsearch.onnx \
    --precision fp32 \
    --device cpu
```

5. ONNX Runtime, FP16, export via Olive or convert_to_onnx
```
python3 -m models.whisper.benchmark \
    --benchmark-type ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --ort-model-path ./wlarge-fp32/whisper-large_all.onnx \
    --precision fp16 \
    --device cuda
```

6. ONNX Runtime, INT8, export via Olive or convert_to_onnx
```
python3 -m models.whisper.benchmark \
    --benchmark-type ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --ort-model-path ./wlarge-fp32/whisper-large-v2_all.onnx \
    --precision fp32 \
    --device cpu
```

You can profile a variant by adding the `--profile` flag.
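
For example, profiling variant 4 from above:

```
python3 -m models.whisper.benchmark \
    --benchmark-type ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --ort-model-path ./wlarge-fp32/whisper-large-v2_beamsearch.onnx \
    --precision fp32 \
    --device cpu \
    --profile
```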

### Benchmark All

You can use `benchmark_all.py` to benchmark across various platforms and automatically store the results in a CSV file. Here is an example.

```
python3 -m models.whisper.benchmark_all \
    --audio-path ./whisper-test-audios/ \
    --hf-ort-model-path ./whisper-large-v2-onnx/ \
    --ort-model-path ./wlarge-fp32/whisper-large-v2_all.onnx \
    --model-name openai/whisper-large-v2 \
    --precision fp32 \
    --device cpu
```

### Benchmarking on NVIDIA A100

Here is a benchmark for an MP3 file with 20.7s of audio.
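
Note: the Real-Time Factor column appears to be the per-token latency converted to seconds (e.g., 4.697 ms/token → 0.004697) rather than the usual processing-time to audio-duration ratio.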

#### FP16

| Engine | Size | Per-Token Latency | Real-Time Factor |
| ------------- | -------- | ----------------- | ---------------- |
| PyTorch | Tiny | 4.697 ms/token | 0.004697 |
| PyTorch 2.0 | Tiny | 3.406 ms/token | 0.003406 |
| ONNX Runtime | Tiny | 0.746 ms/token | 0.000746 |
| PyTorch       | Medium   | 17.837 ms/token   | 0.017837         |
| PyTorch 2.0 | Medium | 18.124 ms/token | 0.018124 |
| ONNX Runtime | Medium | 3.894 ms/token | 0.003894 |
| PyTorch | Large v2 | 23.470 ms/token | 0.023470 |
| PyTorch 2.0 | Large v2 | 23.146 ms/token | 0.023146 |
| ONNX Runtime | Large v2 | 6.262 ms/token | 0.006262 |

#### FP32

| Engine | Size | Per-Token Latency | Real-Time Factor |
| ------------- | -------- | ----------------- | ---------------- |
| PyTorch | Tiny | 6.220 ms/token | 0.006220 |
| PyTorch 2.0 | Tiny | 3.944 ms/token | 0.003944 |
| ONNX Runtime | Tiny | 1.545 ms/token | 0.001545 |
| PyTorch | Medium | 19.093 ms/token | 0.019093 |
| PyTorch 2.0 | Medium | 20.459 ms/token | 0.020459 |
| ONNX Runtime | Medium | 9.440 ms/token | 0.009440 |
| PyTorch | Large v2 | 25.844 ms/token | 0.025844 |
| PyTorch 2.0 | Large v2 | 26.397 ms/token | 0.026397 |
| ONNX Runtime | Large v2 | 7.492 ms/token | 0.007492 |
From one of the other changed files:

```
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# --------------------------------------------------------------------------
import os
import sys

# Make sibling modules importable when this file is run as a script
sys.path.append(os.path.dirname(__file__))
```
