-
This means that the model is overflowing from VRAM into system RAM, which causes huge slowdowns.
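A rough back-of-the-envelope check makes the overflow plausible: the published parameter counts for the Whisper family are about 39M (tiny), 74M (base), 244M (small), 769M (medium), and 1550M (large), so the large models' weights alone approach 3 GiB in float16 and 6 GiB in float32, before the CUDA context, activations, and decoding cache are counted. A small sketch of the arithmetic (figures are approximate):

```python
# Approximate parameter counts for Whisper model sizes (published figures).
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large-v2": 1550e6,
}
BYTES_PER_PARAM = {"float16": 2, "float32": 4}

def weight_gb(model: str, dtype: str) -> float:
    """Size of the model weights alone, in GiB (excludes activations/cache)."""
    return PARAMS[model] * BYTES_PER_PARAM[dtype] / 1024**3

for m in PARAMS:
    print(f"{m:>9}: {weight_gb(m, 'float16'):5.2f} GiB fp16, "
          f"{weight_gb(m, 'float32'):5.2f} GiB fp32")
```

On a 6 GB card like the GTX 1660 Ti, large-v2's weights plus runtime overhead can easily exceed available VRAM, at which point the driver spills to system RAM and throughput collapses, while tiny through medium fit comfortably.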
-
I've been using faster-whisper for a while, and finally figured out how to use CUDA. Performance is much better (CUDA vs. CPU, same data file) with the tiny, base, small, and medium models, but much worse with large, large-v1, and large-v2.
The following data is for a 3:20 audio file. Times are in seconds to generate all segments.
So the smaller models show a 2-5x speedup with CUDA over CPU, while the large models take roughly 3x longer on CUDA than on CPU. I've tested this with multiple files and get essentially the same results.
I'm using Windows 11, an NVIDIA GTX 1660 Ti video card (it's old, but it's what I have), Python 3.11, and both faster-whisper 1.1.1 and 0.9.0.
So is this due to my old video card? Am I missing something fundamental? Do others see this too? Any thoughts or suggestions would be greatly appreciated.
Here's my code:
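(The original snippet did not survive this extract; what follows is a representative minimal faster-whisper transcription script, not the poster's actual code. The model name, audio file, and `compute_type` are placeholders.)

```python
from faster_whisper import WhisperModel

# Placeholder settings: model size, audio path, and compute_type are assumptions.
# On a 6 GB card, compute_type="int8_float16" or "int8" reduces memory use
# and may keep the large models from spilling out of VRAM.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus audio info.
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:6.2f} -> {segment.end:6.2f}] {segment.text}")
```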