Improving the efficiency of Whisper-based audio stream processing with CTranslate2 and FFMpeg tools

Radin, V.; Riabyi, M.; Радін, В.; Рябий, М.

Date

2026

Collections

Інформаційні технології та комп'ютерна інженерія. 2026. № 1 [13]

Abstract

The study addresses the need to improve the performance and scalability of automatic speech recognition (ASR) systems on resource-constrained devices. Its goal is to optimise Whisper by integrating CTranslate2 to accelerate inference and FFmpeg to standardise audio preprocessing. Experiments were run with the Whisper Turbo model on a graphics processing unit (GPU) supporting the Compute Unified Device Architecture (CUDA) platform. Three configurations were compared: the baseline Python pipeline, an optimised inference engine based on CTranslate2, and CTranslate2 with hybrid quantisation in the int8_float16 format. Efficiency was assessed by inference time, video memory (VRAM) usage, and recognition accuracy measured as Word Error Rate (WER). The baseline Whisper Turbo configuration achieved the highest recognition accuracy (WER = 0) but had high inference latency (8.5 s per audio file) and substantial VRAM consumption (4.9 GB). Integrating CTranslate2 reduced processing time to 4.9 s (a 1.7× speedup) and VRAM usage to 1.8 GB (-63%) with no loss of quality. Adding hybrid int8_float16 quantisation further reduced inference time to 3.8 s and memory consumption to 1 GB, an overall speedup of about 2.2× and an almost fivefold (4.9×) reduction in VRAM requirements relative to the standard implementation, with WER unchanged at 0. These results confirm that combining CTranslate2 with hybrid quantisation is effective for building high-performance real-time ASR systems without compromising accuracy.
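The relative gains quoted above follow directly from the reported timings and memory figures. The short Python sketch below recomputes them as a sanity check; the dictionary keys and the helper name `relative_gains` are illustrative and do not come from the paper.

```python
# Figures reported in the abstract: (inference time in s, video memory in GB).
# The configuration labels are illustrative names, not taken from the paper.
reported = {
    "baseline": (8.5, 4.9),
    "ctranslate2": (4.9, 1.8),
    "ctranslate2_int8_float16": (3.8, 1.0),
}


def relative_gains(baseline, variant):
    """Return (speedup factor, memory reduction factor, memory saving in %)."""
    base_time, base_mem = baseline
    var_time, var_mem = variant
    speedup = base_time / var_time
    mem_factor = base_mem / var_mem
    mem_saving_pct = (1 - var_mem / base_mem) * 100
    return speedup, mem_factor, mem_saving_pct


for name in ("ctranslate2", "ctranslate2_int8_float16"):
    speedup, mem_factor, saving = relative_gains(reported["baseline"], reported[name])
    print(f"{name}: {speedup:.1f}x faster, {mem_factor:.1f}x less memory ({saving:.0f}% saved)")
```

Rounded to one decimal place, the ratios agree with the 1.7× and 2.2× speedups, the 63% memory saving, and the 4.9× memory reduction stated in the abstract.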
The conclusions confirm the practical suitability of the proposed configuration for multi-user services and edge scenarios, with no trade-off between speed and accuracy. The results can be used by developers of automatic speech recognition systems to optimise models on memory-limited GPUs, and by companies providing streaming audio and multi-user services.
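For readers unfamiliar with the accuracy metric reported in the abstract, Word Error Rate is conventionally the word-level edit distance between the reference transcript and the recognised text, divided by the number of reference words. The following is a minimal Python sketch of that standard definition. It is not the authors' evaluation code, and it assumes plain whitespace tokenisation with no text normalisation, which the abstract does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenisation, no normalisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, WER = 0 means the recognised text matches the reference word for word, which is the outcome reported for all three configurations.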

URI:

https://ir.lib.vntu.edu.ua//handle/123456789/51810

Open

202696.pdf (3.272Mb)