Improving the efficiency of Whisper-based audio stream processing with CTranslate2 and FFMpeg tools

Radin, V.; Riabyi, M.; Радін, В.; Рябий, М.

dc.contributor.author	Radin, V.	en
dc.contributor.author	Riabyi, M.	en
dc.contributor.author	Радін, В.	uk
dc.contributor.author	Рябий, М.	uk
dc.date.accessioned	2026-06-12T08:26:55Z
dc.date.available	2026-06-12T08:26:55Z
dc.date.issued	2026
dc.identifier.citation	Radin V., Riabyi M. Improving the efficiency of Whisper-based audio stream processing with CTranslate2 and FFMpeg tools // Information Technologies and Computer Engineering. 2026. № 1. P. 110-124. URI: https://itce.vn.ua/uk/journals/t-23-1-2026/pidvishchennya-efektivnosti-obrobki-audiopotokiv-na-bazi-whisper-z-instrumentami-ctranslate2-ta-ffmpeg.	en
dc.identifier.issn	1999-9941
dc.identifier.uri	https://ir.lib.vntu.edu.ua//handle/123456789/51810
dc.description.abstract	The study is motivated by the need to improve the performance and scalability of automatic speech recognition systems on resource-constrained devices. Its goal is to optimise Whisper by integrating CTranslate2 to accelerate computation and FFmpeg for uniform preparation of audio data. Experiments were conducted with the Whisper Turbo model on a graphics processing unit (GPU) supporting the Compute Unified Device Architecture (CUDA) platform. Three configurations were compared: the baseline pipeline in Python, optimised inference via CTranslate2, and a configuration with hybrid quantisation in the int8_float16 format. Efficiency was evaluated by inference time, video memory (VRAM) use, and recognition accuracy measured as Word Error Rate (WER). The baseline Whisper Turbo configuration achieved WER = 0 but was characterised by high inference latency (8.5 s per audio file) and significant VRAM consumption (4.9 GB). Integrating CTranslate2 reduced the processing time to 4.9 s (1.7× speedup) and VRAM usage to 1.8 GB (-63%) without loss of quality. Adding hybrid int8_float16 quantisation further reduced inference time to 3.8 s and memory consumption to 1 GB, which corresponds to an overall speedup of about 2.2× and an almost fivefold (4.9×) reduction in VRAM requirements compared with the standard implementation, with WER unchanged at 0. The results confirmed the effectiveness of combining CTranslate2 with hybrid quantisation for building high-performance real-time automatic speech recognition systems without compromising accuracy. 
The conclusions confirmed the practical suitability of the proposed configuration for multi-user services and edge scenarios without a trade-off between speed and accuracy. The results can be used by developers of automatic speech recognition systems to optimise models on memory-limited GPUs, and by companies providing streaming audio and multi-user services.	en
dc.description.abstract	Актуальність дослідження полягає в необхідності підвищити продуктивність і масштабованість систем автоматичного розпізнавання мовлення на пристроях із обмеженими ресурсами, що обумовлює мету роботи – оптимізувати Whisper за допомогою інтеграції CTranslate2 для прискорення обчислень та FFmpeg для уніфікованої підготовки аудіоданих. Експериментальні дослідження проводилися з використанням моделі Whisper Turbo на графічному процесорі з підтримкою платформи обчислень Compute Unified Device Architecture. Порівнювалися базовий конвеєр на мові програмування Python, оптимізований механізм виконання інференсу через CTranslate2 та конфігурація з гібридною квантизацією у форматі int8_float16. Ефективність оцінювалася за показниками часу виконання передбачення (інференсу), використання відеопам’яті та точності автоматичного розпізнавання мовлення (Word Error Rate). Експериментальні результати показали, що базова конфігурація Whisper Turbo забезпечувала максимальну точність розпізнавання (Word Error Rate = 0), однак характеризувалася високою затримкою інференсу (8,5 с на аудіофайл) і значним споживанням відеопам’яті (4,9 ГБ). Інтеграція CTranslate2 скоротила час обробки до 4,9 с (прискорення 1,7×) та зменшила використання Video Random Access Memory до 1,8 ГБ (-63 %) без втрати якості. Подальше застосування гібридної квантизації int8_float16 забезпечило зниження часу інференсу до 3,8 с і скорочення споживання пам’яті до 1 ГБ, що відповідає загальному прискоренню близько 2,2× та майже п’ятикратному (4,9×) зменшенню вимог до Video Random Access Memory порівняно зі стандартною реалізацією, при незмінному Word Error Rate = 0. Отримані результати підтвердили ефективність поєднання CTranslate2 і гібридної квантизації для побудови високопродуктивних систем Automatic Speech Recognition реального часу без компромісу в точності. 
Висновки підтвердили практичну придатність запропонованої конфігурації для багатокористувацьких сервісів і edge-сценаріїв без компромісу між швидкістю та точністю. Результати дослідження можуть бути використані розробниками систем автоматичного розпізнавання мовлення для оптимізації моделей на графічних процесорах з обмеженим обсягом пам’яті, компаніями, що надають потокові аудіо- та багатокористувацькі сервіси	uk
dc.language.iso	en_US	en_US
dc.publisher	ВНТУ	en
dc.relation.ispartof	Information Technologies and Computer Engineering. № 1 : 110-124.	en
dc.relation.uri	https://itce.vn.ua/uk/journals/t-23-1-2026/pidvishchennya-efektivnosti-obrobki-audiopotokiv-na-bazi-whisper-z-instrumentami-ctranslate2-ta-ffmpeg
dc.subject	квантизація	uk
dc.subject	автоматичне розпізнавання мовлення	uk
dc.subject	ф’юзування операторів	uk
dc.subject	відеопам’ять	uk
dc.subject	ресурсоефективність	uk
dc.subject	quantisation	en
dc.subject	automatic speech recognition	en
dc.subject	operator fusion	en
dc.subject	video memory	en
dc.subject	resource efficiency	en
dc.title	Improving the efficiency of Whisper-based audio stream processing with CTranslate2 and FFMpeg tools	en
dc.title.alternative	Підвищення ефективності обробки аудіопотоків на базі Whisper з інструментами CTranslate2 та FFMpeg	uk
dc.type	Article, professional native edition
dc.type	Article
dc.identifier.udc	004.934.2:004.42
dc.relation.references	Ala-Rantala, J. (2025). Low-latency voice-guided visual content generation using generative AI models. (Master’s thesis, Tampere University, Tampere, Finland).	en
dc.relation.references	Cao, Y. (2025). Performance evaluation of whisper-series speech transcription models on raspberry Pi. In Proceedings of the tenth ACM/IEEE symposium on edge computing (article number 59). New York: ACM. doi: 10.1145/3769102.3774244.	en
dc.relation.references	Chettiar, F.F., Lahrani, H., & Rathor, K. (2025). Multilingual video translation and speech synthesis: A deep learning approach for seamless language adaptation. In Proceedings of the international conference on interdisciplinary approaches in technology and management for social innovation (pp. 1-6). Gwalior: IEEE. doi: 10.1109/IATMSI64286.2025.10985230.	en
dc.relation.references	Ebrahimipour, S.M., Mozafari, S.H., Clark, J.J., Gross, W.J., & Meyer, B.H. (2025). Latency-aware pruning and quantization of self-supervised speech transformers for edge devices. ACM Transactions on Embedded Computing Systems. doi: 10.1145/3746638.	en
dc.relation.references	El Bahri, J., Kouissi, M., & Begdouri, M.A. (2025). Sustainable speech recognition: Energy, carbon, and performance comparison of whisper (base and large) and google speech-to-text V2 (Chirp/USM). In H. Gibet Tani, M. Kouissi, M. Ben Ahmed, B.A. Abdelhakim & L. Elaachak (Eds.), Energy-efficient algorithms and systems in computing: Optimizing performance and sustainability through advanced computational methods (pp. 213-226). Cham: Springer. doi: 10.1007/978-3-032-04114-2_14.	en
dc.relation.references	Feng, C., Lin, Y., Zhuo, S., Su, C., Ramakrishnan, R.K., Yuan, Z., & Zhang, X. (2025). Edge-ASR: Towards low-bit quantization of automatic speech recognition models. ArXiv. doi: 10.48550/arXiv.2507.07877.	en
dc.relation.references	Hung, N.T., Phuc, V.H., Dung, N.T., Duc, L.X., Nhu, M.T., & Van, P.T. (2025). Effwhis: A proposed efficient approach for speech-to-text streaming whisper. In Proceedings of the 7th international conference on knowledge and system engineering (pp. 1-6). Da Lat: IEEE. doi: 10.1109/KSE68178.2025.11309493.	en
dc.relation.references	Hwang, M.H., Shin, J., & Bang, J. (2026). V-APA: A voice-driven agentic process automation system. Computer Speech & Language, 99, article number 101938. doi: 10.1016/j.csl.2026.101938.	en
dc.relation.references	Kalhoro, M.M., & Masab, M. (2025). Light-weight online real-time ASR: A bit more attention is needed. Authorea Preprints. doi: 10.22541/au.174914695.58777421/v1.	en
dc.relation.references	Kasoju, A., & Vishwakarma, T. (2025). Optimizing transformer models for low-latency inference: Techniques, architectures, and code implementations. International Journal of Science and Research, 14, 857-866. doi: 10.21275/SR25409073105.	en
dc.relation.references	Khadse, S. (2025). Small language models and efficient AI: The future of sustainable, accessible intelligence a comprehensive analysis of model compression, edge deployment, and resource-efficient AI systems. SSRN. doi: 10.2139/ssrn.5664971.	en
dc.relation.references	Kim, S. (2024). Full stack approach for efficient deep learning inference. (Doctoral dissertation, University of California, Berkeley, USA).	en
dc.relation.references	Maurya, M., Zaheer, M., Mohammad, N., Siddiqui, S., Khan, M., & Akram, M. (2025). Speech recognition technologies: Design, challenges, and real-world applications. International Journal of Innovative Research in Computer Science and Technology, 13(3), 55-61. doi: 10.55524/ijircst.2025.13.3.9.	en
dc.relation.references	Menshawy, A., & Fahmy, M. (2025). LLMs in Enterprise: Design strategies, patterns, and best practices for large language model development. Birmingham: Packt Publishing Ltd.	en
dc.relation.references	Moslem, Y., Morán, J.J., Gonzalez-Gomez, M., Al Farouq, M.H., Abdou, F., & Deb, S. (2025). SpeechT: Findings of the first mentorship in speech translation. In Proceedings of machine translation summit 20th (Vol. 2, pp. 67-74). Geneva: European Association for Machine Translation.	en
dc.relation.references	Mrozek, Ye. (2024). Analysis of modern approaches to speech recognition tasks. Control Systems & Computers, 4(308), 39-49. doi: 10.15407/csc.2024.04.039.	en
dc.relation.references	Nakhod, O. (2025). Automatic recognition of Ukrainian speech based on deep learning. Collection of Scientific Papers “ΛΌГOΣ”, 24, 218-220. doi: 10.36074/logos-24.01.2025.043.	en
dc.relation.references	Orhon, A., Okan, A., Durmus, B., Nagengast, Z., & Pacheco, E. (2025). WhisperKit: On-device real-time ASR with billion-scale transformers. ArXiv. doi: 10.48550/arXiv.2507.10860.	en
dc.relation.references	Potocnik, V., Colagrande, L., Fischer, T., Bertaccini, L., Pagliari, D.J., Burrello, A., & Benini, L. (2024). Optimizing foundation model inference on a many-tiny-core open-source risc-v platform. IEEE Transactions on Circuits and Systems for Artificial Intelligence, 1(1), 37-52. doi: 10.1109/TCASAI.2024.3459412.	en
dc.relation.references	Rangappa, P., et al. (2025). Speech data selection for efficient ASR fine-tuning using domain classifier and pseudo-label filtering. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1-5). Hyderabad: IEEE. doi: 10.1109/ICASSP49660.2025.10888138.	en
dc.relation.references	Thorbecke, I., Zuluaga-Gomez, J.P., Villatoro-Tello, E., Kumar, S., Rangappa, P., Burdisso, S., Motlicek, P., Pandia, K., & Ganapathiraju, A. (2024). Fast streaming transducer ASR prototyping via knowledge distillation with whisper. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 16747-16762). Miami: Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.976.	en
dc.relation.references	Trabelsi, A., Werey, L., Warichet, S., & Helbert, E. (2024). Is noise reduction improving open-source ASR transcription engines quality? In Proceedings of the 16th international conference on agents and artificial intelligence (pp. 1221-1228). Rome: Science and Technology Publications. doi: 10.5220/0012457100003636.	en
dc.relation.references	Vergallo, R., Aprile, M., Cruz, L., Vadacca, R., & Mainetti, L. (2025). Large-scale evaluation of quantization for reducing the energy footprint of deep learning models. SSRN. doi: 10.2139/ssrn.5719661.	en
dc.relation.references	Wang, N., Liu, C.C., Venkataramani, S., Sen, S., Chen, C.Y., El Maghraoui, K., Srinivasan, V., & Chang, L. (2022). Deep compression of pre-trained transformer models. In Proceedings of the 36th international conference on neural information processing systems (pp. 14140-14154). New York: Curran Associates.	en
dc.relation.references	Wu, C., Pan, Y., Wu, H., & Ning, L. (2025). Integrating speech recognition into intelligent information systems: From statistical models to deep learning. Informatics, 12(4), article number 107. doi: 10.3390/informatics12040107.	en
dc.relation.references	Wu, X., Zhang, Y., & Feng, B. (2023). English pronunciation quality evaluation system based on continuous speech recognition technology for multi-terminal. Journal of Physics: Conference Series, 2632, article number 012024. doi: 10.1088/1742-6596/2632/1/012024.	en
dc.relation.references	Znotins, A., Gosko, D., & Gruzitis, N. (2025). LATE: Open source toolkit for Latvian and latgalian speech transcription. In Proceedings of the annual conference of the international speech communication association, INTERSPEECH (pp. 306-307). Rotterdam: ISCA.	en
dc.identifier.doi	https://doi.org/10.31649/vitce/1.2026.110
dc.identifier.orcid	https://orcid.org/0009-0009-1101-6888
dc.identifier.orcid	https://orcid.org/0000-0002-9651-9135

Files in this item

Name:: 202696.pdf
Size:: 3.272Mb
Format:: PDF

This item appears in the following Collection(s)

Інформаційні технології та комп'ютерна інженерія. 2026. № 1 [13]