Scaling Video Prediction with Spatio-Temporal Patches

Кулик, Л. Р.; Мокін, О. Б.; Kulyk, L. R.; Mokin, O. B.

dc.contributor.author	Кулик, Л. Р.	uk
dc.contributor.author	Мокін, О. Б.	uk
dc.contributor.author	Kulyk, L. R.	en
dc.contributor.author	Mokin, O. B.	en
dc.date.accessioned	2026-04-14T08:23:37Z
dc.date.available	2026-04-14T08:23:37Z
dc.date.issued	2025
dc.identifier.citation	Кулик Л. Р., Мокін О. Б. Масштабування прогнозування відео за допомогою просторово-часових патчів // Вісник Вінницького політехнічного інституту. 2025. № 5. С. 129-139. URI: https://visnyk.vntu.edu.ua/index.php/visnyk/article/view/3346.	uk
dc.identifier.issn	1997-9274
dc.identifier.uri	https://ir.lib.vntu.edu.ua//handle/123456789/51136
dc.description.abstract	The article presents a new architecture for video data processing, the Vision Byte Latent Transformer (V-BLT), which adapts the principles of successful byte-level language models to the visual modality. Unlike standard approaches based on fixed-size patching, which is computationally inefficient because it allocates resources uniformly regardless of the complexity of the visual content, V-BLT operates directly on the video byte stream. This avoids the information loss associated with prior tokenization and makes processing more flexible. The key contributions are the concept of spatiotemporal latent patches, the implementation of N-dimensional Rotary Positional Embeddings to preserve data coherence in the flattened byte stream, and a multi-level transformer architecture for hierarchical processing. To validate the hypothesis and test the model, a new synthetic dataset of rotating 2D and 3D shapes was developed for controlled evaluation of the model’s spatiotemporal reasoning capabilities. Experiments demonstrate that V-BLT effectively predicts future frames, achieving high scores on the MSE, SSIM, and PSNR metrics in comparison with ViViT and UNet3D while being more computationally efficient. By design, the architecture can generate per-pixel entropy maps that visualize prediction uncertainty and correlate with dynamically complex regions of the scene. This paves the way for dynamic, content-dependent, on-the-fly allocation of computational resources, a promising direction for creating more efficient and scalable foundation models for video analytics.	en
dc.description.abstract	Запропоновано нову архітектуру для обробки відеоданих, Vision Byte Latent Transformer (V-BLT), яка адаптує принципи успішних байт-рівневих мовних моделей до зорової модальності. На відміну від стандартних підходів, що використовують пакування фіксованого розміру (patching), які є обчислювально неефективними через рівномірний розподіл ресурсів незалежно від складності візуального контенту, V-BLT працює безпосередньо з потоком байтів відео. Це дозволяє уникнути втрати інформації, пов’язаної з попередньою токенізацією, та підвищити гнучкість обробки. Ключовими внесками роботи є розробка концепції просторово-часових латентних патчів, впровадження N-вимірних ротаційних позиційних вкладень для збереження когерентності даних у розгорнутому потоці байтів, та застосування багаторівневої трансформерної архітектури для ієрархічної обробки даних. Для валідації гіпотези та тестування моделі розроблено новий синтетичний набір даних з 2D та 3D фігурами, що обертаються, який дозволяє проводити контрольовану оцінку здатності моделі до просторово-часового мислення. Експериментально продемонстровано, що V-BLT ефективно прогнозує майбутні кадри, досягаючи високих показників за метриками MSE, SSIM та PSNR в порівнянні з ViViT та UNet3D, при цьому демонструючи вищу ефективність розрахунків. Розроблена архітектура згідно з дизайном має можливість генерувати піксельні карти ентропії, які візуалізують невизначеність прогнозу та корелюють з динамічно складними регіонами сцени. Це відкриває шлях до реалізації динамічного, залежного від контенту, розподілу обчислювальних ресурсів «на ходу», що є перспективним напрямком для створення ефективніших та масштабованих фундаментних моделей для відеоаналітики.	uk
dc.language.iso	uk_UA	uk_UA
dc.publisher	ВНТУ	uk
dc.relation.ispartof	Вісник Вінницького політехнічного інституту. № 5 : 129-139.	uk
dc.relation.uri	https://visnyk.vntu.edu.ua/index.php/visnyk/article/view/3346
dc.subject	машинне навчання	uk
dc.subject	нейронні мережі	uk
dc.subject	обробка природної мови	uk
dc.subject	трансформери	uk
dc.subject	комп’ютерний зір	uk
dc.subject	згорткова нейронна мережа	uk
dc.subject	варіаційний автоенкодер	uk
dc.subject	синтетичні дані	uk
dc.subject	оптимізація	uk
dc.subject	штучні нейронні мережі	uk
dc.subject	machine learning	en
dc.subject	neural networks	en
dc.subject	natural language processing (NLP)	en
dc.subject	transformers	en
dc.subject	computer vision (CV)	en
dc.subject	convolutional neural networks (CNN)	en
dc.subject	variational autoencoder (VAE)	en
dc.subject	synthetic data	en
dc.subject	optimization	en
dc.subject	artificial neural networks	en
dc.title	Масштабування прогнозування відео за допомогою просторово-часових патчів	uk
dc.title.alternative	Scaling Video Prediction with Spatio-Temporal Patches	en
dc.type	Article, professional native edition
dc.type	Article
dc.identifier.udc	004.054:[004.032.26+004.85]
dc.relation.references	A. Arnab, et al., “ViViT: A Video Vision Transformer,” in ArXiv e-prints, 2021. [Online]. Available: https://arxiv.org/abs/2103.15691 . Accessed: September 26, 2025.	en
dc.relation.references	A. Dosovitskiy, et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in ArXiv e-prints, 2020. [Online]. Available: https://arxiv.org/abs/2010.11929 . Accessed: September 26, 2025.	en
dc.relation.references	Z. Liu, et al., “Video Swin Transformer,” in ArXiv e-prints, 2022. [Online]. Available: https://arxiv.org/abs/2106.13230 .	en
dc.relation.references	A. Pagnoni, et al., “Byte Latent Transformer: Patches Scale Better than Tokens,” in ArXiv e-prints, 2024. [Online]. Available: https://arxiv.org/abs/2412.09871 . Accessed: September 26, 2025.	en
dc.relation.references	L. Xue, A. Barua, et al., “ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models,” ArXiv e-prints, 2021. [Online]. Available: https://arxiv.org/abs/2105.13626 . Accessed: September 26, 2025.	en
dc.relation.references	Л. Р. Кулик, і О. Б. Мокін, «Створення синтетичного набору даних для оцінювання архітектур нейромережевих моделей,» в Матеріали LIV науково-технічної конференції підрозділів ВНТУ, Вінниця, 24-27 березня 2025 р.	uk
dc.relation.references	G. Aleksandrowicz, and G. Barequet, “Counting polycubes without the dimensionality curse,” Discrete Mathematics, vol. 309, no. 13, pp. 4576-4583, 2009. https://doi.org/10.1016/j.disc.2009.02.023 . Accessed: September 26, 2025.	en
dc.relation.references	D. Tran, et al., “A Closer Look at Spatiotemporal Convolutions for Action Recognition,” in ArXiv e-prints, 2018. [Online]. Available: https://arxiv.org/abs/1711.11248 . Accessed: September 26, 2025.	en
dc.relation.references	W. Yan, et al., “VideoGPT: Video Generation using VQ-VAE and Transformers,” in ArXiv e-prints, 2021. [Online]. Available: https://arxiv.org/abs/2104.10157 . Accessed: September 26, 2025.	en
dc.relation.references	J. Ho, et al., “Video Diffusion Models,” in ArXiv e-prints, 2022. [Online]. Available: https://arxiv.org/abs/2204.03458 . Accessed: September 26, 2025.	en
dc.relation.references	A. Blattmann, et al., “Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models,” in ArXiv e-prints, 2023. [Online]. Available: https://arxiv.org/abs/2304.08818 . Accessed: September 26, 2025.	en
dc.relation.references	J. Su, et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding,” in ArXiv e-prints, 2021. [Online]. Available: https://arxiv.org/abs/2104.09864 . Accessed: September 26, 2025.	en
dc.relation.references	A. F. Bobick, and J. W. Davis, “The recognition of human movement using temporal templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, 2001.	en
dc.relation.references	Python Software Foundation, Python Language Reference, version 3.12. [Online]. Available: https://www.python.org . Accessed: September 26, 2025.	en
dc.relation.references	C. Sullivan, and B. E. A. Larson, PyVista: 3D plotting and mesh analysis through a streamlined interface for the Visualization Toolkit (VTK). [Online]. Available: https://pyvista.org . Accessed: September 26, 2025.	en
dc.relation.references	Simple Shape Dataset Toolbox GitHub. [Online]. Available: https://github.com/leo27heady/simple-shape-datasettoolbox. Accessed: September 26, 2025.	en
dc.relation.references	A. Vaswani, et al., “Attention Is All You Need,” in ArXiv e-prints, 2017. [Online]. Available: https://arxiv.org/abs/1706.03762 . Accessed: September 26, 2025.	en
dc.relation.references	I. Loshchilov, and F. Hutter, “Decoupled Weight Decay Regularization,” in ArXiv e-prints, 2017. [Online]. Available: https://arxiv.org/abs/1711.05101 . Accessed: September 26, 2025.	en
dc.relation.references	I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.	en
dc.relation.references	A. Paszke, et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, 2019, pp. 8024-8035.	en
dc.relation.references	Vision Byte Latent Transformer GitHub. [Online]. Available: https://github.com/leo27heady/visionBLT . Accessed: September 26, 2025.	en
dc.relation.references	W. Kay, et al., “The Kinetics Human Action Video Dataset,” in ArXiv e-prints, 2017. [Online]. Available: https://arxiv.org/abs/1705.06950 . Accessed: September 26, 2025.	en
dc.relation.references	K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild,” in ArXiv e-prints, 2012. [Online]. Available: https://arxiv.org/abs/1212.0402 . Accessed: September 26, 2025.	en
dc.relation.references	C. Tan, et al., “OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning,” in ArXiv e-prints, 2023. [Online]. Available: https://arxiv.org/abs/2306.11249 . Accessed: September 26, 2025.	en
dc.relation.references	Rope-Nd GitHub. [Online]. Available: https://github.com/limefax/rope-nd . Accessed: September 26, 2025.	en
dc.relation.references	O. Cicek, et al., “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” in ArXiv e-prints, 2016. [Online]. Available: https://arxiv.org/abs/1606.06650 . Accessed: September 26, 2025.	en
dc.relation.references	M. Havrylovych, and V. Danylov, “Research on hybrid transformer-based autoencoders for user biometric verification,” System Research and Information Technologies, no. 3, pp. 42-53, 2023. [Online]. Available: https://doi.org/10.20535/SRIT.2308-8893.2023.3.03 . Accessed: September 26, 2025.	en
dc.relation.references	V. Lytvyn, et al., “Detection of Similarity Between Images Based on Contrastive Language-Image Pre-Training Neural Network,” Machine Learning Workshop at CoLInS, 2024. [Online]. Available: https://doi.org/10.31110/COLINS/2024-1/008 . Accessed: September 26, 2025.	en
dc.identifier.doi	https://doi.org/10.31649/1997-9266-2025-182-5-129-139

Files in this item

Name: 191699.pdf
Size: 631.0Kb
Format: PDF

This item appears in the following collection(s)

Вісник Вінницького політехнічного інституту. 2025. № 5 [24]
