Prompt engineering for large language models in test case generation

Husakovskyi, A.; Гусаковський, А.

dc.contributor.author	Husakovskyi, A.	en
dc.contributor.author	Гусаковський, А.	uk
dc.date.accessioned	2026-06-11T12:23:01Z
dc.date.available	2026-06-11T12:23:01Z
dc.date.issued	2026
dc.identifier.citation	Husakovskyi, A. Prompt engineering for large language models in test case generation // Information Technologies and Computer Engineering. 2026. № 1. P. 22-34. URI: https://itce.vn.ua/uk/journals/t-23-1-2026/shvidka-rozrobka-velikikh-movnikh-modeley-dlya-generatsiyi-testovikh-vipadkiv.	en
dc.identifier.issn	1999-9941
dc.identifier.uri	https://ir.lib.vntu.edu.ua//handle/123456789/51794
dc.description.abstract	The relevance of the study is determined by the need to enhance the effectiveness of software testing, where the use of large language models and prompt engineering techniques opens new opportunities for the automated generation of high-quality test cases. The purpose of the study was to evaluate the effectiveness of prompt engineering strategies in test case generation by large language models. The methodology was based on a comparison of four prompt engineering techniques, namely zero-shot, few-shot, chain-of-thought, and role prompting, for unit test generation using the CodeLlama 2 and StarCoder language models in the PyTest and JUnit environments, with evaluation according to the criteria of code coverage, relevance, defect detection, and integration suitability. The analysis demonstrated that few-shot and role prompting provided the best balance between the quantity and quality of tests, with coverage of 85-100% and relevance of 88-95%, while chain-of-thought proved effective for complex logic and identified 16 of 20 embedded defects (80%), and zero-shot was limited to basic checks with coverage of 55-65% and accuracy of 70-75%. CodeLlama 2 demonstrated stable test generation with high consistency across repeated queries (90%), an average generation time of 16.2 s, and 52 tests per module, covering basic and complex scenarios, including edge cases and exceptions. StarCoder was faster (14.7 s) and generated 50 tests with slightly lower stability (87%) and reduced coverage of complex scenarios, which made it effective for rapid validation of basic functions. The highest levels of readability, modularity, and integration suitability for CI/CD pipelines were observed with role prompting, while few-shot ensured a strong balance between structured output and practical test readiness, and chain-of-thought and zero-shot exhibited specific limitations. 
The combined use of models and prompting strategies enables optimisation of the test generation process, enhancing the relevance and coverage of the tests and the effectiveness of automated testing. The results of the study may be applied in automated software testing, integration into continuous integration and delivery pipelines, and training of quality assurance engineers in effective test generation methods.	en
dc.description.abstract	Актуальність дослідження зумовлена потребою підвищення ефективності тестування програмного забезпечення, де використання великих мовних моделей і технік інженерії підказок відкриває нові можливості для автоматизованої генерації якісних тестових випадків. Метою дослідження було оцінити ефективність стратегій prompt engineering у генерації тестових випадків великими мовними моделями. Методологія базувалася на порівнянні чотирьох технік prompt engineering: zero-shot, few-shot, chain-of-thought та role prompting для генерації unit-тестів мовними моделями CodeLlama 2 та StarCoder у середовищі PyTest і JUnit із оцінкою за критеріями покриття коду, релевантності, дефектовиявлення та інтеграційної придатності. Аналіз показав, що few-shot та role prompting забезпечують найкращий баланс між кількістю та якістю тестів із покриттям 85-100 % та релевантністю 88-95 %, тоді як chain-of-thought ефективний для складної логіки й виявив 16 із 20 закладених дефектів (80%), а zero-shot обмежений базовими перевірками з покриттям 55-65 % та точністю 70-75 %. CodeLlama 2 продемонстрував стабільну генерацію тестів із високою узгодженістю повторних запитів (90%), середнім часом генерації 16,2 с та 52 тестами на модуль, охоплюючи базові та складні сценарії, включно з крайовими випадками та винятками. StarCoder був швидшим (14,7 с), генерував 50 тестів із трохи нижчою стабільністю (87 %) і меншим покриттям складних сценаріїв, що робило його ефективним для швидкої перевірки базових функцій. Найвища читабельність, модульність і інтеграційна придатність у CI/CD-конвеєри були за role prompting, тоді як few-shot забезпечував гарний баланс між структурованістю та практичною готовністю тестів, а chain-of-thought і zero-shot мали специфічні обмеження. Комбіноване використання моделей і стратегій prompting дозволяє оптимізувати процес генерації тестів, підвищуючи їхню релевантність, покриття та ефективність автоматизованого тестування. 
Результати дослідження можуть застосовуватися для автоматизованого тестування програмного забезпечення, інтеграції у конвеєри безперервної інтеграції та доставки та навчання інженерів з контролю якості ефективним методам генерації тестів	uk
dc.language.iso	en_US	en_US
dc.publisher	ВНТУ	uk
dc.relation.ispartof	Information Technologies and Computer Engineering. № 1 : 22-34.	en
dc.relation.uri	https://itce.vn.ua/uk/journals/t-23-1-2026/shvidka-rozrobka-velikikh-movnikh-modeley-dlya-generatsiyi-testovikh-vipadkiv
dc.subject	CodeLlama	uk
dc.subject	StarCoder	uk
dc.subject	підказки з нульовим результатом	uk
dc.subject	підказки з кількома результатами	uk
dc.subject	підказки ланцюжка думок	uk
dc.subject	підказки ролей	uk
dc.subject	інтеграція CI/CD	uk
dc.subject	zero-shot prompting	en
dc.subject	few-shot prompting	en
dc.subject	chain-of-thought prompting	en
dc.subject	role prompting	en
dc.subject	CI/CD integration	en
dc.title	Prompt engineering for large language models in test case generation	en
dc.title.alternative	Швидка розробка великих мовних моделей для генерації тестових випадків	uk
dc.type	Article, professional native edition
dc.type	Article
dc.identifier.udc	004.8:004.415.53
dc.relation.references	Adu, G. (2024). Artificial Intelligence in software testing: Test scenario and case generation with an AI model (gpt-3.5-turbo) using prompt engineering, fine-tuning and retrieval augmented generation techniques. (Master’s Thesis, University of Eastern Finland, Joensuu, Finland).	en
dc.relation.references	Alagarsamy, S., Tantithamthavorn, C., Takerngsaksiri, W., Arora, C., & Aleti, A. (2025). Enhancing large language models for text-to-testcase generation. Journal of Systems and Software, 230, article number 112531. doi: 10.1016/j.jss.2025.112531.	en
dc.relation.references	Alshahwan, N., Chheda, J., Finogenova, A., Gokkaya, B., Harman, M., Harper, I., Marginean, A., Sengupta, S., & Wang, E. (2024). Automated unit test improvement using large language models at Meta. In M. d’Amorim (Ed.), Companion proceedings of the 32nd ACM international conference on the foundations of software engineering (pp. 185-196). New York: Association for Computing Machinery. doi: 10.1145/3663529.3663839.	en
dc.relation.references	Anasuri, S. (2024). Prompt engineering best practices for code generation tools. International Journal of Emerging Trends in Computer Science and Information Technology, 5(1), 69-81. doi: 10.63282/3050-9246.IJETCSIT-V5I1P108.	en
dc.relation.references	Belzner, L., Gabor, T., & Wirsing, M. (2023). Large language model assisted software engineering: Prospects, challenges, and a case study. In B. Steffen (Ed.), Bridging the gap between AI and reality (pp. 355-374). Cham: Springer. doi: 10.1007/978-3-031-46002-9_23.	en
dc.relation.references	Cain, W. (2024). Prompting change: Exploring prompt engineering in large language model AI and its potential to transform education. TechTrends, 68(1), 47-57. doi: 10.1007/s11528-023-00896-0.	en
dc.relation.references	Chen, B., Zhang, Z., Langrené, N., & Zhu, S. (2025). Unleashing the potential of prompt engineering for large language models. Patterns, 6(6), article number 101260. doi: 10.1016/j.patter.2025.101260.	en
dc.relation.references	Clavié, B., Ciceu, A., Naylor, F., Soulié, G., & Brightwell, T. (2023). Large language models in the workplace: A case study on prompt engineering for job type classification. In E. Métais, F. Meziane, V. Sugumaran, W. Manning & S. Reiff-Marganiec (Eds.), Natural language processing and information systems (pp. 3-17). Cham: Springer. doi: 10.1007/978-3-031-35320-8_1.	en
dc.relation.references	Fan, A., Gokkaya, B., Harman, M., Lyubarskiy, M., Sengupta, S., Yoo, S., & Zhang, J.M. (2023). Large language models for software engineering: Survey and open problems. In Proceedings of the IEEE/ACM international conference on software engineering: Future of software engineering (pp. 31-53). Melbourne: IEEE. doi: 10.1109/ICSEFoSE59343.2023.00008.	en
dc.relation.references	Feng, S., & Chen, C. (2024). Prompting is all you need: Automated android bug replay with large language models. In A. Paiva & R. Abreu (Eds.), Proceedings of the 46th IEEE/ACM international conference on software engineering (article number 67). New York: Association for Computing Machinery. doi: 10.1145/3597503.3608137.	en
dc.relation.references	Gao, A. (2023). Prompt engineering for large language models. SSRN. doi: 10.2139/ssrn.4504303.	en
dc.relation.references	Grabb, D. (2023). The impact of prompt engineering in large language model performance: A psychiatric example. Journal of Medical Artificial Intelligence, 6, article number 20. doi: 10.21037/jmai-23-71.	en
dc.relation.references	Jiang, E., Olson, K., Toh, E., Molina, A., Donsbach, A., Terry, M., & Cai, C.J. (2022). PromptMaker: Prompt-based prototyping with large language models. In S. Barbosa, C. Lampe, C. Appert & D.A. Shamma (Eds.), CHI Conference on human factors in computing systems extended abstracts (article number 35). New York: Association for Computing Machinery. doi: 10.1145/3491101.3503564.	en
dc.relation.references	Levitskyi, S., & Mokin, V. (2025). Analysis of benchmark tests of large language models’ resilience to disinformation and various types of manipulation. Retrieved from http://ir.lib.vntu.edu.ua/handle/123456789/49249.	en
dc.relation.references	Lim, S., & Schmälzle, R. (2023). Artificial intelligence for health message generation: An empirical study using a large language model (LLM) and prompt engineers. Frontiers in Communication, 8, article number 1129082. doi: 10.3389/fcomm.2023.1129082.	en
dc.relation.references	Naimi, L., Manaouch, M., & Jakimi, A. (2024). A new approach for automatic test case generation from use case diagram using LLMs and prompt engineering. In Proceedings of the international conference on circuit, systems and communication (pp. 1-5). Fes: IEEE. doi: 10.1109/ICCSC62074.2024.10616548.	en
dc.relation.references	Nayyar, A., Vairamani, A.D., & Kaswan, K. (2025). Mastering prompt engineering: Deep insights for optimizing large language models (LLMs). London: Elsevier. doi: 10.1016/C2024-0-00708-4.	en
dc.relation.references	Novakovsky, A., & Yalovega, I. (2025). Categorisation of the capabilities of large language models of artificial intelligence. In Proceedings of the 29th international youth forum “Radio electronics and youth in the 21st century” (pp. 296-298). Kharkiv: Kharkiv National University of Radio Electronics.	en
dc.relation.references	Plein, L., Ouédraogo, W.C., Klein, J., & Bissyandé, T.F. (2024). Automatic generation of test cases based on bug reports: A feasibility study with large language models. In A. Paiva & R. Abreu (Eds.), Proceedings of the 2024 IEEE/ACM 46th international conference on software engineering: Companion proceedings (pp. 360-361). New York: Association for Computing Machinery. doi: 10.1145/3639478.3643119.	en
dc.relation.references	Pornprasit, C., & Tantithamthavorn, C. (2024). Fine-tuning and prompt engineering for large language models-based code review automation. Information and Software Technology, 175, article number 107523. doi: 10.1016/j.infsof.2024.107523.	en
dc.relation.references	Radcliffe, T., Lockhart, E., & Wetherington, J. (2024). Automated prompt engineering for semantic vulnerabilities in large language models. Authorea. doi: 10.22541/au.172348895.52207804/v1.	en
dc.relation.references	Sahoo, P., Singh, A.K., Saha, S., Jain, V., Mondal, S., & Chadha, A. (2024). A systematic survey of prompt engineering in large language models: Techniques and applications. ArXiv. doi: 10.48550/arXiv.2402.07927.	en
dc.relation.references	Schäfer, M., Nadi, S., Eghbali, A., & Tip, F. (2023). An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1), 85-105. doi: 10.1109/TSE.2023.3334955.	en
dc.relation.references	Strobelt, H., Webson, A., Sanh, V., Hoover, B., Beyer, J., Pfister, H., & Rush, A.M. (2022). Interactive and visual prompt engineering for ad-hoc task adaptation with large language models. IEEE Transactions on Visualization and Computer Graphics, 29(1), 1146-1156. doi: 10.1109/TVCG.2022.3209479.	en
dc.relation.references	Vatsal, S., & Dubey, H. (2024). A survey of prompt engineering methods in large language models for different NLP tasks. ArXiv. doi: 10.48550/arXiv.2407.12994.	en
dc.relation.references	Velásquez-Henao, J.D., Franco-Cardona, C.J., & Cadavid-Higuita, L. (2023). Prompt engineering: A methodology for optimizing interactions with AI-Language models in the field of engineering. Dyna, 90, 9-17.	en
dc.relation.references	Wang, C.Y. (2025). Application and optimization of prompt engineering techniques for code generation in large language models.	en
dc.relation.references	Wang, J., Huang, Y., Chen, C., Liu, Z., Wang, S., & Wang, Q. (2024). Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering, 50(4), 911-936. doi: 10.1109/TSE.2024.3368208.	en
dc.relation.references	Yurchak, I., Kychuk, O., Oksentyuk, V., & Khich, A. (2024). Prompting techniques for enhancing the use of large language models. Computer Systems and Networks, 6(2), 286-300. doi: 10.23939/csn2024.02.268.	en
dc.identifier.doi	https://doi.org/10.31649/vitce/1.2026.22
dc.identifier.orcid	https://orcid.org/0009-0007-9398-0966
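
The four prompting strategies compared in the abstract can be illustrated with a minimal sketch. The record does not reproduce the prompts used in the study, so everything below is an assumption made for illustration: the template wording, the helper names (`zero_shot`, `few_shot`, `chain_of_thought`, `role`), and the sample function `divide` are hypothetical and are not the author's materials.

```python
from textwrap import dedent

# Hypothetical function under test; it only fills the templates.
SOURCE = dedent('''
    def divide(a, b):
        if b == 0:
            raise ValueError("division by zero")
        return a / b
''').strip()

# Hypothetical worked example supplied to the few-shot template.
EXAMPLE = dedent('''
    def test_add_positive():
        assert add(2, 3) == 5
''').strip()

# Assumed task wording shared by all four strategies.
TASK = "Write PyTest unit tests for the following function."


def zero_shot(source: str) -> str:
    """Task description plus code, with no examples or extra guidance."""
    return f"{TASK}\n\n{source}"


def few_shot(source: str, examples: list[str]) -> str:
    """Prepend one or more example tests before the task."""
    shots = "\n\n".join(f"Example test:\n{e}" for e in examples)
    return f"{shots}\n\n{TASK}\n\n{source}"


def chain_of_thought(source: str) -> str:
    """Ask the model to reason about branches before writing tests."""
    steps = ("First list the branches, edge cases, and exceptions "
             "step by step, then write one test for each.")
    return f"{TASK}\n{steps}\n\n{source}"


def role(source: str) -> str:
    """Assign a persona before stating the task."""
    persona = ("You are a senior QA engineer who writes readable, "
               "modular tests for CI/CD pipelines.")
    return f"{persona}\n{TASK}\n\n{source}"


PROMPTS = {
    "zero-shot": zero_shot(SOURCE),
    "few-shot": few_shot(SOURCE, [EXAMPLE]),
    "chain-of-thought": chain_of_thought(SOURCE),
    "role": role(SOURCE),
}

if __name__ == "__main__":
    for name, prompt in PROMPTS.items():
        print(f"--- {name} ---\n{prompt}\n")
```

Each string in `PROMPTS` would then be sent to a code model; the model call is left out here because the record does not describe how the models were invoked.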

Files in this item

Name:: 202689.pdf
Size:: 785.0Kb
Format:: PDF

This item appears in the following Collection(s)

Інформаційні технології та комп'ютерна інженерія. 2026. № 1 [13]

Related items

Аналіз еталонних тестів стійкості великих мовних моделей до дезінформації та різних видів маніпуляцій

Підвищення захищеності корпоративних комп'ютерних мереж на основі AI-агентів для аналізу загроз у середовищі n8n та гібридного методу адаптивного реагування

Effective prompt engineering