Impact of the syntactic dependencies in the sentences on the quality of the identification of the toxic comments in the social networks

Shtovba, S.; Shtovba, O.; Yahymovych, O.; Petrychko, M.

dc.contributor.author	Штовба, С. Д.	uk
dc.contributor.author	Штовба, О. В.	uk
dc.contributor.author	Яхимович, О. В.	uk
dc.contributor.author	Петричко, М. В.	uk
dc.contributor.author	Shtovba, S.	en
dc.contributor.author	Shtovba, O.	en
dc.contributor.author	Yahymovych, O.	en
dc.contributor.author	Petrychko, M.	en
dc.date.accessioned	2020-09-15T09:51:56Z
dc.date.available	2020-09-15T09:51:56Z
dc.date.issued	2019
dc.identifier.citation	Вплив синтаксичних зв’язків у реченнях на якість ідентифікації токсичних коментарів в соціальній мережі [Електронний ресурс] / С. Д. Штовба, О. В. Штовба, О. В. Яхимович, М. В. Петричко // Наукові праці ВНТУ. – 2019. – № 4. – Режим доступу: https://praci.vntu.edu.ua/index.php/praci/article/view/578/548.	uk
dc.identifier.issn	2307-5376
dc.identifier.uri	http://ir.lib.vntu.edu.ua//handle/123456789/30485
dc.description.abstract	Соціальні мережі все частіше стають середовищем для погроз, образ та інших складових кібербулінгу. В онлайнових соціальних мережах задіяна величезна кількість людей, тому виникає потреба в автоматизації діяльності із захисту користувачів від антисоціального впливу. Одним із важливих напрямків такої діяльності є виявлення токсичних коментарів, що містять погрози, образи, зневагу до оточуючих тощо. Зазвичай ідентифікацію токсичних коментарів здійснюють за статистикою мішка слів та мішка символів. В статті досліджується вплив синтаксичних зв’язків у реченнях на якість ідентифікації токсичних коментарів в соціальній мережі. Під синтаксичними зв’язками розуміються зв'язки із власними назвами, з особовими займенниками, з присвійними займенниками тощо. Всього перевірено двадцять синтаксичних ознак речень. Встановлено, що додаткове врахування трьох специфічних ознак суттєво покращує якість ідентифікації токсичних коментарів. Цими трьома специфічними ознаками є такі: кількість зв'язків з власними назвами в однині, кількість зв'язків, в яких фігурують погані слова та кількість зв'язків між особовими займенниками та поганими словами. Експерименти проведено на основі даних із kaggle-змагання “Toxic Comment Classification Challenge”. Оригінальну kaggle-задачу категоризації токсичних коментарів було модифіковану у задачу класифікації з двома альтернативами: нейтральний коментар та токсичний коментар. Для наших експериментів оригінальну вибірку із 159751 коментарів скорочено до 106590 коментарів через проблеми з автоматичним виділенням синтаксичних ознак тексту. В модифікованій вибірці частка токсичних коментарів становить 12.8%. Для врахування незбалансованості вибірки даних метрикою якості обрано середнє значення частот помилок класифікації кожного типу. Класифікацію здійснено за допомогою дерева рішень. Дерева рішень синтезувалися за двох правил розщеплення: на основі індекса Джині та ентропійного критерію.	uk
dc.description.abstract	Social networks are increasingly becoming a medium for threats, insults and other components of cyberbullying. A huge number of people are involved in online social networks, so there is a need to automate the protection of users from antisocial behavior. One important task in this activity is the identification of toxic comments, which contain threats, insults, contempt for others, etc. Toxic comments are usually identified using bag-of-words and bag-of-characters statistics. The article studies the effect of syntactic dependencies in sentences on the quality of identification of toxic comments in a social network. Syntactic dependencies here are relationships with proper nouns, personal pronouns, possessive pronouns, etc. In total, 20 syntactic features of sentences were tested. The article shows that additionally taking three specific features into account significantly improves the quality of toxic comment identification. These three features are: the number of dependencies with proper nouns in the singular, the number of dependencies that contain bad words, and the number of dependencies between personal pronouns and bad words. The experiments are based on data from the Kaggle competition "Toxic Comment Classification Challenge". The original Kaggle task of categorizing toxic comments was modified into a classification task with two alternatives: a neutral comment and a toxic comment. For our experiments, the original dataset of 159751 comments was reduced to 106590 comments due to problems with the automatic extraction of syntactic features from the text. The share of toxic comments in the modified dataset is 12.8%. To account for the imbalance of the dataset, the mean of the error rates for each type of misclassification was chosen as the quality metric. A decision tree is used as the classifier. The decision trees were synthesized with two splitting rules: the Gini index and the entropy criterion.	en
dc.language.iso	uk_UA	uk_UA
dc.publisher	ВНТУ	uk
dc.relation.ispartof	Наукові праці ВНТУ. – 2019. – № 4.	uk
dc.relation.uri	https://praci.vntu.edu.ua/index.php/praci/article/view/578/548
dc.subject	аналіз тексту	uk
dc.subject	обробка природньої мови	uk
dc.subject	синтаксичні зв’язки	uk
dc.subject	токсичні коментарі	uk
dc.subject	соціальна мережа	uk
dc.subject	ідентифікація	uk
dc.subject	автоматичне навчання	uk
dc.subject	відбір ознак	uk
dc.subject	text mining	en
dc.subject	natural language processing	en
dc.subject	syntactic dependencies	en
dc.subject	toxic comments	en
dc.subject	social network	en
dc.subject	identification	en
dc.subject	machine learning	en
dc.subject	feature selection	en
dc.title	Вплив синтаксичних зв’язків у реченнях на якість ідентифікації токсичних коментарів в соціальній мережі	uk
dc.title.alternative	Impact of the syntactic dependencies in the sentences on the quality of the identification of the toxic comments in the social networks	en
dc.type	Article
dc.identifier.udc	004.855
dc.relation.references	Fine-Grained Classification of Offensive Language / J. Risch, E. Krebs, A. Löser [et al.] // Proc. of GermEval 2018, 14th Conference on Natural Language Processing. Vienna, Austria, 2018. – P. 38 – 44.	en
dc.relation.references	Anatomy of online hate: developing a taxonomy and machine learning models for identifying and classifying hate in online news media / J. Salminen, H. Almerekhi, M. Milenković [et al.] // Twelfth International AAAI Conference on Web and Social Media. – 2018. – P. 330 – 339.	en
dc.relation.references	Srivastava S. Identifying Aggression and Toxicity in Comments using Capsule Network / S. Srivastava, P. Khurana, V. Tewari // Proc. of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). – 2018. – P. 98 – 105.	en
dc.relation.references	Sood S. O. Using Crowdsourcing to Improve Profanity Detection / S. O. Sood, J. Antin, E. F. Churchill // Association for the Advancement of Artificial Intelligence. Spring Symposium: Wisdom of the Crowd. – 2012. – P. 69 – 74.	en
dc.relation.references	Mohammad F. Is preprocessing of text really worth your time for toxic comment classification? / F. Mohammad // Proc. of Inter. Conference on Artificial Intelligence. CSREA Press. – 2018. – P. 447 – 453.	en
dc.relation.references	Bisikalo O. Development of the method for filtering verbal noise while search keywords for the English text / O. Bisikalo, A. Yahimovich, Y. Yahimovich // Technology Audit and Production Reserves. – 2018. – № 6. – P. 33 – 41.	en
dc.relation.references	Toxic Comment Classification Challenge [Electronic resource]. – Available at: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.	en
dc.relation.references	Stop Illegal Comments: A Multi-Task Deep Learning Approach [Electronic resource] / A. Elnaggar, B. Waltl, I. Glaser [et al.] // Software Engineering for Business Information Systems, Technische Universität München, Germany. – 2018. – Available at: https://arxiv.org/pdf/1810.06665.pdf.	en
dc.relation.references	Kumar S. Antisocial Behavior on the Web: Characterization and Detection / S. Kumar, J. Cheng, J. Leskovec // Proceedings of the 26th International Conference on World Wide Web Companion. – International World Wide Web Conferences Steering Committee. – 2017. – P. 947 – 950.	en
dc.identifier.doi	https://doi.org/10.31649/2307-5376-2019-4-35-42
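
The quality metric named in the abstract (the mean of the error rates for each type of misclassification) and the two splitting rules (Gini index and entropy criterion) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the label encoding (1 = toxic, 0 = neutral) and the toy data in the usage note are assumptions introduced here.

```python
import math
from collections import Counter


def mean_error_rate(y_true, y_pred):
    """Mean of the two per-class error rates for a binary task.

    Assumed encoding (not from the article): 1 = toxic, 0 = neutral.
    The false negative rate is the share of toxic comments labelled
    neutral; the false positive rate is the share of neutral comments
    labelled toxic.
    """
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return 0.5 * (false_neg / positives + false_pos / negatives)


def gini(labels):
    """Gini impurity of the class labels in one tree node."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in one tree node."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

As a usage note: on a toy sample with 2 toxic and 8 neutral comments, a classifier that labels everything neutral is right on 80% of the comments, yet `mean_error_rate` returns 0.5, because every toxic comment is missed. This is why such a metric suits an unbalanced dataset. A decision tree picks the split that most reduces the chosen impurity (`gini` or `entropy`) in the child nodes.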
