Improving File Security through an Optimized Auto-Classification Approach Using Learning Models

Açıkgöz, Zeliha

Date

2024

Abstract

Malware targeting PDF files poses a serious threat to digital security. This study proposes a comprehensive method for classifying PDF files. Using the PyPDF2, PDFMiner, and PyMuPDF libraries, 43 general and structural features were extracted from PDFs. The study has two stages. In the first stage, the dataset was combined into a single column, vectorized with TF-IDF, N-gram Count Vectorizer, and Word2Vec, and used for model training without feature selection. In the second stage, the columns containing text were vectorized with Word2Vec, and feature selection methods were applied before model training. Seven machine learning models and four deep learning models were applied in the first stage. In the second stage, an originally designed Multi-Branch CNN model was used in addition to the machine learning models. SelectKBest, Recursive Feature Elimination (RFE), and Lasso were used for feature selection, and the proposed Multi-Branch CNN architecture was applied to the output of these feature selection methods. In testing, the Multi-Branch CNN model achieved an accuracy of 0.9982 with Lasso feature selection. The machine learning models were evaluated on datasets with and without feature extraction, and accuracy, precision, recall, and F1 score were compared for both stages. The study was tested on a dataset of approximately 30,000 PDF files. The results are intended to provide an effective approach for detecting PDF-based malware.
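
The abstract reports accuracy, precision, recall, and F1 score for both stages. As a reminder of how these four metrics relate, here is a minimal sketch in plain Python. It is not taken from the thesis: the function name `binary_metrics`, the `BinaryMetrics` container, and the label convention (1 = malicious, 0 = benign) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumption (not from the thesis): `positive` marks the malicious class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Hypothetical labels, invented for this example only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
result = binary_metrics(y_true, y_pred)
```

On a dataset where the two classes are unevenly represented, accuracy alone can look high even when many malicious files are missed, which is one reason to report precision and recall alongside it.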

Keywords

Computer Engineering and Computer Science and Control

End Page

80

URI

https://hdl.handle.net/20.500.12416/16013
https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=KOgdn9H3uVnWeb15j2W4h_jXA8d7px3YGpup_G00cXEmXxguEBmZ7o99TBKxhMqt

Collections

Master's Theses
