Bilgi Teknolojisi Bölümü Tezleri
Permanent URI for this collection: https://hdl.handle.net/20.500.12416/298
Citation: Al-Gartanee, Asmsaa (2015). The effectiveness of feature selection metrics on the text categorization performance / Özellik belirleme matriksinin metin siniflandirma sisteminin performansi üzerindeki etkisi. Published master's thesis. Ankara: Çankaya Üniversitesi, Fen Bilimleri Enstitüsü.

The effectiveness of feature selection metrics on the text categorization performance (2015)
Al-Gartanee, Asmsaa; Çankaya Üniversitesi, Fen Bilimleri Enstitüsü, Bilgisayar Teknolojisi Bilim Dalı

Text Categorization (TC) is an important intelligent information processing technology. It has high value in information retrieval, e-government, information filtering, text databases, digital libraries, and other areas, but the problem of feature selection is as important as text categorization itself, or more so. In this thesis, we ran our experiments on the standard Reuters-21578 dataset and discuss topics ranging from collecting the data, to organizing it, to using the organized data to test the feature selection metrics efficiently.

The general idea of any feature selection metric is to score the importance of words with a measure that keeps informative words and removes non-informative ones, which in turn helps the text categorization engine assign a document, D, to a category, C. The feature selection metrics discussed in this thesis are Term Frequency-Inverse Document Frequency (TF-IDF), Document Frequency (DF), Mutual Information (MI), Chi-square Statistics (CHI), and the GSS (Galavotti-Sebastiani-Simi) Coefficient. The thesis combines the TF-IDF and DF metrics to prepare the texts for classification.
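The abstract names the feature selection metrics but does not give their formulas. As a rough illustration only, the following Python sketch computes DF, IDF, MI, CHI, and GSS for a single term and category from a 2x2 table of document counts. It uses the definitions commonly found in the text categorization literature, which is an assumption and may differ from the exact variants used in the thesis. The function name `term_category_scores` and the counts `a`, `b`, `c`, `d` are hypothetical.

```python
import math


def term_category_scores(a, b, c, d):
    """Score one term against one category from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("at least one document is required")

    df = a + b  # Document Frequency: how many documents contain the term

    # Inverse document frequency; 0.0 when the term never occurs.
    idf = math.log(n / df) if df else 0.0

    # Pointwise Mutual Information: log( P(t, c) / (P(t) * P(c)) ).
    mi = math.log((a * n) / ((a + c) * (a + b))) if a else float("-inf")

    # Chi-square statistic for the 2x2 term/category contingency table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # GSS coefficient: P(t, c) * P(~t, ~c) - P(t, ~c) * P(~t, c).
    gss = (a * d - b * c) / (n * n)

    return {"df": df, "idf": idf, "mi": mi, "chi": chi, "gss": gss}


# Made-up counts for illustration; they are not taken from Reuters-21578.
scores = term_category_scores(a=40, b=10, c=10, d=40)
```

In a feature selection step, terms would typically be ranked by one of these scores and only the top-ranked terms kept as features.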
The prepared texts are then passed to the classification process in Weka to identify the best-performing machine learning algorithms and the best system performance, by computing performance measures such as accuracy, error rate, recall, precision, and F-measure. We compare the reusability of popular active learning algorithms for text classification and identify the best classifiers to use in active learning for text classification. All of these measures were computed and plotted.
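The performance measures listed in the abstract can all be derived from the four cells of a binary confusion matrix. The sketch below is a minimal Python illustration, not the thesis's Weka workflow. It assumes the balanced F-measure (F1), since the abstract does not say which weighting was used. The function name `performance_measures` and the example counts are hypothetical.

```python
def performance_measures(tp, fp, fn, tn):
    """Compute per-category measures from confusion-matrix counts.

    tp: documents correctly assigned to the category
    fp: documents wrongly assigned to the category
    fn: documents of the category that were missed
    tn: documents correctly left out of the category
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one classified document is required")

    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total  # complement of accuracy

    # Guard against empty denominators when nothing was predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Balanced F-measure (F1): harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    return {
        "accuracy": accuracy,
        "error_rate": error_rate,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up counts for illustration; they are not results from the thesis.
measures = performance_measures(tp=8, fp=2, fn=2, tn=88)
```

For a multi-category corpus such as Reuters-21578, these values would usually be computed per category and then averaged, although the abstract does not state which averaging scheme was applied.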