INVESTIGATING TERM WEIGHTING SCHEMES ON THE CLASSIFICATION PERFORMANCE FOR THE IMBALANCED TEXT DATA

Afra Al Manei*, Iman Al-Hasani, Ronald Wesonga

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Term weighting (TW) has been found to yield better results for the text data classification problem. However, little evidence exists on the essential differences among TW schemes in their effect on classification performance. In this study, we present the results of an investigation of the three most popular TW schemes, namely count, term frequency-inverse document frequency (TFIDF), and term frequency-inverse category frequency (TFICF), under the multinomial Naive Bayes (MNB) and support vector machine (SVM) classification algorithms using imbalanced text data. Our results revealed that the count weighting scheme with the MNB gives a higher macro-average recall compared to the other schemes with SVM. On the other hand, the TFICF with the SVM generates a higher macro-average recall compared to the other two schemes. The findings suggest that TW schemes have different effects on the classification of imbalanced text data. Whereas the count weighting scheme performs better in classifying text data using the MNB, the same count scheme with SVM seems to handle the imbalanced data issue better than the count under the MNB classifier. Our findings therefore indicate that classification performance on imbalanced text data can greatly improve when the count weighting scheme is used with the MNB classifier and the TFICF scheme with the SVM classifier. This study is significant as it recommends a benchmark for the use and application of TW schemes for classification algorithms with imbalanced text data.
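The three weighting schemes and the macro-average recall metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the function names, and the exact TFIDF and TFICF formulas (plain tf × log(N/df), and tf × log(1 + C/cf), one of several variants in the literature) are assumptions made here for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus, deliberately imbalanced: 3 "sport" docs, 1 "politics" doc.
docs = [
    (["goal", "match", "team"], "sport"),
    (["team", "win", "match"], "sport"),
    (["match", "score", "goal"], "sport"),
    (["vote", "party", "win"], "politics"),
]

def count_weights(tokens):
    """Count scheme: weight = raw term frequency in the document."""
    return dict(Counter(tokens))

def tfidf_weights(tokens, corpus):
    """TFIDF (assumed variant): tf * log(N / df).
    Assumes the document is part of the corpus, so df >= 1."""
    n_docs = len(corpus)
    weights = {}
    for term, tf in Counter(tokens).items():
        df = sum(1 for toks, _ in corpus if term in toks)
        weights[term] = tf * math.log(n_docs / df)
    return weights

def tficf_weights(tokens, corpus):
    """TFICF (assumed variant): tf * log(1 + C / cf), where C is the number
    of categories and cf the number of categories containing the term.
    Assumes the document is part of the corpus, so cf >= 1."""
    n_categories = len({label for _, label in corpus})
    weights = {}
    for term, tf in Counter(tokens).items():
        cf = len({label for toks, label in corpus if term in toks})
        weights[term] = tf * math.log(1 + n_categories / cf)
    return weights

def macro_recall(y_true, y_pred):
    """Macro-average recall: unweighted mean of per-class recall,
    so a minority class counts as much as a majority class."""
    recalls = []
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        support = sum(1 for t in y_true if t == label)
        recalls.append(tp / support)
    return sum(recalls) / len(recalls)
```

Macro-average recall suits imbalanced data because each class contributes equally to the mean: a classifier that always predicts the majority class on a 3:1 split is 75% accurate but reaches a macro-average recall of only 0.5.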
Original language: English
Pages (from-to): 63-82
Journal: Advances and Applications in Statistics
Volume: 78
Issue number: Vol. 78 (2022)
Publication status: Published - Jun 27 2022
