Machine Learning-Based News Classification: Comparison of KNN Accuracy with Hyperparameter Tuning

Muhamad Nur Gunawan
Nuryasin
Syopiansyah Jaya Putra
Sarah Arhami

Abstract

This study aims to develop an automatic news text classification system using the K-Nearest Neighbor (KNN) algorithm with a hyperparameter tuning approach. Because manual classification by editors is inefficient, an accurate and lightweight automated approach is needed. News datasets were obtained through web scraping of bbc.com, covering five main categories: business, technology, entertainment, science, and health. This research follows the CRISP-DM methodology, which consists of six stages: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Features are represented with TF-IDF, and preprocessing includes stopword removal as well as pattern-based noise cleaning. Two experimental scenarios were performed: first, using the complete data without balancing; second, using undersampled, more balanced data. Hyperparameter tuning was performed by varying k from 1 to 50, validated with 5-fold cross-validation. The results showed that the model trained on balanced data with k=11 achieved accuracy, precision, recall, and F1-score of 95%. The system was also implemented as a Flask-based web application that news editors can use for real-time text classification. This study emphasizes the importance of parameter optimization and preprocessing in text classification and shows that simple algorithms such as KNN remain competitive when supported by good data processing.
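The pipeline described in the abstract (TF-IDF features, KNN, and a k-grid from 1 to 50 validated with 5-fold cross-validation) can be sketched as follows. This is a minimal illustration, assuming scikit-learn and a tiny synthetic two-category corpus in place of the scraped BBC dataset; the category words and document texts are invented for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny synthetic stand-in for the scraped BBC corpus (two of the five
# categories shown; the trailing index just makes each document distinct).
texts = (
    [f"stock market shares profit company earnings {i}" for i in range(40)]
    + [f"software chip computer internet technology {i}" for i in range(40, 80)]
)
labels = ["business"] * 40 + ["technology"] * 40

# TF-IDF representation with stopword removal, feeding a KNN classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("knn", KNeighborsClassifier()),
])

# Vary k from 1 to 50 with 5-fold cross-validation, as in the study.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 51))},
    cv=5,
    scoring="accuracy",
)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

On the real, noisier corpus the best k reported by the study was 11; on this toy data the optimum will differ, which is exactly why the grid search is re-run per dataset.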
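The Flask deployment mentioned in the abstract could look like the sketch below. The route name (`/classify`), the JSON payload shape, and the `EchoModel` placeholder are all assumptions for illustration; in the real system the trained TF-IDF + KNN pipeline would be loaded in place of the placeholder.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class EchoModel:
    """Placeholder for the trained TF-IDF + KNN pipeline loaded at startup."""

    def predict(self, texts):
        # Always returns "business"; a real model would classify each text.
        return ["business" for _ in texts]


model = EchoModel()


@app.route("/classify", methods=["POST"])
def classify():
    # Expect JSON like {"text": "..."} from the editor-facing front end.
    text = request.get_json()["text"]
    return jsonify({"category": model.predict([text])[0]})


if __name__ == "__main__":
    app.run()
```

An editor-facing front end would POST the article text and display the returned category, giving the real-time classification described in the study.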

Article Details

How to Cite
Muhamad Nur Gunawan, Nuryasin, Syopiansyah Jaya Putra, & Sarah Arhami. (2025). Machine Learning-Based News Classification: Comparison of KNN Accuracy with Hyperparameter Tuning. Jurnal Informasi Dan Teknologi, 114-120. https://doi.org/10.60083/jidt.vi0.661
Section
Articles
