News Classification using Natural Language Processing with TF-IDF and Multinomial Naïve Bayes

Nadira Alifia Ionendri; Feri Candra; Afdi Rizal

doi:10.52158/jacost.v6i1.1099

Nadira Alifia Ionendri Universitas Riau
Feri Candra Universitas Riau
Afdi Rizal Badan Pusat Statistik Provinsi Riau


Keywords: news, classification, NLP

Abstract

Online news offers insight into public phenomena that can support statistical analysis by institutions such as BPS Riau. At present, however, news is classified manually, which is time-consuming and prone to human error. This study proposes an automated news classification system that uses Natural Language Processing (NLP), with Term Frequency–Inverse Document Frequency (TF-IDF) for feature extraction and the Multinomial Naïve Bayes algorithm for classification. The dataset was collected by web scraping and manually labeled into five statistical categories: poverty, unemployment, democracy, inflation, and economic growth. The system achieved a validation accuracy of 83% and a test accuracy of 90%, with an average precision of 0.85, recall of 0.93, and F1-score of 0.87. These results indicate that the proposed system can substantially reduce the manual workload of news classification and could be deployed at BPS Riau to support accurate and timely statistical reporting.
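
As a rough illustration of the approach named in the abstract, the sketch below computes TF-IDF weights and trains a Multinomial Naïve Bayes classifier using only the Python standard library. It is not the authors' implementation: the toy headlines, the whitespace tokenizer, the smoothed IDF formula, the Laplace smoothing value alpha = 1.0, and all function and class names are assumptions made for this example, and only three of the five categories appear in the toy data.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()

def fit_idf(docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing: ln((1 + N) / (1 + df)) + 1 keeps every weight positive.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def tfidf(doc, idf):
    """Map a tokenized document to a sparse {term: tf * idf} vector.
    Terms not seen during training are dropped."""
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

class MultinomialNB:
    """Multinomial Naive Bayes over (possibly fractional) TF-IDF weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, vectors, labels, vocab):
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(k / len(labels)) for c, k in class_counts.items()}
        # Accumulate the total TF-IDF weight of each term within each class.
        totals = {c: defaultdict(float) for c in class_counts}
        for vec, label in zip(vectors, labels):
            for term, weight in vec.items():
                totals[label][term] += weight
        self.log_likelihood = {}
        for c, weights in totals.items():
            denom = sum(weights.values()) + self.alpha * len(vocab)
            self.log_likelihood[c] = {
                t: math.log((weights.get(t, 0.0) + self.alpha) / denom) for t in vocab
            }
        return self

    def predict(self, vec):
        # Score each class as log prior plus weighted sum of term log-likelihoods.
        scores = {
            c: self.log_prior[c] + sum(w * self.log_likelihood[c][t] for t, w in vec.items())
            for c in self.log_prior
        }
        return max(scores, key=scores.get)

# Invented toy headlines; NOT the dataset used in the study.
TRAIN = [
    ("poverty rate rises in rural villages", "poverty"),
    ("poor households receive social assistance", "poverty"),
    ("unemployment climbs as factories cut jobs", "unemployment"),
    ("job seekers crowd the job fair", "unemployment"),
    ("inflation pushes food prices higher", "inflation"),
    ("consumer prices rise as inflation accelerates", "inflation"),
]

train_docs = [tokenize(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_docs)
model = MultinomialNB(alpha=1.0).fit(
    [tfidf(doc, idf) for doc in train_docs], train_labels, vocab=set(idf)
)

def classify(text):
    """Predict the category of a raw headline with the toy model."""
    return model.predict(tfidf(tokenize(text), idf))

if __name__ == "__main__":
    print(classify("food prices rise again"))
```

A real system would more likely rely on an established library and on Indonesian-specific preprocessing such as stemming and stop-word removal; the from-scratch version is used here only to keep each step of the computation visible.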

Published

2025-06-24

How to Cite

[1] Nadira Alifia Ionendri, Feri Candra, and Afdi Rizal, “News Classification using Natural Language Processing with TF-IDF and Multinomial Naïve Bayes,” J. Appl. Comput. Sci. Technol., vol. 6, no. 1, pp. 37–45, Jun. 2025.

Issue

Vol 6 No 1 (2025): June 2025

Section

Articles

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Copyright and License Statement

By submitting a manuscript to the Journal of Applied Computer Science and Technology (JACOST), authors agree to this policy. No separate consent document is required.

Copyright in each article belongs to the authors.
Authors retain all rights to their published work, including but not limited to the rights set out on this page.
Authors acknowledge the Journal of Applied Computer Science and Technology (JACOST) as the first publisher of the work under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) license.
Authors may enter into separate, non-exclusive arrangements to distribute the version of the manuscript published in this journal (for example, depositing it in an institutional repository or publishing it in a book), provided they acknowledge that it was first published in the Journal of Applied Computer Science and Technology (JACOST).
Authors warrant that the article is original, was written by the named authors, has not been published previously, contains no unlawful statements, does not infringe the rights of others, and is subject to copyright held exclusively by the authors.
If the article was prepared jointly by more than one author, the submitting author warrants that they have been authorized by all co-authors to agree to this copyright and license notice (agreement) on their behalf, and agrees to inform the co-authors of the terms of this policy. The Journal of Applied Computer Science and Technology (JACOST) will not be held liable for anything that may arise from disputes among the authors.

License:

The Journal of Applied Computer Science and Technology (JACOST) is published under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) license. This license permits anyone to:

Share: copy and redistribute the material in any medium or format;
Adapt: remix, transform, and build upon the material for any purpose.

License terms:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.