Design and Preparation of Persian Labeled Dataset from COVID-19 News for Fake News Detection

Zahed, Forough; Bahrani, Mohammad; Mansouri, Alireza

doi:10.30465/lsi.2024.47711.1729

Authors

Forough Zahed ¹

Mohammad Bahrani ²

Alireza Mansouri ³

¹ Department of Computer Science, Faculty of Statistics, Mathematics and Computer, Allameh Tabataba'i University, Tehran, Iran

² Department of Computer Science, Faculty of Statistics, Mathematics and Computer, Allameh Tabataba'i University, Tehran, Iran

³ ICT Research Institute (ITRC), Tehran, Iran

Abstract

Fake news detection using content features has attracted many researchers in recent years. These approaches rely mainly on news datasets and on analyzing their style and content. Although several fake news datasets exist in English, fake news detection in Persian suffers from a lack of suitable datasets. This article introduces a manually labeled Persian fake news dataset containing about 5000 posts related to COVID-19, extracted from the Telegram messenger. The dataset was built in two stages: 1) data collection and pre-processing; and 2) manual labeling using a fixed rule set and an established framework. The labeling stage comprises seven tasks: 1) Factual; 2) Hate, blame, and negative speech; 3) Raising morale, encouragement, and advice; 4) Political news; 5) Death statistics; 6) Cure, medicine, and health care; and 7) Worth considering for fact-checking. Each labeling task uses three labels: “Yes”, “No”, and “Can’t decide”. The main labeling task, “Factual”, is assigned to two annotators; when they disagree, the label assigned by a third annotator is accepted. The kappa measure of inter-annotator agreement is 0.706, which falls in the substantial range. The dataset is about 10 times larger than similar Persian datasets and can be used not only for fake news studies but also for other Persian Natural Language Processing (NLP) studies.
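The adjudication rule and the agreement measure described in the abstract can be illustrated with a minimal sketch. It assumes the reported kappa is Cohen's kappa computed between the two primary annotators of the “Factual” task, which the abstract does not state explicitly. The function names `cohen_kappa` and `adjudicate` and the toy labels are hypothetical and are not taken from the dataset. On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are described as substantial agreement.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences (assumed measure)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def adjudicate(a, b, c):
    """Keep the label when annotators 1 and 2 agree; otherwise use annotator 3's."""
    return [x if x == y else z for x, y, z in zip(a, b, c)]


# Toy labels for six posts (illustrative only, not real data).
annotator_1 = ["Yes", "Yes", "No", "No", "Yes", "Can't decide"]
annotator_2 = ["Yes", "No", "No", "No", "Yes", "Yes"]
annotator_3 = ["Yes", "Yes", "No", "No", "Yes", "No"]

kappa = cohen_kappa(annotator_1, annotator_2)
final_labels = adjudicate(annotator_1, annotator_2, annotator_3)
print(round(kappa, 3))
print(final_labels)
```

In this toy case the two annotators agree on four of six posts, and the third annotator's label is used only for the two posts where they disagree.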

Keywords

fake news

COVID-19 pandemic

labeled dataset

social networks

Volume 19, Issue 37
February 2024
Pages 173-192
