A Corpus-based Study of Using Function and Content Words in Persian Authorship Attribution

Authors

1 General Linguistics Group, Faculty of Persian Language & Literature, Allameh Tabataba'i University, Tehran, Iran

2 Department of Linguistics, Allameh Tabataba'i University

3 Department of Computer, Allameh Tabataba'i University

4 Department of English Language and Translation, Islamic Azad University, Karaj

Abstract

Corpora are now widely used in authorship attribution. In this study, a corpus of contemporary Persian texts was used to identify the authors of texts, and the effectiveness of function words and content words in this task was compared. To this end, seven contemporary writers (Hoshang Golshiri, Bozorg Alavi, Ahmad Mahmoud, Mahmoud Dolatabadi, Nader Ebrahimi, Jalal Al Ahmad, and Gholamhossein Saedi) were selected and their books were collected. Using this corpus and deep learning algorithms such as the multilayer perceptron and Long Short-Term Memory (LSTM), the effectiveness of function and content words was evaluated. The results indicated that the method based on function words outperformed the one based on content words in authorship attribution. In addition, among the types of function words, pronouns, especially demonstrative and personal pronouns, were the most effective in determining the author of a text. Features based on conjunctions and auxiliary verbs were also valuable for recognizing Persian writers.
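As a rough illustration of the two feature types compared in the abstract, the following minimal sketch builds a function-word frequency vector and a content-word frequency vector from a tokenized text. It is not the study's pipeline: the function-word list, the `function_word_features` and `content_word_features` helpers, the sample sentence, and the whitespace tokenization are all assumptions made for illustration, and the classifier stage (multilayer perceptron or LSTM) is omitted.

```python
from collections import Counter

# Hypothetical, tiny list of Persian function words (illustrative only;
# the study's actual word lists are not given in the abstract).
FUNCTION_WORDS = ["و", "که", "این", "آن", "من", "او", "را", "از", "به", "است"]


def function_word_features(tokens, function_words=FUNCTION_WORDS):
    """Relative frequency of each function word in a tokenized text."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in function_words]


def content_word_features(tokens, vocabulary, function_words=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word among non-function tokens."""
    excluded = set(function_words)
    content = [t for t in tokens if t not in excluded]
    counts = Counter(content)
    total = len(content) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in vocabulary]


# Whitespace tokenization is a simplification; real Persian text would need
# proper normalization and tokenization.
sample = "او این کتاب را به من داد و من آن را خواندم".split()
fw_vector = function_word_features(sample)
cw_vector = content_word_features(sample, vocabulary=["کتاب", "داد", "خواندم"])
```

Vectors of this kind, computed per text, could then be passed to any classifier to compare how well each feature type separates authors.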

Keywords