Linguistic Resources and Transformer-based Models for the Machine Translations between Luri and Yazdi Dialects versus Standard Persian

Authors
1 PhD Student, Department of Computer Engineering, Sharif University of Technology, AI Group
2 PhD Student, Department of Computer Engineering, Sharif University of Technology, AI Group
3 MSc Student, Department of Computer Engineering, Sharif University of Technology, AI Group
4 Research Assistant, Language Processing and Digital Humanities Lab, Sharif University of Technology
5 BSc Student, Department of Computer Engineering, Sharif University of Technology, AI Group
6 Qatar Computing Research Institute Engineering
7 Associate Professor, Department of Computer Engineering, Sharif University of Technology, AI Group
8 Professor, Department of Computer Engineering, Sharif University of Technology, AI Group
9 Other
Abstract
Despite recent advances in language technologies for Standard Persian, the official language of Iran, many Iranian language varieties remain computationally unexplored. Iranian languages such as Kurdi and Azeri, along with many Persian dialects, are low-resource language varieties that lack significant linguistic resources such as machine-readable lexicons or part-of-speech (POS) taggers. Developing language technologies for such varieties can contribute significantly to their survival in the digital era and promote cultural diversity. To the best of our knowledge, we are the first to create linguistic resources for the Luri and Yazdi dialects, introducing the first parallel corpora between these varieties and modern Persian. In this study, we train two neural encoder-decoder machine translation models, (1) a recurrent sequence-to-sequence model and (2) a transformer-based model, and evaluate the trained models using the BLEU score on an unseen test set.
Availability of datasets and models: Datasets are available at https://github.com/language-ml/dataset_yazdi_luri.git
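To make the BLEU evaluation mentioned in the abstract concrete, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence, uniform weights over 1- to 4-grams, and the standard brevity penalty. It is an illustration only and not necessarily the implementation used in this study; the function names `ngrams` and `corpus_bleu` and the toy token lists are assumptions introduced here, and the sketch applies no smoothing, so any n-gram order with zero matches yields a score of 0.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU; one reference (token list) per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


# Hypothetical toy data (placeholder tokens, not drawn from the corpora).
refs = [["a", "b", "c", "d", "e"]]
hyps = [["a", "b", "c", "d"]]
score = corpus_bleu(hyps, refs)
```

In this toy case every hypothesis n-gram appears in the reference, so the score is determined by the brevity penalty alone. For reported results, an established BLEU implementation with a fixed tokenization is generally preferable to a hand-written one.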