Towards Inclusive NLP: Evaluating LLMs on Low-Resource Indo-Iranian Languages
Abstract
Multilingual large language models (LLMs) have achieved strong performance in high-resource languages, yet their capabilities in low-resource settings remain underexplored. This gap is particularly severe for several Indo-Iranian languages spoken across Muslim communities, such as Farsi/Dari, Pashto, Kurdish, Balochi, Mazandarani, Gilaki, Luri, and Ossetian. These languages represent tens of millions of speakers but receive limited attention in NLP research. In this paper, we present a systematic pilot evaluation of modern multilingual LLMs across six Indo-Iranian languages spanning high-, medium-, and low-resource levels. We assemble small evaluation sets from publicly available resources (Quran translations, Wikipedia, and parallel corpora), define three evaluation tasks (translation, factual question answering, and sentiment classification), and run a reproducible, open experimental protocol comparing open-source models (mBERT, mT5-small, BLOOM-560M) and closed-source APIs (GPT-4, Google Translate). Our analysis highlights a large performance gap between Farsi and more regional/minority languages (Mazandarani, Gilaki, Ossetian), documents common failure modes (cultural mistranslation, hallucination, dialect confusion), and proposes practical steps toward closing the gap, including community-led data collection and lightweight adaptation techniques.
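The per-language, per-task evaluation protocol described above could be organized along the following lines. This is a minimal illustrative sketch, not the paper's actual harness: the language codes, the exact-match scorer, and the toy items are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of a per-(language, task) evaluation loop.
# The scoring function (normalized exact match for factual QA) and the
# example items below are illustrative assumptions, not the paper's data.

from collections import defaultdict

def exact_match(pred: str, gold: str) -> float:
    """Score a factual-QA answer by case-insensitive exact match."""
    return float(pred.strip().casefold() == gold.strip().casefold())

def evaluate(items):
    """items: iterable of (language, task, prediction, gold) tuples.
    Returns the mean score for each (language, task) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (lang, task) -> [score_sum, count]
    for lang, task, pred, gold in items:
        key = (lang, task)
        totals[key][0] += exact_match(pred, gold)
        totals[key][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}

# Toy usage with made-up items ("fa" = Farsi, "os" = Ossetian):
items = [
    ("fa", "qa", "Tehran", "tehran"),
    ("fa", "qa", "Isfahan", "Shiraz"),
    ("os", "qa", "Vladikavkaz", "Vladikavkaz"),
]
scores = evaluate(items)
```

Keeping scores keyed by (language, task) makes the high-/low-resource gap directly comparable across models once the same items are run through each system.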