Praveen Tangarajan and Madhvi Sharma
Abstract
Large Language Models (LLMs) have achieved remarkable progress in natural language understanding and generation, yet their performance remains uneven for low-resource Indian languages, where sparse digitized corpora, rich linguistic diversity, and deeply context-dependent cultural norms challenge conventional instruction-tuning and alignment pipelines. Most existing alignment methods, developed primarily for English and other high-resource languages, embed assumptions about discourse structure, politeness, reasoning patterns, and interaction norms that do not transfer reliably to languages such as Assamese, Odia, Kannada, Manipuri, and many other regional Indian languages. As a result, current multilingual LLMs often produce responses that are factually inaccurate, culturally inappropriate, or misaligned with local linguistic expectations. These limitations not only degrade model usability for native speakers but also introduce safety concerns stemming from misinterpretation of social cues, values, and context. This work presents a comprehensive framework for culturally aware instruction tuning and alignment designed specifically for low-resource Indian-language LLMs. The framework integrates three core components: (1) culture-sensitive instruction design, (2) community-informed data curation, and (3) hybrid alignment strategies grounded in both human feedback and automated evaluators trained on culturally relevant signals. Culture-sensitive instruction design ensures that dataset prompts reflect linguistic nuance, region-specific discourse conventions, politeness norms, and localized domain knowledge. Community-informed data curation adopts participatory methodologies that involve native speakers, educators, and linguists in validating translations, generating culturally anchored prompts, and correcting biases inherited from dominant-language datasets. Hybrid alignment strategies introduce techniques such as culturally weighted reward modeling, India-specific preference datasets, and iterative alignment loops that adapt model behavior to regional expectations while maintaining general reasoning capabilities. We situate this framework within recent advances such as the South Asian Instruction Dataset (SAID), selective translation pipelines for bilingual tuning, and post-training datasets like Pragyaan that combine human curation, automated expansion, and synthetic generation to build culturally grounded instruction-following corpora across multiple Indian languages. Parameter-efficient fine-tuning of multilingual base models, when combined with these culturally enriched datasets, enables scalable adaptation without compromising performance in high-resource languages. Experimental evaluations across several low-resource Indian languages demonstrate marked improvements in cultural fidelity, instruction adherence, multilingual fluency, safety, and perceived helpfulness. Notably, hybrid datasets mixing English and native-language samples outperform purely monolingual sets by preserving high-resource reasoning strengths while enhancing cultural specificity. Overall, our findings highlight the necessity of embedding cultural context directly into the instruction-tuning and alignment pipeline to develop equitable, context-aware multilingual LLMs.
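To make the culturally weighted reward modeling concrete, the minimal Python sketch below blends a generic helpfulness score with culture- and safety-specific evaluator scores. The signal names, weights, and linear combination are illustrative assumptions rather than the exact formulation used in our alignment loops.

```python
# Minimal sketch of culturally weighted reward aggregation (illustrative only;
# the signal names, weights, and linear blending rule are assumptions, not the
# exact formulation used in the paper's alignment loop).
from dataclasses import dataclass


@dataclass
class RewardSignals:
    helpfulness: float        # generic preference-model score in [0, 1]
    cultural_fidelity: float  # evaluator trained on culturally relevant signals, in [0, 1]
    safety: float             # safety-classifier score in [0, 1]


def culturally_weighted_reward(signals: RewardSignals,
                               w_culture: float = 0.4,
                               w_safety: float = 0.2) -> float:
    """Blend a generic helpfulness reward with culture- and safety-specific
    signals, so fluent but culturally inappropriate responses are penalized
    relative to a helpfulness-only reward."""
    w_help = 1.0 - w_culture - w_safety
    return (w_help * signals.helpfulness
            + w_culture * signals.cultural_fidelity
            + w_safety * signals.safety)


# A fluent but culturally tone-deaf answer scores below a slightly less fluent
# yet culturally grounded one.
print(culturally_weighted_reward(RewardSignals(0.9, 0.3, 0.8)))  # ~0.64
print(culturally_weighted_reward(RewardSignals(0.8, 0.9, 0.8)))  # ~0.84
```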
By elevating local linguistic practices, incorporating community expertise, and designing culturally grounded evaluation signals, this work outlines a scalable methodology for building culturally aligned LLMs that more authentically serve India’s complex multilingual landscape, and one that can be generalized to other low-resource and marginalized language communities worldwide.
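As a companion illustration of the hybrid-data finding above, the sketch below interleaves English instruction samples with native-language samples at a configurable ratio before parameter-efficient fine-tuning. The function name, default English share, language keys, and toy sample format are hypothetical placeholders, not the settings reported in our experiments.

```python
# Illustrative sketch of assembling a hybrid instruction-tuning mix; the
# default English share, language keys, and sample format are hypothetical
# placeholders, not the ratios reported in the experiments.
import random
from typing import Dict, List


def build_hybrid_mix(english_samples: List[dict],
                     native_samples_by_lang: Dict[str, List[dict]],
                     english_share: float = 0.3,
                     seed: int = 0) -> List[dict]:
    """Interleave English instruction data with native-language samples so a
    parameter-efficiently tuned model keeps high-resource reasoning strengths
    while gaining cultural specificity in the target languages."""
    rng = random.Random(seed)
    native_pool = [ex for samples in native_samples_by_lang.values() for ex in samples]
    # Number of English samples needed so they make up `english_share` of the mix.
    n_english = round(len(native_pool) * english_share / (1.0 - english_share))
    mix = rng.sample(english_samples, min(n_english, len(english_samples))) + native_pool
    rng.shuffle(mix)
    return mix


# Example usage with toy data for Assamese ("as") and Odia ("or") prompts.
mix = build_hybrid_mix(
    english_samples=[{"prompt": f"en-{i}", "response": "..."} for i in range(100)],
    native_samples_by_lang={
        "as": [{"prompt": f"as-{i}", "response": "..."} for i in range(40)],
        "or": [{"prompt": f"or-{i}", "response": "..."} for i in range(30)],
    },
)
print(len(mix))  # 70 native + 30 English = 100 samples
```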