HealthAlign-Agents: Self-Play Reflective Prompting for Culturally Aligned Health Communication in Low-Resource Languages
Abstract
Large Language Models (LLMs) have shown impressive performance in medical and health-related question answering, yet their dependence on fine-tuning with large, English-centric datasets limits their effectiveness and cultural alignment in low-resource settings. This work introduces HealthAlign-Agents, a fine-tuning-free, agentic self-play framework that enables accurate and culturally grounded health communication across underrepresented languages. Unlike conventional biomedical LLMs such as Med-PaLM 2 [Singhal et al., 2023] or BioMistral [Nguyen et al., 2024], our approach improves continually through prompt-based reflection rather than gradient updates. The system consists of three collaborative agents: a Patient-Agent that simulates realistic health queries in local dialects, a Health-Advisor-Agent that retrieves and summarizes verified guidance from structured repositories (e.g., WHO, PubMed, and national health protocols), and a Health-Judge-Agent that evaluates generated advice for factuality, empathy, and cultural appropriateness. Through iterative self-play cycles, these agents refine one another: the Judge produces structured critiques that are embedded into the Advisor's future prompts, while the Patient-Agent dynamically evolves its questions to probe ambiguous or culturally sensitive scenarios.
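To make the agent roles concrete, the Python sketch below illustrates one Patient-to-Advisor-to-Judge round under stated assumptions: the `complete` callable is a hypothetical stand-in for any chat-LLM API, and `advisor_context` stands in for retrieved source summaries; neither is part of the system described in the paper.

```python
# Minimal, illustrative sketch of one HealthAlign-Agents self-play round.
# `complete` is a hypothetical prompt-in / text-out LLM call, not a real API.
from typing import Callable, Dict

LLM = Callable[[str], str]

def self_play_round(complete: LLM, topic: str, language: str, advisor_context: str) -> Dict[str, str]:
    """Run one Patient -> Advisor -> Judge cycle and return the artifacts."""
    # Patient-Agent: simulate a realistic health query in the local dialect.
    query = complete(
        f"You are a patient speaking {language}. "
        f"Ask a realistic, culturally situated question about: {topic}."
    )
    # Health-Advisor-Agent: answer grounded only in retrieved guidance
    # (retrieval is omitted here; advisor_context stands in for summaries
    # of WHO / PubMed / national-protocol passages).
    answer = complete(
        f"Context from verified health sources:\n{advisor_context}\n\n"
        f"Answer the patient's question in {language}, using only this context:\n{query}"
    )
    # Health-Judge-Agent: structured critique along the three axes
    # (factuality, empathy, cultural appropriateness).
    critique = complete(
        "Critique the following health advice. Rate factual accuracy, empathetic tone, "
        "and cultural appropriateness, and briefly justify each rating.\n"
        f"Question: {query}\nAdvice: {answer}"
    )
    return {"query": query, "answer": answer, "critique": critique}
```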
Methodologically, HealthAlign-Agents formalizes multilingual health communication as a multi-agent reflective optimization process in which each agent contributes to both knowledge grounding and meta-cognitive regulation. Each self-play cycle proceeds through three structured reasoning stages: (1) Knowledge grounding: the Advisor uses retrieval-augmented prompting to construct contextually constrained responses from authoritative health sources, followed by semantic filtering that removes unsupported claims and aligns terminology with local vernaculars; (2) Reflective critique: the Judge, instantiated as a secondary LLM equipped with a dual-objective factuality-and-empathy head, evaluates the Advisor's output and produces a structured feedback vector $v_{\text{crit}} = [r_f, r_e, r_c]$, corresponding to factual accuracy, empathetic tone, and cultural relevance; and (3) Prompt refinement: the critique vector is embedded into a meta-prompt template $P_{t+1} = f(P_t, v_{\text{crit}})$, allowing the Advisor to update its reasoning trajectory in the next iteration. To maintain diversity and prevent over-correction, a stochastic, temperature-controlled replay buffer stores past critiques, enabling episodic reflection reminiscent of human memory consolidation.
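The sketch below illustrates stage (3) and the replay buffer under stated assumptions: the `Critique` fields mirror $v_{\text{crit}} = [r_f, r_e, r_c]$, and the severity-weighted, temperature-softened sampling rule is one plausible reading of "stochastic temperature-controlled replay," not the paper's exact mechanism; `refine_prompt` is likewise a hypothetical instance of $P_{t+1} = f(P_t, v_{\text{crit}})$.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Critique:
    """Judge feedback v_crit = [r_f, r_e, r_c] plus free-text notes."""
    r_factual: float   # factual accuracy in [0, 1]
    r_empathy: float   # empathetic tone in [0, 1]
    r_cultural: float  # cultural relevance in [0, 1]
    notes: str = ""

@dataclass
class ReplayBuffer:
    """Temperature-controlled store of past critiques (episodic reflection)."""
    items: List[Critique] = field(default_factory=list)
    temperature: float = 1.0  # higher => sampling closer to uniform over history

    def add(self, c: Critique) -> None:
        self.items.append(c)

    def sample(self, k: int = 2) -> List[Critique]:
        if not self.items:
            return []
        # Assumed rule: weight past critiques by severity (lower scores => higher
        # weight), softened by the temperature.
        severities = [3.0 - (c.r_factual + c.r_empathy + c.r_cultural) for c in self.items]
        weights = [math.exp(s / self.temperature) for s in severities]
        return random.choices(self.items, weights=weights, k=min(k, len(self.items)))

def refine_prompt(prev_prompt: str, crit: Critique, buffer: ReplayBuffer) -> str:
    """One possible meta-prompt update P_{t+1} = f(P_t, v_crit): embed the current
    critique and a few replayed past critiques as instructions for the next turn."""
    replayed = "\n".join(f"- earlier issue: {c.notes}" for c in buffer.sample())
    return (
        f"{prev_prompt}\n\n"
        f"Reviewer feedback (factual={crit.r_factual:.2f}, "
        f"empathy={crit.r_empathy:.2f}, cultural={crit.r_cultural:.2f}):\n"
        f"{crit.notes}\n{replayed}\n"
        "Revise your next answer to address these points while staying within the cited sources."
    )
```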