SautiDB-Naija: A Nigerian L2 English Speech Dataset

In this paper, we introduce SautiDB-Naija, a speech corpus of non-native (L2) speakers of English intended for research in accent translation, voice conversion, pronunciation classification, and accent classification. This initial release includes over 900 recordings of speakers whose first language (L1) is among the most common in Nigeria, namely Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. To the best of our knowledge, this is the first documented effort to curate a corpus of Nigerian accents for machine learning research. By training a discriminative classifier on our corpus, we demonstrate that neural networks can learn linguistic features that distinguish between accent classes. This result indicates the potential of SautiDB-Naija as a resource for future computational linguistics research.
