

Invited Talk by Hilaria Cruz
in
Competition: Second AmericasNLP Competition: Speech-to-Text Translation for Indigenous Languages of the Americas

Challenges in Achieving a Corpus Infrastructure to Advance Research in Computational Linguistics and Natural Language Processing in Native American Languages


Abstract:

Natural Language Processing researchers and computational linguists frequently express disappointment and frustration over the lack of corpora in endangered languages that they can use to train and test their language models. This hindrance is caused in large part by a dwindling number of speakers and language keepers who can create new data such as stories, prayers, political speeches, and everyday conversation. Coupled with this is a severe lack of capacity among speakers of endangered languages to prepare a corpus, including a shortage of transcribers, annotators, and translators. What can NLP researchers do to help create and facilitate corpora in these languages? Collaborating with communities to increase their capacity to develop corpora with community members would be a first step. Furthermore, teaching basic programming courses in local high schools and colleges, working with legacy materials in language archives, and doing fieldwork to collect data alongside community members would greatly enhance the creation of endangered language corpora for NLP.
