Keywords: [ ENLSP-Main ]
Unsupervised translation generally refers to the challenging task of translating between two languages without parallel translations, i.e., from two separate monolingual corpora. In this work, we propose an information-theoretic framework of unsupervised translation that is well suited even to the case where the source language is that of highly intelligent animals, such as whales, and the target language is a human language, such as English. We identify two conditions that together allow for unsupervised translation: (1) there is access to a prior distribution over the target language that estimates the likelihood that a sentence was translated from the source language, and (2) most alterations of translations are deemed implausible by the prior. We then give an (inefficient) algorithm which, given access to the prior and unlabeled source examples as input, outputs a provably accurate translation function. We prove upper bounds on the number of samples needed by our algorithm. Surprisingly, our analysis suggests that the amount of source data required for unsupervised translation is not significantly greater than that required for supervised translation. To support the viability of our theory, we propose a simplified probabilistic language model: the random sub-tree language model, in which sentences correspond to paths in a randomly labeled tree. We prove that random sub-tree languages satisfy conditions (1) and (2) with high probability, and are therefore translatable by our algorithm. Our theory is motivated by a recent initiative to translate whale communication using modern machine translation techniques. The recordings of whale communications that are being collected have no parallel human-language data. Our work seeks to inform this ambitious effort by modeling unsupervised translation. We are further motivated by recent empirical work, reported in the machine learning literature, demonstrating that unsupervised translation is possible in certain settings.
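
To make the roles of the prior and the unlabeled source examples concrete, the following toy sketch shows one way such a procedure could look. It is an illustration under stated assumptions, not the algorithm analyzed in this work: the translator family (word-for-word mappings), the vocabularies, the hand-written prior, and the names `log_prior`, `candidate_translators`, and `fit_translator` are all hypothetical. The sketch exhaustively scores every candidate translator by how plausible the prior finds its outputs, which is inefficient by design, and the tiny floor probability stands in for condition (2), under which altered translations are deemed implausible.

```python
import itertools
import math

# Hypothetical toy vocabularies (assumptions for illustration only).
SOURCE_VOCAB = ["a", "b", "c"]
TARGET_VOCAB = ["fish", "swim", "deep"]

# Hypothetical prior over target sentences: a few plausible sentences carry
# most of the mass, and every other sentence gets a tiny floor probability.
PLAUSIBLE = {
    ("fish", "swim"): 0.5,
    ("fish", "swim", "deep"): 0.4,
}
FLOOR = 1e-6


def log_prior(sentence):
    """Log-probability the toy prior assigns to a target sentence."""
    return math.log(PLAUSIBLE.get(tuple(sentence), FLOOR))


def candidate_translators():
    """Enumerate every one-to-one word mapping from source to target."""
    for perm in itertools.permutations(TARGET_VOCAB):
        yield dict(zip(SOURCE_VOCAB, perm))


def translate(mapping, sentence):
    """Apply a word-for-word mapping to a source sentence."""
    return [mapping[token] for token in sentence]


def fit_translator(source_samples):
    """Pick the candidate whose translations the prior finds most plausible."""
    def score(mapping):
        return sum(log_prior(translate(mapping, s)) for s in source_samples)
    return max(candidate_translators(), key=score)


# Unlabeled source sentences only; no parallel data is used.
source_samples = [["a", "b"], ["a", "b", "c"], ["a", "b"]]
best = fit_translator(source_samples)
print(translate(best, ["a", "b", "c"]))  # → ['fish', 'swim', 'deep']
```

In this toy setting only one mapping sends every source sample to a sentence the prior considers plausible, so the search recovers it. Any altered mapping pushes at least one sample to the floor probability and is rejected.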