An Autoencoder Approach to Learning Bilingual Word Representations
Sarath Chandar · Stanislas Lauly · Hugo Larochelle · Mitesh Khapra · Balaraman Ravindran · Vikas C Raykar · Amrita Saha

Mon Dec 08 04:00 PM -- 08:59 PM (PST) @ Level 2, room 210D

Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.
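The core idea in the abstract, reconstructing the bag-of-words of a sentence and of its aligned translation from a shared hidden code, can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the binary bag-of-words input, the sigmoid encoder and decoder, the cross-entropy loss, the shared encoder bias, and every name below (`encode`, `reconstruction_loss`, `bilingual_loss`, `make_params`) are hypothetical choices, and the abstract does not specify any of them. Training (gradient updates) is omitted; the sketch only computes the loss for one aligned sentence pair.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(bow, emb, bias):
    """Hidden code: squashed sum of the embeddings of the words present."""
    pre = list(bias)
    for word_id, present in enumerate(bow):
        if present:
            for k in range(len(pre)):
                pre[k] += emb[word_id][k]
    return [sigmoid(p) for p in pre]


def reconstruction_loss(hidden, dec, dec_bias, target_bow):
    """Binary cross-entropy between the decoded and the target bag-of-words."""
    loss = 0.0
    for word_id, target in enumerate(target_bow):
        logit = dec_bias[word_id] + sum(w * h for w, h in zip(dec[word_id], hidden))
        p = sigmoid(logit)
        loss -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return loss


def make_params(vocab_x, vocab_y, dim, rng):
    """Small random weights; all shapes and names are assumptions of this sketch."""
    def init(rows):
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]
    return {
        "emb_x": init(vocab_x), "emb_y": init(vocab_y),  # word representations
        "enc_bias": [0.0] * dim,
        "dec_x": init(vocab_x), "dec_bias_x": [0.0] * vocab_x,
        "dec_y": init(vocab_y), "dec_bias_y": [0.0] * vocab_y,
    }


def bilingual_loss(bow_x, bow_y, p):
    """Sum of within-language and between-language reconstruction losses."""
    h_x = encode(bow_x, p["emb_x"], p["enc_bias"])
    h_y = encode(bow_y, p["emb_y"], p["enc_bias"])
    total = 0.0
    for hidden in (h_x, h_y):
        # Each code must reconstruct both sentences of the aligned pair.
        total += reconstruction_loss(hidden, p["dec_x"], p["dec_bias_x"], bow_x)
        total += reconstruction_loss(hidden, p["dec_y"], p["dec_bias_y"], bow_y)
    return total


# Toy aligned pair: binary bag-of-words over made-up vocabularies.
rng = random.Random(0)
params = make_params(vocab_x=5, vocab_y=6, dim=3, rng=rng)
bow_en = [1, 0, 1, 0, 0]
bow_de = [0, 1, 0, 0, 1, 1]
loss = bilingual_loss(bow_en, bow_de, params)
```

Because the loss depends only on which words occur in each sentence of the pair, nothing in this sketch needs word-level alignments; minimizing it pushes the rows of `emb_x` and `emb_y` toward a common space.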

Author Information

Sarath Chandar (Mila / Polytechnique Montreal)
Stanislas Lauly (NYU)
Hugo Larochelle (Google DeepMind)
Mitesh Khapra (IBM India Research Lab)
Balaraman Ravindran (Indian Institute of Technology Madras)
Vikas C Raykar (IBM Research)
Amrita Saha (IBM India Research Lab)
