OccGen: Selection of Real-world Multilingual Parallel Data Balanced in Gender within Occupations

Marta Costa-jussà · Christine Basta · Oriol Domingo · André Rubungo

Hall J (level 1) #1011

Keywords: [ Gender ] [ Balanced Multilingual Data Set ] [ machine translation ] [ Occupations ]


This paper describes the OCCGEN toolkit, which extracts multilingual parallel data balanced in gender within occupations. OCCGEN can extract datasets that reflect gender diversity (beyond binary) in society more fairly, and these datasets can then be used to explicitly mitigate occupational gender stereotypes. We propose two use cases that extract evaluation datasets for machine translation in four high-resource languages from different linguistic families and in a low-resource African language. Our analysis of these use cases shows that translation outputs in high-resource languages tend to worsen in feminine subsets (compared to masculine ones). This can be explained by the model paying less attention to the source sentence and more to the target prefix, which leads it to overgeneralize to the most frequent masculine forms.
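
To make the idea of "balanced in gender within occupations" concrete, the following is a minimal Python sketch of one way such balancing could be done, by downsampling each occupation to the size of its smallest gender group. It is not the OCCGEN implementation: the function name `balance_within_occupations`, the record fields `occupation`, `gender`, and `text`, and the downsampling strategy itself are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_within_occupations(records, genders, seed=0):
    """Downsample so every occupation has equally many records per gender.

    Hypothetical schema (not taken from OCCGEN): each record is a dict
    with the keys 'occupation', 'gender', and 'text'.
    Occupations that lack any of the requested genders are dropped.
    """
    rng = random.Random(seed)

    # Group records as occupation -> gender -> list of records.
    groups = defaultdict(lambda: defaultdict(list))
    for rec in records:
        groups[rec["occupation"]][rec["gender"]].append(rec)

    balanced = []
    for occupation, by_gender in groups.items():
        # Size of the smallest requested gender group for this occupation.
        n = min(len(by_gender.get(g, [])) for g in genders)
        if n == 0:
            continue  # cannot balance this occupation, so skip it
        for g in genders:
            balanced.extend(rng.sample(by_gender[g], n))
    return balanced
```

Under these assumptions, an occupation with three masculine sentences and one feminine sentence would contribute one of each, and an occupation with sentences for only one gender would be left out entirely.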
