Poster
in
Affinity Event: Global South AI

Uncovering the Potential of Small Language Models

Christine Mwase

Keywords: small language models digital language divide Synthetic Data

Project Page [ Poster] [ OpenReview]

Abstract

Large language models (LLMs) have revolutionized the field of artificial intelligence (AI), showcasing remarkable capabilities across various domains, including generating creative text and solving mathematical problems. Nevertheless, their enormous data and computational requirements have led to steep development costs. This constraint has resulted in the exclusion of many languages commonly spoken in the developing regions of the world, as the compute-, bandwidth- and labour-intensive task of developing appropriate training datasets poses limitations for researchers based in those regions. This constraint also impedes research progress in tackling the technical, ethical, and legal challenges that may arise, and that are likely to disproportionately affect those regions. In recent studies, researchers working on the English language utilized GPT-3.5 and GPT-4 to construct a small synthetic dataset of short stories consisting of vocabulary familiar to 3 to 4-year-olds. This dataset was used to train small language models (SLMs) that are orders of magnitude smaller than LLMs. Despite their reduced complexity, these SLMs produced coherent stories with diverse content spanning multiple paragraphs exhibiting almost perfect grammar, and delivered advantages beyond the simplification of training data generation. Drawing inspiration from these achievements, and recognizing their potential in addressing the digital language divide, we propose to investigate whether SLMs can be equally effective with other languages and within resource constraints faced by researchers in developing regions.

Chat is not available.