Fairness Dynamics During Training
Abstract
Understanding fairness dynamics during Large Language Model (LLM) training facilitates the diagnosis of biases as they emerge and enables developers to mitigate them through early stopping or other training interventions. We introduce two new metrics for holistically evaluating fairness dynamics during model pre-training: Average Rank and Jensen-Shannon Divergence by Parts. These metrics provide insight into how the Pythia models' biases in gender prediction of occupations progress over training, evaluated on the WinoBias dataset. We find that Pythia-6.9b becomes more performant and more confident when predicting "male" than "female" during training. By monitoring these dynamics, we find that early stopping allows Pythia-6.9b to exchange 1.7% accuracy on LAMBADA for a 92.5% increase in fairness. We also find that Pythia-6.9b is more likely than Pythia-160m to exhibit bias and make assumptions about gender, even when a subject's gender is not specified.
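As background, a minimal sketch of the standard Jensen-Shannon divergence that the "by Parts" variant builds on is given below; the per-part decomposition is defined in the body of the paper, and the distributions and checkpoint values in the example are illustrative assumptions, not results from our experiments.

```python
import numpy as np


def _kl(p: np.ndarray, m: np.ndarray) -> float:
    """KL(P || M) in bits, treating 0 * log(0) as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / m[mask])))


def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Standard Jensen-Shannon divergence:
    JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    """
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Illustrative only: probability mass a model might place on "male" vs.
# "female" completions at an early and a late training checkpoint.
early = np.array([0.55, 0.45])
late = np.array([0.80, 0.20])
print(f"JSD between checkpoints: {js_divergence(early, late):.4f} bits")
```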