Natural environments have temporal structure at multiple timescales, a property that is reflected in biological learning and memory but typically not in machine learning systems. This paper advances a multiscale learning model in which each weight in a neural network is a sum of subweights learning independently at different timescales. A special case of this model is a fast-weights scheme, in which each original weight is augmented with a fast weight that rapidly learns and decays, enabling adaptation to distribution shifts during online learning. We then prove that more complicated models that assume coupling between timescales are equivalent to the multiscale learner, via a reparameterization that eliminates the coupling. Finally, we prove that momentum learning is equivalent to fast weights with a negative learning rate, offering a new perspective on how and when momentum is beneficial.
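
As an illustration of the last claim, the following minimal sketch checks numerically that heavy-ball momentum and a two-timescale fast-weights learner trace the same weight trajectory on a toy quadratic loss. The update rules, the function names `run_momentum` and `run_fast_weights`, the loss, and the hyperparameter values are assumptions chosen for illustration and need not match the parameterization used in the paper. Under these assumptions, the mapping is a slow learning rate of lr / (1 - beta), a fast-weight decay of beta, and a fast learning rate of -lr * beta / (1 - beta), which is negative.

```python
import numpy as np

def run_momentum(grad, w0, lr, beta, steps):
    """Heavy-ball momentum (assumed form): v <- beta*v + g, w <- w - lr*v."""
    w, v = w0, 0.0
    traj = [w]
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
        traj.append(w)
    return np.array(traj)

def run_fast_weights(grad, w0, lr_slow, lr_fast, decay, steps):
    """Effective weight is slow + fast; the fast weight decays at every step."""
    slow, fast = w0, 0.0
    traj = [slow + fast]
    for _ in range(steps):
        g = grad(slow + fast)             # gradient at the effective weight
        slow = slow - lr_slow * g         # slow weight: plain gradient step, no decay
        fast = decay * fast - lr_fast * g # fast weight: decay, then gradient step
        traj.append(slow + fast)
    return np.array(traj)

# Toy problem (hypothetical): loss(w) = (w - 3)^2, so grad(w) = 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)
lr, beta, steps = 0.05, 0.9, 50

mom = run_momentum(grad, w0=0.0, lr=lr, beta=beta, steps=steps)
fw = run_fast_weights(
    grad,
    w0=0.0,
    lr_slow=lr / (1.0 - beta),
    lr_fast=-lr * beta / (1.0 - beta),  # negative fast learning rate
    decay=beta,
    steps=steps,
)

# Largest gap between the two trajectories (floating-point rounding only).
print(np.max(np.abs(mom - fw)))
```

The two trajectories agree up to rounding because, in this sketch, the fast weight holds a decaying sum of past gradients that cancels part of the larger slow step, leaving exactly the momentum update.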
