Contrastive learning has emerged as a powerful form of unsupervised representation learning for images. The utility of the learned representations for downstream tasks depends strongly on the chosen augmentation operations. Taking inspiration from biology, we study contrastive learning through time (CLTT), which works entirely without augmentation operations. Instead, positive pairs of images are generated from temporally close video frames recorded during extended naturalistic interaction with objects. We propose a family of CLTT algorithms based on state-of-the-art contrastive learning methods and test them on three data sets. We demonstrate that CLTT yields linear classification performance approaching that of the fully supervised setting. We also consider temporal structure that results from one object being seen systematically before or after another object. We show that this leads to increased representational similarity between these objects ("close in time, will align"), matching classic biological findings. The data sets and code for this paper can be downloaded at: (to be added for final version).
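
To make the idea of temporally generated positive pairs concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes a single video given as an ordered list of frame embeddings, a hypothetical `max_gap` parameter that bounds how far apart two frames of a positive pair may be, and an InfoNCE-style loss as one possible contrastive objective. The function names and the temperature value are illustrative assumptions.

```python
import math
import random


def sample_temporal_pairs(num_frames, max_gap, rng):
    """Pair every frame index with a temporally close partner index.

    Assumes num_frames >= 2 and max_gap >= 1 (illustrative parameters).
    """
    pairs = []
    for t in range(num_frames):
        if t + 1 < num_frames:
            # Partner is one of the next max_gap frames, clipped at the end.
            partner = rng.randint(t + 1, min(t + max_gap, num_frames - 1))
        else:
            # The last frame has no successor, so it pairs backward in time.
            partner = rng.randint(max(t - max_gap, 0), t - 1)
        pairs.append((t, partner))
    return pairs


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.5):
    """Mean InfoNCE-style loss over a batch.

    positives[i] is the positive for anchors[i]; all other entries of
    positives serve as negatives for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)
```

In this sketch no augmentation is applied at any point: the only source of positive pairs is temporal proximity, and the loss is lower when embeddings of temporally close frames are more similar to each other than to embeddings of the other frames in the batch.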