Most (3D) multi-object tracking methods rely on object-level information, e.g. appearance, for data association. By contrast, we investigate how far we can get by considering only geometric relationships and interactions between objects over time. We represent each tracking sequence as a multiplex graph where 3D object detections are nodes, and spatial and temporal pairwise relations among them are encoded via two types of edges. This structure allows our graph neural network to consider all types of interactions and distinguish temporal, contextual and motion cues to obtain final scene interpretation by posing tracking as edge classification. The model outputs classification results after multiple rounds of neural message passing, during which it is able to reason about long-term object trajectories, influences and motion based solely on initial pairwise relationships. To enable our method for online (streaming) scenarios, we introduce a technique to continuously evolve our graph over long tracking sequences to achieve good performance while maintaining sparsity with linear complexity for the number of edges. We establish a new state-of-the-art on the nuScenes dataset.