Abstract:
We use video sequences produced by tracking as training data to learn invariant features. These features are spatial instead of temporal, and well suited to extract from still images. With a temporal coherence objective, a multi-layer neural network encodes invariance that grow increasingly complex with layer hierarchy. Without fine-tuning with labels, we achieve competitive performance on five non-temporal image datasets and state-of-the-art classification accuracy 61% on STL-10 object recognition dataset.
Chat is not available.
Successful Page Load