We present the use of self-supervised learning to explore and exploit large unlabeled datasets. Focusing on 42 million galaxy images from the latest data release of the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys, we first train a self-supervised model to distil low-dimensional representations that are robust to symmetries, uncertainties, and noise in each image. We then use the representations to construct and publicly release an interactive semantic similarity search tool. We demonstrate how our tool can be used to rapidly discover rare objects given only a single example, increase the speed of crowd-sourcing campaigns, flag bad data, and construct and improve training sets for supervised applications. While we focus on images from sky surveys, the technique is straightforward to apply to any scientific dataset of any dimensionality. The similarity search web app can be found at
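The similarity search step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: random vectors stand in for the learned representations, the search is brute-force cosine similarity in NumPy, and the names `build_index` and `most_similar` are hypothetical.

```python
import numpy as np


def build_index(representations):
    """L2-normalise each row so that a dot product equals cosine similarity."""
    reps = np.asarray(representations, dtype=np.float64)
    norms = np.linalg.norm(reps, axis=1, keepdims=True)
    return reps / np.clip(norms, 1e-12, None)


def most_similar(index, query_row, k=5):
    """Return the indices of the k rows most similar to row `query_row`."""
    scores = index @ index[query_row]   # cosine similarity to every row
    scores[query_row] = -np.inf         # exclude the query itself
    return np.argsort(-scores)[:k]      # best match first


# Assumption: random vectors stand in for the model's representations.
rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 32))
# Plant a near-duplicate of object 0 at position 1 to mimic a "rare object".
reps[1] = reps[0] + 0.01 * rng.normal(size=32)

index = build_index(reps)
neighbours = most_similar(index, query_row=0, k=3)
```

Given a single example (row 0 here), the first returned neighbour is the planted near-duplicate. A brute-force scan like this is only practical for small collections; a dataset of tens of millions of images would normally call for an approximate nearest-neighbour index instead.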
George Stein (UC Berkeley)
My research focuses on using machine learning to utilize large datasets. After completing my PhD in Astrophysics at the University of Toronto, where I worked on computational cosmology and machine learning, I joined UC Berkeley and Lawrence Berkeley National Laboratory to develop and apply machine learning methods to extract information from datasets across physics and astronomy.
Related Events (a corresponding poster, oral, or spotlight)
2021 : Session 2 | Contributed talk: George Stein, "Self-supervised similarity search for large scientific datasets"
Mon. Dec 13th 06:45 -- 07:00 PM Room