Your Dataset is a Multiset and You Should Compress it Like One
Daniel Severo · James Townsend · Ashish Khisti · Alireza Makhzani · Karen Ullrich
Event URL: https://openreview.net/forum?id=vjrsNCu8Km
Neural Compressors (NCs) are codecs that leverage neural networks and entropy coding to achieve competitive compression performance for images, audio, and other data types. These compressors exploit parallel hardware, and are particularly well suited to compressing i.i.d. batches of data. The average number of bits needed to represent each example is at least the well-known cross-entropy. However, the cross-entropy bound assumes the order of the compressed examples in a batch is preserved, which in many applications is not necessary. The number of bits used to implicitly store the order information is the logarithm of the number of unique permutations of the dataset. In this work, we present a method that reduces the bitrate of any codec by exactly the number of bits needed to store the order, at the expense of shuffling the dataset in the process. Conceptually, our method applies bits-back coding to a latent variable model with observed symbol counts (i.e. multiset) and a latent permutation defining the ordering, and does not require retraining any models. We present experiments with both lossy off-the-shelf codecs (WebP) as well as lossless NCs. On Binarized MNIST, lossless NCs achieved savings of up to $7.6\%$, while adding only $10\%$ extra compute time.
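As a rough illustration of the abstract's claim, the bits implicitly spent on ordering equal the base-2 logarithm of the number of distinct permutations of the dataset. Below is a minimal sketch (not code from the paper) that computes this upper bound on the savings for a toy batch; the helper name order_information_bits is hypothetical, and the example assumes examples are hashable so repeats can be counted.

```python
# Sketch: maximum bits saved by treating an ordered batch as a multiset.
# Ordering cost = log2( n! / (c_1! * c_2! * ...) ), where c_i are the
# multiplicities of repeated examples in the batch.
from collections import Counter
from math import lgamma, log

def order_information_bits(dataset):
    """Bits implicitly spent on ordering = log2(# distinct permutations)."""
    n = len(dataset)
    log2_perms = lgamma(n + 1) / log(2)           # log2(n!)
    for c in Counter(dataset).values():
        log2_perms -= lgamma(c + 1) / log(2)      # subtract log2(c_i!)
    return log2_perms

# Toy batch of 8 symbols with repeats.
batch = ["a", "a", "b", "b", "b", "c", "d", "d"]
print(f"up to {order_information_bits(batch):.2f} bits can be saved "
      f"by discarding the order of this batch")
```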
Author Information
Daniel Severo (University of Toronto)
James Townsend (University College London)
Ashish Khisti (University of Toronto)
Alireza Makhzani (University of Toronto)
Karen Ullrich (Facebook AI Research)
Research scientist (s/h) at FAIR NY, collaborating with the Vector Institute. ❤️ Deep Learning + Information Theory. Previously, Machine Learning PhD at the University of Amsterdam.
Related Events (a corresponding poster, oral, or spotlight)
- 2021: Your Dataset is a Multiset and You Should Compress it Like One
  Tue, Dec 14th, 05:00 -- 05:10 PM
More from the Same Authors
- 2021 Spotlight: Lossy Compression for Lossless Prediction
  Yann Dubois · Benjamin Bloem-Reddy · Karen Ullrich · Chris Maddison
- 2021: Adaptive Optimization with Examplewise Gradients
  Julius Kunze · James Townsend · David Barber
- 2021: Few Shot Image Generation via Implicit Autoencoding of Support Sets
  Shenyang Huang · Kuan-Chieh Wang · Guillaume Rabusseau · Alireza Makhzani
- 2021: Cross-Domain Lossy Compression as Optimal Transport with an Entropy Bottleneck
  Huan Liu · George Zhang · Jun Chen · Ashish Khisti
- 2022: Action Matching: A Variational Method for Learning Stochastic Dynamics from Samples
  Kirill Neklyudov · Daniel Severo · Alireza Makhzani
- 2021 Poster: Universal Rate-Distortion-Perception Representations for Lossy Compression
  George Zhang · Jingjing Qian · Jun Chen · Ashish Khisti
- 2021 Poster: Lossy Compression for Lossless Prediction
  Yann Dubois · Benjamin Bloem-Reddy · Karen Ullrich · Chris Maddison
- 2021 Poster: Variational Model Inversion Attacks
  Kuan-Chieh Wang · Yan Fu · Ke Li · Ashish Khisti · Richard Zemel · Alireza Makhzani
- 2020 Poster: Coded Sequential Matrix Multiplication For Straggler Mitigation
  Nikhil Krishnan Muralee Krishnan · Seyederfan Hosseini · Ashish Khisti
- 2019 Poster: Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates
  Jeffrey Negrea · Mahdi Haghifam · Gintare Karolina Dziugaite · Ashish Khisti · Daniel Roy
- 2017 Poster: PixelGAN Autoencoders
  Alireza Makhzani · Brendan J Frey
- 2017 Poster: Bayesian Compression for Deep Learning
  Christos Louizos · Karen Ullrich · Max Welling
- 2015 Poster: Winner-Take-All Autoencoders
  Alireza Makhzani · Brendan J Frey