Your Dataset is a Multiset and You Should Compress it Like One
Daniel Severo · James Townsend · Ashish Khisti · Alireza Makhzani · Karen Ullrich

Tue Dec 14 09:00 AM -- 09:10 AM (PST)

Neural Compressors (NCs) are codecs that leverage neural networks and entropy coding to achieve competitive compression performance for images, audio, and other data types. These compressors exploit parallel hardware, and are particularly well suited to compressing i.i.d. batches of data. The average number of bits needed to represent each example is at least the well-known cross-entropy. However, the cross-entropy bound assumes the order of the compressed examples in a batch is preserved, which in many applications is not necessary. The number of bits used to implicitly store the order information is the logarithm of the number of unique permutations of the dataset. In this work, we present a method that reduces the bitrate of any codec by exactly the number of bits needed to store the order, at the expense of shuffling the dataset in the process. Conceptually, our method applies bits-back coding to a latent variable model with observed symbol counts (i.e., a multiset) and a latent permutation defining the ordering, and does not require retraining any models. We present experiments with both lossy off-the-shelf codecs (WebP) and lossless NCs. On Binarized MNIST, lossless NCs achieved savings of up to $7.6\%$, while adding only $10\%$ extra compute time.
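To make the "bits needed to store the order" concrete, here is a minimal sketch in Python. It is not the authors' method (which uses bits-back coding); it only evaluates the quantity the abstract describes, the base-2 logarithm of the number of unique permutations of a dataset, using the standard multinomial count $n! / \prod_i m_i!$, where $m_i$ is the number of copies of the $i$-th distinct example. The function name `order_bits` is hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def order_bits(dataset):
    """Return log2 of the number of unique orderings of `dataset`.

    The number of distinct permutations of a multiset with n items and
    multiplicities m_1, ..., m_k is n! / (m_1! * ... * m_k!).
    """
    counts = Counter(dataset)
    num_perms = math.factorial(len(dataset))
    for m in counts.values():
        # Each partial quotient is an integer, so floor division is exact.
        num_perms //= math.factorial(m)
    # math.log2 accepts arbitrarily large Python integers.
    return math.log2(num_perms)


if __name__ == "__main__":
    # Three distinct examples have 3! = 6 orderings.
    print(order_bits(["a", "b", "c"]))
    # Repeated examples reduce the number of distinct orderings to 3.
    print(order_bits(["a", "a", "b"]))
```

With all examples distinct, the quantity is $\log_2 n!$, so the possible saving per example grows roughly like $\log_2 n$ as the dataset gets larger; duplicates reduce it.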

#### Author Information

##### Karen Ullrich (Facebook AI Research)

Research scientist (s/h) at FAIR NY + collab. w/ Vector Institute. ❤️ Deep Learning + Information Theory. Previously, Machine Learning PhD at UoAmsterdam.