Outstanding Paper

[ Hall J ] Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time. We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and hyperparameters (e.g. momentum coefficients). We can even recursively apply the method to its own hyper-hyperparameters, and so on ad infinitum. As these towers of optimizers grow taller, they become less sensitive to the initial choice of hyperparameters. We present experiments validating this for MLPs, CNNs, and RNNs. Finally, we provide a simple PyTorch implementation of this algorithm (see http://people.csail.mit.edu/kach/gradientdescenttheultimateoptimizer).
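To make the idea of a step-size hypergradient concrete, here is a minimal sketch of the hand-derived rule for plain SGD that the abstract attributes to earlier work. It is not the automatic, backpropagation-based method the paper proposes. The toy quadratic loss, the function names, and the hyper step size `kappa` are assumptions made for illustration.

```python
def grad(w):
    # Gradient of a toy quadratic loss 0.5 * (w - 3)^2 (assumed for illustration).
    return w - 3.0


def sgd_with_hypergradient(w0, alpha0, kappa, steps):
    """Run SGD while also adapting the step size alpha by gradient descent.

    Since w_t = w_{t-1} - alpha * g_{t-1}, the derivative of loss(w_t) with
    respect to alpha is -g_t * g_{t-1}. Descending on alpha therefore adds
    kappa * g_t * g_{t-1} to it at every step.
    """
    w, alpha = w0, alpha0
    prev_grad = 0.0  # no previous gradient yet, so the first alpha update is zero
    for _ in range(steps):
        g = grad(w)
        alpha = alpha + kappa * g * prev_grad  # hypergradient step on alpha
        w = w - alpha * g                      # ordinary SGD step on w
        prev_grad = g
    return w, alpha


def sgd_fixed(w0, alpha, steps):
    # Baseline: the same SGD loop with a step size that never changes.
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


w_adaptive, alpha_final = sgd_with_hypergradient(0.0, 0.01, 0.01, 100)
w_fixed = sgd_fixed(0.0, 0.01, 100)
```

With a deliberately small initial step size, the adaptive run grows `alpha` while successive gradients point the same way, so it ends much closer to the minimum at 3 than the fixed-step baseline does.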
Outstanding Paper

[ Hall J ] Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To relax this assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper, we investigate the probably approximately correct (PAC) learning theory of OOD detection, which researchers have posed as an open problem. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we also offer theoretical support for several representative OOD detection works based on our OOD theory.
Outstanding Paper

We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pretrained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after retraining with our proposed improvements to a new SOTA of 1.36.
Outstanding Paper

[ Hall J ]
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, a 7% improvement over Gopher.
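The equal-scaling rule can be checked with simple arithmetic. The sketch below assumes that training compute grows in proportion to the product of parameter count and training tokens; that proportionality is not stated in the abstract. Gopher's token count is kept as a relative unit because the abstract does not give it, and the function names are hypothetical.

```python
import math


def relative_compute(params, tokens):
    # Assumption: training compute is proportional to params * tokens.
    return params * tokens


def scale_equally(params, tokens, compute_multiplier):
    """Split a larger compute budget equally between model size and data.

    Under the proportionality assumption above, multiplying both params and
    tokens by sqrt(compute_multiplier) multiplies compute by compute_multiplier.
    """
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Parameter counts come from the abstract; token counts are in relative units.
gopher_params = 280e9
chinchilla_params = 70e9
gopher_tokens = 1.0                        # relative unit, not an actual count
chinchilla_tokens = 4.0 * gopher_tokens    # "4x more data"

compute_ratio = (
    relative_compute(chinchilla_params, chinchilla_tokens)
    / relative_compute(gopher_params, gopher_tokens)
)

# Doubling both model size and tokens corresponds to a 4x compute budget.
doubled_params, doubled_tokens = scale_equally(1.0, 1.0, 4.0)
```

Under this assumption, a model with one quarter of Gopher's parameters and four times its data has a `compute_ratio` of 1, which matches the abstract's statement that the two models use the same compute budget.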

Outstanding Paper

[ Hall J ] Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this improved scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling in practice on ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high-performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. …
Outstanding Paper

We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. We find a critical scaling regime for the step-size below which this "effective dynamics" matches gradient flow for the population loss, but at which a new correction term appears that changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to suboptimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations.
Outstanding Paper

[ Hall J ]
Societal and real-world considerations such as robustness, fairness, social welfare and multi-agent trade-offs have given rise to multi-distribution learning paradigms, such as collaborative [Blum et al. 2017], group distributionally robust [Sagawa et al. 2019], and fair federated learning [Mohri et al. 2019]. In each of these settings, a learner seeks to minimize its worst-case loss over a set of $n$ predefined distributions, while using as few samples as possible. In this paper, we establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity. Importantly, our sample complexity bounds exceed the sample complexity of learning a single distribution only by an additive factor of $\frac{n\log(n)}{\epsilon^2}$. These improve upon the best known sample complexity of agnostic federated learning by Mohri et al. 2019 by a multiplicative factor of $n$, the sample complexity of collaborative learning by Nguyen and Zakynthinou 2018 by a multiplicative factor $\frac{\log(n)}{\epsilon^3}$, and give the first sample complexity bounds for the group DRO objective of Sagawa et al. 2019. To achieve optimal sample complexity, our algorithms learn to sample and learn from distributions on demand. Our algorithm design and analysis extend stochastic optimization techniques to solve zero-sum games in a …

Outstanding Paper

[ Hall J ] We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g., T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQGAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
Outstanding Paper

[ Hall J ] Gradient estimation, i.e., approximating the gradient of an expectation with respect to the parameters of a distribution, is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.
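For readers unfamiliar with the baseline being improved, the following is a minimal sketch of the plain REINFORCE leave-one-out estimator for independent Bernoulli variables parameterized by logits. It does not include the Stein-operator control variates introduced in the paper. The function names and the toy target function used afterwards are assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rloo_gradient(logits, f, num_samples, rng):
    """Estimate d/d(logits) of E[f(x)] with x_i ~ Bernoulli(sigmoid(logit_i)).

    Each sample's value f(x) is centered by the mean of f over the *other*
    samples (the leave-one-out baseline), which keeps the estimator unbiased
    while reducing variance. Requires num_samples >= 2.
    """
    probs = [sigmoid(z) for z in logits]
    samples = [
        [1.0 if rng.random() < p else 0.0 for p in probs]
        for _ in range(num_samples)
    ]
    values = [f(x) for x in samples]
    total = sum(values)
    grad = [0.0] * len(logits)
    for x, fx in zip(samples, values):
        baseline = (total - fx) / (num_samples - 1)  # mean of the other samples
        for i, p in enumerate(probs):
            # Score function: d/d(logit_i) log p(x_i) = x_i - p_i.
            grad[i] += (fx - baseline) * (x[i] - p) / num_samples
    return grad


def toy_target(x):
    # Hypothetical target function: sum of squared distances to 0.45.
    return sum((xi - 0.45) ** 2 for xi in x)


rng = random.Random(0)
estimate = rloo_gradient([0.0, 0.0, 0.0], toy_target, num_samples=4, rng=rng)
```

For this toy target the exact gradient at zero logits is 0.025 in every coordinate, so averaging many independent calls to `rloo_gradient` should approach that value. A single call with four samples is noisy, and that variance is what the paper's control variates aim to reduce.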
Outstanding Paper

[ Hall J ] Strong inductive biases give humans the ability to quickly learn to perform a variety of tasks. Although meta-learning is a method to endow neural networks with useful inductive biases, agents trained by meta-learning may sometimes acquire very different strategies from humans. We show that co-training these agents on predicting representations from natural language task descriptions and programs induced to generate such tasks guides them toward more human-like inductive biases. Human-generated language descriptions and program induction models that add new learned primitives both contain abstract concepts that can compress description length. Co-training on these representations results in more human-like behavior in downstream meta-reinforcement learning agents than less abstract controls (synthetic language descriptions, program induction without learned primitives), suggesting that the abstraction supported by these representations is key.
Outstanding Paper

[ Hall J ] Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pretrained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite, knowledge bases, algorithm implementation, and pretrained models (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.
Outstanding Paper

[ Hall J ] Score-based generative models (SGMs) are a powerful class of generative models that exhibit remarkable empirical performance. Score-based generative modelling (SGM) consists of a …
Outstanding Paper

[ Hall J ] Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture, and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrated the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative enhancement for Recall@1 on the NQ320k dataset and R-Precision on the TriviaQA dataset, respectively, compared to the best baseline method.
Outstanding Paper

[ Hall J ] Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of ProcTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained using only RGB images on ProcTHOR, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the presently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also demonstrate strong 0-shot results on these benchmarks, via pretraining on ProcTHOR with no fine-tuning on the downstream benchmark, often beating previous state-of-the-art systems that access the downstream training data.
Outstanding Paper

[ Hall J ] Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multimodal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection.