Session

Deep Learning

Wed 6 Dec. 10:20 - 10:35 PST

Oral
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

Wei Wen · Cong Xu · Feng Yan · Chunpeng Wu · Yandan Wang · Yiran Chen · Hai Li

High network communication cost for synchronizing gradients and parameters is a well-known bottleneck of distributed training. In this work, we propose TernGrad, which uses ternary gradients to accelerate distributed deep learning under data parallelism. Our approach requires only three numerical levels {-1, 0, 1}, which aggressively reduces communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad to AlexNet incurs no accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks.
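
The core quantization step is simple enough to sketch. Below is a minimal PyTorch rendering of an unbiased stochastic ternarizer, assuming a max-absolute-value scaler per layer; the function name is ours, and the full method additionally relies on the layer-wise ternarizing and gradient clipping described above.

```python
import torch

def ternarize(grad: torch.Tensor) -> torch.Tensor:
    # Layer-wise scaler: the maximum absolute gradient value.
    s = grad.abs().max()
    if s == 0:
        return grad.clone()
    # Keep each entry's sign with probability |g_i| / s and zero it otherwise;
    # this makes the quantized gradient an unbiased estimator of the original.
    mask = torch.bernoulli(grad.abs() / s)
    return s * grad.sign() * mask  # each entry lies in {-s, 0, +s}
```

Only the scaler s and the ternary mask need to be communicated, which is where the reduction in traffic comes from.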

Wed 6 Dec. 10:35 - 10:50 PST

Oral
Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer · Itay Hubara · Daniel Soudry

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance, known as the "generalization gap" phenomenon. Identifying the origin of this gap and closing it has remained an open problem. Contributions: We examine the initial high-learning-rate training phase. We find that the distance of the weights from their initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on a random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis, we conducted experiments showing empirically that the "generalization gap" stems from the relatively small number of updates rather than from the batch size, and can be completely eliminated by adapting the training regime. We further investigate different techniques for training models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization", which enables a significant decrease in the generalization gap without increasing the number of updates. To validate our findings, we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning the training of deep models and suggest they may not be optimal for achieving good generalization.
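
The "Ghost Batch Normalization" idea admits a short sketch: compute batch-norm statistics over small virtual ("ghost") batches inside the large batch, rather than over the full batch. A minimal training-mode version in PyTorch, with illustrative parameter names and without the running statistics needed at inference:

```python
import torch

def ghost_batch_norm(x, gamma, beta, ghost_size=32, eps=1e-5):
    # Normalize each small "ghost" chunk with its own statistics, recovering
    # some of the regularizing noise of small-batch training.
    out = []
    for chunk in x.split(ghost_size, dim=0):
        mu = chunk.mean(dim=0, keepdim=True)
        var = chunk.var(dim=0, unbiased=False, keepdim=True)
        out.append(gamma * (chunk - mu) / torch.sqrt(var + eps) + beta)
    return torch.cat(out, dim=0)
```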

Wed 6 Dec. 10:50 - 11:05 PST

Oral
End-to-end Differentiable Proving

Tim Rocktäschel · Sebastian Riedel

We introduce deep neural networks for end-to-end differentiable theorem proving that operate on dense vector representations of symbols. These neural networks are recursively constructed by following the backward chaining algorithm as used in Prolog. Specifically, we replace symbolic unification with a differentiable computation on vector representations of symbols using a radial basis function kernel, thereby combining symbolic reasoning with learning subsymbolic vector representations. The resulting neural network can be trained to infer facts from a given incomplete knowledge base using gradient descent. By doing so, it learns to (i) place representations of similar symbols in close proximity in a vector space, (ii) make use of such similarities to prove facts, (iii) induce logical rules, and (iv) use provided and induced logical rules for complex multi-hop reasoning. On four benchmark knowledge bases we demonstrate that this architecture outperforms ComplEx, a state-of-the-art neural link prediction model, while at the same time inducing interpretable function-free first-order logic rules.
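
The replacement of symbolic unification with a kernel over embeddings is the key differentiable ingredient. As a hedged sketch (the paper's exact kernel and bandwidth may differ), a standard RBF similarity between two symbol embeddings looks like:

```python
import torch

def soft_unify(u: torch.Tensor, v: torch.Tensor, mu: float = 1.0) -> torch.Tensor:
    # Differentiable stand-in for symbolic unification: identical embeddings
    # score 1.0 and the score decays smoothly with distance, so gradients can
    # flow back into the symbol representations during backward chaining.
    return torch.exp(-(u - v).pow(2).sum() / (2 * mu ** 2))
```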

Wed 6 Dec. 11:05 - 11:20 PST

Oral
Gradient descent GAN optimization is locally stable

Vaishnavh Nagarajan · J. Zico Kolter

Despite the growing prominence of generative adversarial networks (GANs), their optimization is still a poorly understood topic. In this paper, we analyze the "gradient descent" form of GAN optimization (i.e., the natural setting where we simultaneously take small gradient steps in both generator and discriminator parameters). We show that even though GAN optimization does not correspond to a convex-concave game, even for simple parameterizations, under proper conditions, equilibrium points of this optimization procedure are still locally asymptotically stable for the traditional GAN formulation. On the other hand, we show that the recently proposed Wasserstein GAN can have non-convergent limit cycles near equilibrium. Motivated by this stability analysis, we propose an additional regularization term for gradient descent GAN updates, which is able to guarantee local stability for both the WGAN and the traditional GAN, and which also shows practical promise in speeding up convergence and addressing mode collapse.
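
One way to picture such a regularizer (a sketch under our own reading; the paper specifies the exact term and where it enters the updates) is to penalize the generator for moving toward points where the discriminator's gradient is still large:

```python
import torch

def regularized_generator_loss(gan_objective, disc_params, eta=0.1):
    # Penalize the squared norm of the discriminator's gradient, discouraging
    # updates toward regions where the discriminator still wants to move.
    grads = torch.autograd.grad(gan_objective, disc_params, create_graph=True)
    penalty = sum(g.pow(2).sum() for g in grads)
    return gan_objective + eta * penalty
```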

Wed 6 Dec. 11:20 - 11:25 PST

Spotlight
f-GANs in an Information Geometric Nutshell

Richard Nock · Zac Cranko · Aditya K Menon · Lizhen Qu · Robert Williamson

Nowozin et al. showed last year how to scale the GAN principle to all f-divergences. The approach is elegant but falls short of a full description of the supervised game, and says nothing about the key player, the generator: for example, what does the generator actually fit if solving the GAN game means convergence in some space of parameters? How does that inform the generator's design, and how does it compare to the flourishing, essentially experimental literature on the subject? In this paper, we unveil the broad class of densities for which such convergence happens and show tight connections with the three other key GAN parameters: loss, game and model. In particular, we show that current deep architectures are able to factor a potentially very large number of such densities, displaying the power of deep architectures and their suitability for the f-GAN game. This result holds provided a sufficient condition on activation functions is satisfied, and it turns out to be satisfied by most popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures.
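
For orientation, the "old theorem" alluded to is the standard identity that, for regular exponential families, the KL divergence is a Bregman divergence of the log-partition function between natural parameters:

```latex
% For a regular exponential family
%   p_\theta(x) = h(x)\,\exp(\langle \theta, \phi(x) \rangle - F(\theta)),
% the KL divergence between two members satisfies
\mathrm{KL}(p_\theta \,\|\, p_{\theta'})
  = B_F(\theta' \,\|\, \theta)
  = F(\theta') - F(\theta) - \langle \theta' - \theta, \nabla F(\theta) \rangle .
```

The paper's contribution is a variational generalization of this identity; the form above is only the classical starting point.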

Wed 6 Dec. 11:25 - 11:30 PST

Spotlight
Unsupervised Image-to-Image Translation Networks

Ming-Yu Liu · Thomas Breuel · Jan Kautz

Most existing image-to-image translation frameworks, which map an image in one domain to a corresponding image in another, are based on supervised learning: pairs of corresponding images in the two domains are required for learning the translation function. This largely limits their applications, because capturing corresponding images in two different domains is often difficult. To address the issue, we propose the UNsupervised Image-to-image Translation (UNIT) framework, based on variational autoencoders and generative adversarial networks. It can learn the translation function without any corresponding image pairs. We show that this learning capability is enabled by combining a weight-sharing constraint with an adversarial objective, and we verify the effectiveness of the proposed framework through extensive experimental results.
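
The weight-sharing constraint can be sketched as two encoder/decoder pairs whose high-level layers are tied, so both domains pass through one shared latent space. Module names and sizes below are illustrative, not the paper's architecture:

```python
import torch.nn as nn

class SharedLatentTranslator(nn.Module):
    def __init__(self, ch=64, z_dim=128):
        super().__init__()
        # Domain-specific low-level layers.
        self.enc_a = nn.Sequential(nn.Conv2d(3, ch, 4, 2, 1), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Conv2d(3, ch, 4, 2, 1), nn.ReLU())
        # Shared high-level layers: both domains meet in one latent space.
        self.enc_shared = nn.Sequential(nn.Conv2d(ch, z_dim, 4, 2, 1), nn.ReLU())
        self.dec_shared = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch, 4, 2, 1), nn.ReLU())
        # Domain-specific output layers.
        self.dec_a = nn.ConvTranspose2d(ch, 3, 4, 2, 1)
        self.dec_b = nn.ConvTranspose2d(ch, 3, 4, 2, 1)

    def translate_a_to_b(self, x_a):
        z = self.enc_shared(self.enc_a(x_a))   # encode into the shared space
        return self.dec_b(self.dec_shared(z))  # decode into domain B
```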

Wed 6 Dec. 11:30 - 11:35 PST

Spotlight
The Numerics of GANs

Lars Mescheder · Sebastian Nowozin · Andreas Geiger

In this paper, we analyze the numerics of common algorithms for training Generative Adversarial Networks (GANs). Using the formalism of smooth two-player games, we analyze the gradient vector field associated with GAN training objectives. Our findings suggest that the convergence of current algorithms suffers from two factors: (i) the presence of eigenvalues of the Jacobian of the gradient vector field with zero real part, and (ii) eigenvalues with a large imaginary part. Using these findings, we design a new algorithm that overcomes some of these limitations and has better convergence properties. Experimentally, we demonstrate its superiority in training common GAN architectures and show convergence on GAN architectures that are notoriously hard to train.
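
The proposed algorithm, consensus optimization, roughly has each player follow its own gradient plus a term that shrinks the squared norm of the joint gradient vector field, damping the oscillations caused by large imaginary eigenvalues. A rough PyTorch sketch, assuming each loss depends on both players' parameters:

```python
import torch

def consensus_update(g_loss, d_loss, g_params, d_params, lr=1e-4, gamma=10.0):
    # v stacks both players' gradients; ||v||^2 vanishes exactly at
    # stationary points of the two-player game.
    v = torch.autograd.grad(g_loss, g_params, create_graph=True) \
      + torch.autograd.grad(d_loss, d_params, create_graph=True)
    reg = 0.5 * sum(g.pow(2).sum() for g in v)
    all_params = list(g_params) + list(d_params)
    reg_grads = torch.autograd.grad(reg, all_params)
    with torch.no_grad():
        for p, g_i, rg in zip(all_params, v, reg_grads):
            p -= lr * (g_i + gamma * rg)  # own gradient + consensus term
```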

Wed 6 Dec. 11:35 - 11:40 PST

Spotlight
Dual Discriminator Generative Adversarial Nets

Tu Nguyen · Trung Le · Hung Vu · Dinh Phung

We propose in this paper a novel approach to tackle the problem of mode collapse encountered in generative adversarial networks (GANs). Our idea is intuitive but proves to be very effective, especially in addressing some key limitations of GANs. In essence, it combines the Kullback-Leibler (KL) and reverse KL divergences into a unified objective function, thus exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density and capture multiple modes. We term our method dual discriminator generative adversarial nets (D2GAN), which, unlike a standard GAN, has two discriminators; together with a generator, it retains the analogy of a minimax game, wherein one discriminator rewards high scores for samples from the data distribution whilst the other, conversely, favors data from the generator, and the generator produces data to fool both discriminators. We develop a theoretical analysis to show that, given the maximal discriminators, optimizing the generator of D2GAN reduces to minimizing both the KL and reverse KL divergences between the data distribution and the distribution induced by the generator, hence effectively avoiding mode collapse. We conduct extensive experiments on synthetic and real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet), where we have made our best effort to compare D2GAN with the latest state-of-the-art GAN variants in comprehensive qualitative and quantitative evaluations. The experimental results demonstrate the competitive and superior performance of our approach in generating good-quality, diverse samples over baselines, and the capability of our method to scale up to the ImageNet database.
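
A hedged sketch of how such a two-discriminator objective can be wired up follows; the pairing of log and linear terms and the alpha/beta weights reflect our reading of the abstract, so treat the exact form as an assumption. Both discriminators are assumed to output strictly positive scores (e.g. via softplus):

```python
import torch

def d2gan_losses(d1_real, d1_fake, d2_real, d2_fake, alpha=1.0, beta=1.0):
    # D1 gives a high (log) score to real data and is penalized linearly on
    # fakes; D2 mirrors this, giving a high (log) score to generated data.
    j = (alpha * torch.log(d1_real) - d1_fake
         - d2_real + beta * torch.log(d2_fake)).mean()
    return -j, j  # discriminators maximize J; the generator minimizes it
```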

Wed 6 Dec. 11:40 - 11:45 PST

Spotlight
Bayesian GANs

Yunus Saatci · Andrew Wilson

Generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and other data that are hard to model with an explicit likelihood. We present a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs. We use stochastic gradient Hamiltonian Monte Carlo to marginalize the weights of the generator and discriminator networks. The resulting approach is straightforward and obtains good performance without standard interventions such as feature matching or mini-batch discrimination. By exploring an expressive posterior over the parameters of the generator, the Bayesian GAN avoids mode collapse, produces interpretable candidate samples with notable variability, and in particular provides state-of-the-art quantitative results for semi-supervised learning on benchmarks including SVHN, CelebA, and CIFAR-10, outperforming DCGAN, Wasserstein GANs, and DCGAN ensembles.
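
The sampling machinery is essentially SGD with friction and injected noise. A minimal sketch of one stochastic-gradient HMC step, assuming grads holds mini-batch gradients of the negative log posterior; practical details such as gradient rescaling by dataset size are omitted:

```python
import torch

def sghmc_step(params, grads, momenta, lr=1e-4, friction=0.01):
    # Noisy, friction-damped momentum update: the iterates wander around and
    # sample from the posterior instead of collapsing to a single optimum.
    with torch.no_grad():
        for p, g, m in zip(params, grads, momenta):
            noise = torch.randn_like(p) * (2 * friction * lr) ** 0.5
            m.mul_(1 - friction).add_(-lr * g + noise)
            p.add_(m)
```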

Wed 6 Dec. 11:45 - 11:50 PST

Spotlight
Approximation and Convergence Properties of Generative Adversarial Learning

Shuang Liu · Olivier Bousquet · Kamalika Chaudhuri

Generative adversarial networks (GANs) approximate a target data distribution by jointly optimizing an objective function through a "two-player game" between a generator and a discriminator. Despite their empirical success, however, two very basic questions about how well they can approximate the target distribution remain unanswered. First, it is not known how restricting the discriminator family affects the approximation quality. Second, while a number of different objective functions have been proposed, we do not understand when convergence to the global minimum of the objective function leads to convergence to the target distribution under various notions of distributional convergence. In this paper, we address these questions in a broad and unified setting by defining a notion of adversarial divergences that includes a number of recently proposed objective functions. We show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Additionally, we show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence, thus generalizing previous results.
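
The shape of such a divergence can be sketched as follows (the paper's formal definition carries additional conditions we elide):

```latex
% Given a class \mathcal{F} of functions over pairs, an adversarial
% divergence between distributions \mu and \nu takes the form
\tau(\mu \,\|\, \nu) \;=\; \sup_{f \in \mathcal{F}}
  \ \mathbb{E}_{(x,y) \sim \mu \otimes \nu}\,\bigl[f(x,y)\bigr].
% The original GAN objective is recovered (up to constants) by choosing
%   f(x,y) = \log d(x) + \log\bigl(1 - d(y)\bigr)
% with d ranging over the discriminator family.
```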

Wed 6 Dec. 11:50 - 11:55 PST

Spotlight
Dualing GANs

Yujia Li · Alex Schwing · Kuan-Chieh Wang · Richard Zemel

Generative adversarial nets (GANs) are a promising technique for modeling a distribution from samples. It is, however, well known that GAN training suffers from instability due to the nature of its maximin formulation. In this paper, we explore ways to tackle the instability problem by dualizing the discriminator. We start from linear discriminators, in which case conjugate duality provides a mechanism to reformulate the maximin objective into a maximization problem, such that both the generator and the discriminator of this 'dualing GAN' act in concert. We then demonstrate how to extend this intuition to non-linear formulations. For GANs with linear discriminators our approach is able to remove the instability in training, while for GANs with non-linear discriminators it provides an alternative to the commonly used GAN training algorithm.
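
To see why dualization helps in the linear case, consider the following assumption-laden sketch (not the paper's exact derivation): with a linear discriminator and an l2 penalty, the inner maximization has a closed form, so the maximin problem collapses into a single minimization for the generator:

```latex
% Linear discriminator f_w(x) = w^\top \phi(x) with an \ell_2 penalty:
\max_w \ \mathbb{E}_{x \sim p_d}\bigl[w^\top \phi(x)\bigr]
      - \mathbb{E}_{x \sim p_g}\bigl[w^\top \phi(x)\bigr]
      - \tfrac{\lambda}{2}\,\lVert w \rVert^2
  \;=\; \tfrac{1}{2\lambda}\,
        \bigl\lVert \mathbb{E}_{p_d}[\phi(x)] - \mathbb{E}_{p_g}[\phi(x)] \bigr\rVert^2 ,
% a moment-matching objective the generator can minimize directly,
% with no inner adversary left to destabilize training.
```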

Wed 6 Dec. 11:55 - 12:00 PST

Spotlight
Generalizing GANs: A Turing Perspective

Roderich Gross · Yue Gu · Wei Li · Melvin Gauci

Recently, a new class of machine learning algorithms has emerged, where models and discriminators are generated in a competitive setting. The most prominent example is Generative Adversarial Networks (GANs). In this paper we examine how these algorithms relate to the famous Turing test, and derive what, from a Turing perspective, can be considered their defining features. Based on these features, we outline directions for generalizing GANs, resulting in the family of algorithms referred to as Turing Learning. One such direction is to allow the discriminators to interact with the processes from which the data samples are obtained, making them "interrogators", as in the Turing test. We validate this idea using two case studies. In the first case study, a computer infers the behavior of an agent while controlling its environment. In the second case study, a robot infers its own sensor configuration while controlling its movements. The results confirm that allowing discriminators to interrogate improves the accuracy of the learned models.